nf-core's fetchngs: fetching metadata and raw FastQ files from public databases

2021 11/11: tweets added

nf-core/fetchngs is a bioinformatics pipeline that fetches metadata and raw FastQ files from public databases. It currently supports SRA / ENA / GEO IDs (see the usage documentation).

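The pipeline is built with Nextflow, a workflow tool for running tasks across multiple compute infrastructures, which makes it very portable. It uses Docker/Singularity containers, so installation is simple and results are highly reproducible. The Nextflow DSL2 implementation uses one container per process, which makes software dependencies much easier to maintain and update.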

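On release, automated continuous integration tests run the pipeline on a full-sized dataset on AWS cloud infrastructure. This ensures that the pipeline runs on AWS, that its default resource allocations are sensible for real-world datasets, and that results are stored persistently so they can be benchmarked between pipeline releases and against other analysis sources. The results of the full-sized tests can be viewed on the nf-core website.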

Pipeline release! nf-core/fetchngs v1.6 (Pipeline to fetch metadata and raw FastQ files from public and private databases)

See the changelog: https://t.co/bTdNCvvpz5
— nf-core (@nf_core) May 17, 2022

Pipeline release! nf-core/fetchngs v1.4 (Pipeline to fetch metadata and raw FastQ files from public and private databases)

See the changelog: https://t.co/g30m1Sq9MY
— nf-core (@nf_core) November 9, 2021

Pipeline release! nf-core/fetchngs v1.0 (Pipeline to fetch metadata and raw FastQ files from public databases)

See the changelog: https://t.co/o7D2KVL8wO
— nf-core (@nf_core) June 8, 2021

Here I test it on a local machine (Ubuntu 18.04 LTS).

Installation

Install Nextflow (>=21.04.0)
Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (please only use Conda as a last resort; see docs)
Download the pipeline and test it on a minimal dataset with a single command:

GitHub
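The prerequisites above can be checked before the first run. The following is a minimal sketch, not part of the pipeline: the `version_ge` helper and the messages are illustrative, and it assumes that `nextflow -version` prints a line of the form "version X.Y.Z build N" and that `sort -V` (GNU coreutils) is available.

```shell
#!/usr/bin/env bash
# Minimal preflight sketch; version_ge and the messages are illustrative only.

# Succeeds when version $1 >= version $2 (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="21.04.0"

if command -v nextflow >/dev/null 2>&1; then
  # Assumption: the version banner contains "version X.Y.Z".
  installed=$(nextflow -version 2>/dev/null \
    | grep -oE 'version [0-9]+\.[0-9]+\.[0-9]+' | head -n 1 | cut -d' ' -f2 || true)
  if [ -n "$installed" ] && version_ge "$installed" "$required"; then
    echo "Nextflow $installed is new enough (>= $required)"
  else
    echo "Nextflow ${installed:-unknown} does not satisfy >= $required"
  fi
else
  echo "nextflow not found on PATH"
fi

# Report the first container engine found on PATH, if any.
for engine in docker singularity podman; do
  if command -v "$engine" >/dev/null 2>&1; then
    echo "container engine found: $engine"
    break
  fi
done
```

The check only looks for executables on PATH; it does not verify that the Docker daemon is running or that the current user is allowed to use it.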

Test run

Use the test profile. Here it is run together with the docker profile.

nextflow run nf-core/fetchngs -profile test,docker

Output

[Image: screenshot of the output]

samplesheet.csv

[Image: screenshot of samplesheet.csv]
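To get a quick look at samplesheet.csv without opening it in a spreadsheet, something like the following sketch may help. The `list_columns` helper is hypothetical, and the default path is an assumption about where the run wrote the file, so adjust it to your own output directory. The header is split naively on commas, which is only correct if no column name itself contains a comma.

```shell
#!/usr/bin/env bash
# Sketch for inspecting a samplesheet; list_columns is an illustrative helper.

# Print the header columns of a CSV, one per line, with quotes and CRs removed.
list_columns() {
  head -n 1 "$1" | tr ',' '\n' | tr -d '"\r'
}

# Assumed location; pass the real path as the first argument if it differs.
samplesheet="${1:-results/samplesheet/samplesheet.csv}"

if [ -f "$samplesheet" ]; then
  list_columns "$samplesheet"
  # Data rows = total lines minus the header line.
  echo "rows: $(( $(wc -l < "$samplesheet") - 1 ))"
else
  echo "not found: $samplesheet" >&2
fi
```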

Actual run

Use "--input" to specify a text file that lists SRA, ENA, GEO, or other supported IDs, one ID per line.

nextflow run nf-core/fetchngs --input input.txt --outdir output_dir -profile docker

--input File containing SRA/ENA/GEO identifiers one per line to download their associated metadata and FastQ files.
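As a rough illustration of the one-ID-per-line format, the sketch below writes an input.txt and runs a simple sanity check on it. The accessions are placeholders, not real records, and the pattern (letters followed by digits) is my own simplifying assumption, not the pipeline's actual validation rule.

```shell
#!/usr/bin/env bash
# Sketch: build an ID list and check its shape before passing it to --input.

# Placeholder accessions for illustration only; replace with your own IDs.
cat > input.txt <<'EOF'
SRR0000001
ERR0000002
GSM0000003
EOF

# Assumed sanity check: each line is a single token of letters then digits,
# which catches stray spaces, commas, or several IDs on one line.
bad=$(grep -cvE '^[A-Za-z]+[0-9]+$' input.txt || true)
echo "lines: $(wc -l < input.txt | tr -d ' '), malformed: ${bad}"
```

A nonzero "malformed" count suggests the file should be cleaned up first; a file that passes this check can still contain IDs the pipeline does not recognize.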

Supported IDs

DDBJ IDs are supported as of v1.4. See the release notes for details.

Citation

fetchngs » nf-core

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13.