grabseqs: bulk download of sequencing data from SRA and other repositories

2020 4/1: Corrected the title and fixed typos

2020 10/24: Revised the instructions to create a virtual environment and install into it

2021 5/23: Changed conda => mamba

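High-throughput sequencing is a powerful technique for addressing biological questions. Grabseqs streamlines access to publicly available metagenomic data by providing a single, easy-to-use interface for downloading data and metadata from multiple repositories, including the Sequence Read Archive (SRA), the Metagenomics Rapid Annotation through Subsystems Technology (MG-RAST) server, and iMicrobe. With a single grabseqs command, users can download data and metadata for any number of samples or projects from any of these repositories in a standardized format.
Grabseqs is an open-source tool implemented in Python and released under the MIT license. The source code is freely available from https://github.com/louiejtaylor/grabseqs, the Python Package Index (PyPI), and the Anaconda Cloud repository.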

Installation

Tested on Ubuntu 18.04 LTS, installed with pip.

Dependencies

Python 3 (external packages req'd: requests, requests-html, pandas, fake-useragent)

Source: GitHub

# grabseqs can also be installed with pip (on macOS, only pip is supported)
pip install grabseqs

#bioconda
mamba create -n grabseqs python=3.8 -y
conda activate grabseqs
mamba install grabseqs -c louiejtaylor -c bioconda -c conda-forge -y

# Install pigz and sra-tools as well if they are not already available
mamba install pigz
mamba install -c bioconda -y sra-tools
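
After installing, it can help to confirm that the executables mentioned in this post are actually on the PATH of the active environment. The following is a minimal sketch, not part of grabseqs; the helper name find_tools and the list REQUIRED_TOOLS are hypothetical, and the tool names are taken from the install commands and help text in this post.

```python
import shutil

# Executables named in this post: grabseqs itself, pigz, and the two
# sra-tools binaries mentioned in the grabseqs help text.
REQUIRED_TOOLS = ["grabseqs", "pigz", "fasterq-dump", "fastq-dump"]


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

Any tool reported as NOT FOUND can be installed with the mamba commands above.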

> grabseqs --help

usage: grabseqs [-h] [--version] {sra,imicrobe,mgrast} ...

Download metagenomic sequences from public datasets.

positional arguments:
  {sra,imicrobe,mgrast}
                        repositories available
    sra                 download from SRA
    imicrobe            download from iMicrobe
    mgrast              download from MG-RAST

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

> grabseqs sra -h

usage: grabseqs sra [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES] [-t THREADS]
                    [-f] [-l] [--parse_run_ids] [--use_fastq_dump]
                    [--custom_fqdump_args CUSTOM_FQD_ARGS] [--no_parsing]
                    id [id ...]

positional arguments:
  id                    One or more BioProject, ERR/SRR or ERP/SRP number(s)

optional arguments:
  -h, --help            show this help message and exit
  -m METADATA           filename in which to save SRA metadata (.csv format,
                        relative to OUTDIR)
  -o OUTDIR             directory in which to save output. created if it
                        doesn't exist
  -r RETRIES            number of times to retry download
  -t THREADS            threads to use (for fasterq-dump/pigz)
  -f                    force re-download of files
  -l                    list (but do not download) samples to be grabbed
  --parse_run_ids       parse SRR/ERR identifers (do not pass straight to
                        fasterq-dump)
  --use_fastq_dump      use legacy fastq-dump instead of fasterq-dump (no
                        multithreaded downloading)
  --custom_fqdump_args CUSTOM_FQD_ARGS
                        'string' containing args to pass to fast(er)q-dump
  --no_parsing          Legacy option to not parse SRR IDs (now default)

> grabseqs mgrast -h

usage: grabseqs mgrast [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                       [-t THREADS] [-f] [-l]
                       rastid [rastid ...]

positional arguments:
  rastid       One or more MG-RAST project or sample identifiers
               (mgp####/mgm######)

optional arguments:
  -h, --help   show this help message and exit
  -m METADATA  filename in which to save metadata (.csv format, relative to
               OUTDIR)
  -o OUTDIR    directory in which to save output. created if it doesn't exist
  -r RETRIES   number of times to retry download
  -t THREADS   threads to use (for pigz)
  -f           force re-download of files
  -l           list (but do not download) samples to be grabbed

> grabseqs imicrobe -h

usage: grabseqs imicrobe [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES]
                         [-t THREADS] [-f] [-l]
                         imicrobeid [imicrobeid ...]

positional arguments:
  imicrobeid   One or more iMicrobe project or sample identifiers (p##/s###)

optional arguments:
  -h, --help   show this help message and exit
  -m METADATA  filename in which to save metadata (.csv format, relative to
               OUTDIR)
  -o OUTDIR    directory in which to save output. created if it doesn't exist
  -r RETRIES   number of times to retry download
  -t THREADS   threads to use (for pigz)
  -f           force re-download of files
  -l           list (but do not download) samples to be grabbed

Usage

Download every sample (all runs) in an SRA project, along with the metadata.

grabseqs sra -t 8 -m metadata.csv -o outdir SRP#######

-o directory in which to save output. created if it doesn't exist
-r number of times to retry download
-t threads to use (for fasterq-dump/pigz)
-m filename in which to save SRA metadata (.csv format, relative to OUTDIR)
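
When downloads are launched from a script rather than typed by hand, the same command can be assembled programmatically. The sketch below is one possible wrapper, not part of grabseqs: build_sra_command and run_sra_download are hypothetical names, and only the flags shown in the `grabseqs sra -h` output above (-o, -m, -t, -r, -l) are used.

```python
import subprocess


def build_sra_command(ids, outdir, metadata=None, threads=None,
                      retries=None, list_only=False):
    """Build the argument list for `grabseqs sra` (hypothetical helper)."""
    ids = list(ids)
    if not ids:
        raise ValueError("at least one identifier is required")
    cmd = ["grabseqs", "sra", "-o", outdir]
    if metadata is not None:
        cmd += ["-m", metadata]       # CSV path, relative to outdir
    if threads is not None:
        cmd += ["-t", str(threads)]   # threads for fasterq-dump/pigz
    if retries is not None:
        cmd += ["-r", str(retries)]   # number of download retries
    if list_only:
        cmd.append("-l")              # list samples without downloading
    return cmd + ids


def run_sra_download(ids, outdir, **options):
    """Run grabseqs and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_sra_command(ids, outdir, **options), check=True)
```

For example, build_sra_command(["SRR6032562"], "outdir", metadata="metadata.csv", threads=8) produces the same arguments as the single-run command used in this post. Passing the arguments as a list avoids shell quoting problems.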

Multiple identifiers can be given at once. SRP/ERP projects, BioProjects, and individual runs can also be mixed in a single command.

grabseqs sra -t 8 -m metadata.csv -o outdir -r 2 SRR######## ERP####### PRJNA######## ERR########
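
Because a single command can mix runs, studies, and BioProjects, it can be useful to sort a long identifier list by type before passing it along, for instance to catch typos. This is a rough sketch: the function names are hypothetical, and the regular expressions are my assumption about accession formats (SRR/ERR runs, SRP/ERP studies, and BioProject accessions such as PRJNA or PRJEB followed by digits), not something grabseqs exposes.

```python
import re

# Assumed accession patterns; these are not taken from grabseqs itself.
PATTERNS = {
    "run": re.compile(r"^[SE]RR\d+$"),           # e.g. SRR6032562
    "study": re.compile(r"^[SE]RP\d+$"),         # SRP/ERP project
    "bioproject": re.compile(r"^PRJ[A-Z]{2}\d+$"),  # e.g. PRJNA590266
}


def classify(identifier):
    """Return 'run', 'study', 'bioproject', or 'unknown' for one identifier."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(identifier.strip()):
            return kind
    return "unknown"


def group_identifiers(identifiers):
    """Group identifiers by kind, preserving input order within each group."""
    groups = {}
    for identifier in identifiers:
        groups.setdefault(classify(identifier), []).append(identifier)
    return groups
```

Anything that lands in the "unknown" group is worth checking by hand before starting a large download.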

Download the sequencing data for a single run only.

grabseqs sra -t 8 -m metadata.csv -o outdir SRR6032562

# Multiple runs
grabseqs sra -t 8 -m metadata.csv -o outdir SRR6032562 SRR6032563 SRR6032564

Dry run

grabseqs sra -l SRP########

-l list (but do not download) samples to be grabbed

> grabseqs sra -l PRJNA590266

SRR10488335.fastq.gz
SRR10488336.fastq.gz
SRR10488337.fastq.gz
SRR10488338.fastq.gz
SRR10488339.fastq.gz
SRR10488340.fastq.gz
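
The dry-run listing is easy to turn into a list of run accessions, for example to count runs before committing to a download. The sketch below parses text in the format shown above; parse_listing is a hypothetical helper, and the optional _1/_2 suffix handling is an assumption about how paired-end files might be named rather than something shown in this post.

```python
import re

# Matches names like SRR10488335.fastq.gz; the optional _1/_2 mate suffix
# is an assumption for paired-end data.
_FASTQ_NAME = re.compile(r"^(?P<run>[A-Za-z]+\d+)(?:_[12])?\.fastq\.gz$")


def parse_listing(text):
    """Return the sorted, unique run accessions named in a dry-run listing."""
    runs = set()
    for line in text.splitlines():
        match = _FASTQ_NAME.match(line.strip())
        if match:  # blank lines and unrelated output are skipped
            runs.add(match.group("run"))
    return sorted(runs)


if __name__ == "__main__":
    listing = "SRR10488335.fastq.gz\nSRR10488336.fastq.gz\n"
    print(parse_listing(listing))  # ['SRR10488335', 'SRR10488336']
```

The listing text could be captured by running the `grabseqs sra -l` command with subprocess and reading its standard output.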

To download an MG-RAST project, use the "mgrast" subcommand.

grabseqs mgrast -t 8 -m metadata.csv -o outdir -r 2 mgp##### mgm#######

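Note: many MG-RAST projects are not publicly available. If a particular accession number or project gives you trouble, first go to the MG-RAST website and check whether it can be downloaded manually. Do this first, because many samples that used to be downloadable are no longer available (from the FAQ).

For iMicrobe examples, see the GitHub repository.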

Citation

grabseqs: Simple downloading of reads and metadata from multiple next-generation sequencing data repositories
Louis J Taylor, Arwa Abbas, Frederic D Bushman
Bioinformatics, btaa167, Published: 10 March 2020