fasterq-dump: a faster version of the SRA Toolkit's fastq-dump

2019 4/29 added multi-file download example; 8/13 fixed download example code; 12/18 fixed installation error; 12/21 added usage example

2020 1/21 fixed download example code; 4/1 added link

2023/07/22 added docker image example

The command does what the title says. This post only gives a brief introduction to how to use it.

worked all day on a bash scrip to fetch & convert all European and African @1000genomes SRA files. <for i in *.sra ; do fasterq-dump $i -O ./ -t $home/Desktop/fasterqdumptempfiles -e 12 -S -p ; done > mac is smoking now.
— Phillip Buckhaults (@P_J_Buckhaults) February 12, 2019

Fasterq-dump is about 4x faster when coupled with pigz (assuming you want .fastq.gz) when tested on a 4.6 GB sra file
https://t.co/G0J1LpXvoI
— Ben Johnson (@biobenkj) November 22, 2018

Installation

fasterq-dump becomes available once sra-tools is installed.

Download link

https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/

Github

#bioconda
mamba create -n sratools -y
conda activate sratools
mamba install -c bioconda -y sra-tools

#docker (image by ayyuan; probably unofficial)
docker pull ayyuan/fasterq-dump:1.0
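
After either install route it can help to confirm that the command is actually on PATH before starting a long download. The following is a minimal sketch; the helper name `have_tool` is hypothetical and only used for illustration.

```shell
# Return success if the named command can be found on PATH.
# (have_tool is a hypothetical helper name, not part of sra-tools.)
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

if have_tool fasterq-dump; then
  # --version is listed in the fasterq-dump help output
  fasterq-dump --version
else
  echo "fasterq-dump not found on PATH; activate the sratools environment first" >&2
fi
```

If the message is printed, `conda activate sratools` was probably not run in the current shell.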

#Later (2019 12/18), when testing on another machine, fasterq-dump was not installed, so I downloaded the binary package from the Anaconda server.

#darwin
wget https://anaconda.org/bioconda/sra-tools/2.10.0/download/osx-64/sra-tools-2.10.0-pl526h6de7cb9_0.tar.bz2
#extract (the archive is bzip2-compressed, so use j instead of z)
tar jxvf sra-tools-2.10.0-pl526h6de7cb9_0.tar.bz2
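#move the fasterq-dump binary (not fastq-dump) to a directory on PATH; the file name assumes the archive ships bin/fasterq-dump-orig.2.10.0
mv bin/fasterq-dump-orig.2.10.0 /usr/local/bin/fasterq-dump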

#linux 64bit
wget https://anaconda.org/bioconda/sra-tools/2.10.1/download/linux-64/sra-tools-2.10.1-pl526haddd2b5_0.tar.bz2

> fasterq-dump

$ fasterq-dump

Usage:

fasterq-dump <path> [options]

Options:

-o|--outfile output-file

-O|--outdir output-dir

-b|--bufsize size of file-buffer dflt=1MB

-c|--curcache size of cursor-cache dflt=10MB

-m|--mem memory limit for sorting dflt=100MB

-t|--temp where to put temp. files dflt=curr dir

-e|--threads how many thread dflt=6

-p|--progress show progress

-x|--details print details

-s|--split-spot split spots into reads

-S|--split-files write reads into different files

-3|--split-3 writes single reads in special file

--concatenate-reads writes whole spots into one file

-Z|--stdout print output to stdout

-f|--force force to overwrite existing file(s)

-N|--rowid-as-name use row-id as name

--skip-technical skip technical reads

--include-technical include technical reads

-P|--print-read-nr print read-numbers

-M|--min-read-len filter by sequence-len

--table which seq-table to use in case of pacbio

--strict terminate on invalid read

-B|--bases filter by bases

-h|--help Output brief explanation for the program.

-V|--version Display the version of the program then quit.

-L|--log-level <level> Logging level as number or enum string. One of (fatal|sys|int|err|warn|info|debug) or (0-6) Current/default is warn

-v|--verbose Increase the verbosity of the program status messages. Use multiple times for more verbosity. Negates quiet.

-q|--quiet Turn off all status messages for the program. Negated by verbose.

--option-file <file> Read more options and parameters from the file.

fasterq-dump : 2.9.1 ( 2.9.1-1 )

Usage

Download SRR000001 using 8 threads and show progress.

fasterq-dump SRR000001 -e 8 -p

-e how many thread dflt=6
-p show progress
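
Rather than hard-coding the thread count, it can be derived from the machine. The sketch below assumes `getconf _NPROCESSORS_ONLN` is available (it is on typical Linux and macOS systems) and falls back to 6, the fasterq-dump default. It only prints the resulting command so it is safe to paste.

```shell
# Detect the number of online CPU cores; fall back to 6 (the fasterq-dump default).
threads="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 6)"

# Print the command instead of running it; remove "echo" to execute.
echo fasterq-dump SRR000001 -e "$threads" -p
```

Using every core is not always fastest, since disk and network throughput often become the bottleneck first.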

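The paired-end fastq files are written to the current directory.

Run with /tmp as the working (temporary) directory and output/ under the current directory as the output directory.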

fasterq-dump SRR000001 -O output -t /tmp -e 8 -p

-O output-dir
-t where to put temp. files dflt=curr dir

$ ls -alth output/

total 593920

-rw-r--r-- 1 kazuma wheel 208M 4 18 00:13 SRR000001_1.fastq

-rw-r--r-- 1 kazuma wheel 75M 4 18 00:13 SRR000001_2.fastq
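
The same run can be wrapped so that temporary files go to a private scratch directory that is removed afterwards. This is a sketch under assumptions: the `run` helper and the `DRY_RUN` variable are hypothetical, and by default the commands are only printed.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
# (run and DRY_RUN are hypothetical names used only in this sketch.)
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

acc="SRR000001"
outdir="output"
tmpdir="$(mktemp -d)"            # private scratch directory for temp files
trap 'rm -rf "$tmpdir"' EXIT     # remove the scratch directory on exit

run mkdir -p "$outdir"
run fasterq-dump "$acc" -O "$outdir" -t "$tmpdir" -e 8 -p
```

Running the script with `DRY_RUN=0` performs the actual download; the scratch directory needs enough free space for the temporary files.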

Addendum

Downloading multiple SRA runs (one example)

#download the following three paired-end fastq datasets
#here document
cat >sra_ids.txt <<EOF
SRR10712689
SRR10713826
SRR10709253
EOF

#loop with while. 12 threads for fasterq-dump; gzip with pigz (16 threads specified here)
while read -r line; do
 fasterq-dump "$line" -O ./ -e 12 -p
 pigz -p 16 "${line}_1.fastq"
 pigz -p 16 "${line}_2.fastq"
done < sra_ids.txt
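
The loop above assumes every run is paired-end and that pigz is installed. A slightly more defensive variant compresses whatever fastq files were actually written for an accession. This is a sketch; `compress_fastq` is a hypothetical helper, and the fallback to gzip is an assumption added here.

```shell
# Compress every FASTQ written for one accession, whether the run
# produced ACC.fastq, ACC_1.fastq, ACC_2.fastq, or a mix of these.
# Usage: compress_fastq ACCESSION [DIR] [THREADS]
compress_fastq() {
  acc="$1"
  dir="${2:-.}"
  threads="${3:-4}"
  for f in "$dir/$acc".fastq "$dir/${acc}"_*.fastq; do
    [ -e "$f" ] || continue            # skip unmatched globs and missing files
    if command -v pigz >/dev/null 2>&1; then
      pigz -p "$threads" "$f"          # parallel gzip
    else
      gzip "$f"                        # single-threaded fallback
    fi
  done
}
```

Inside the download loop, the two pigz lines could then be replaced by a single call such as `compress_fastq "$line" . 16`.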

The GitHub page also suggests that, when a RAM disk is available, using it as the working directory speeds things up.
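
As a rough sketch of the RAM disk idea: on many Linux systems /dev/shm is a memory-backed tmpfs, so it can be passed to -t when present. The helper name `pick_tmpdir` is hypothetical, and the choice of /dev/shm is an assumption that only holds on Linux with enough free RAM for the temporary files.

```shell
# Prefer a memory-backed directory for temp files when one is writable;
# otherwise fall back to TMPDIR or /tmp.
# (pick_tmpdir is a hypothetical helper; /dev/shm is a Linux-specific assumption.)
pick_tmpdir() {
  if [ -d /dev/shm ] && [ -w /dev/shm ]; then
    echo /dev/shm
  else
    echo "${TMPDIR:-/tmp}"
  fi
}

# Print the command instead of running it; remove "echo" to execute.
echo fasterq-dump SRR000001 -O output -t "$(pick_tmpdir)" -e 8 -p
```

For large runs the temporary files can be several times the size of the final output, so the disk-backed fallback may be the safer choice.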

Reference

Using the SRA Toolkit to convert .sra files into other formats

https://www.ncbi.nlm.nih.gov/books/NBK158900/#SRA_download.what_is_the_purpose_of_the

2020 4/1 addendum

Please also try the tool linked here. It makes it easy to download individual datasets, the sequence data of an entire project, and metadata.