超高速なfastqの前処理ツール RabbitQCPlus - macでインフォマティクス

2023/01/28 追記

　シーケンサーデータの品質評価は、ダウンストリームデータ解析において重要な役割を担っている。しかし、既存のツールは、特に圧縮ファイルを扱う場合や、過剰発現解析のような複雑な品質管理操作を行う場合、最適とは言えない効率を達成することが多い。本発表では、最新のマルチコアシステムに対応した超高効率的な品質管理ツールRabbitQCPlusを紹介する。RabbitQCPlusはベクトル化、メモリコピー削減、並列（デ）圧縮、最適化されたデータ構造により、大幅な性能向上を実現している。基本的な品質管理操作を行う場合、最新のアプリケーションと比較して1.1～5.4倍高速になり、必要な計算資源も少なくなる。さらに、RabbitQCPlusはgzip圧縮されたFASTQファイルを処理する場合、他のアプリケーションと比較して少なくとも4倍高速に処理することができる。さらに、280GBのプレーンなFASTQシーケンスデータを処理するのに48コアのサーバーでリードごとの過剰発現解析を有効にすると、他のアプリケーションでは少なくとも22分かかるが、RabbitQCPlusは4分未満で済む。C++ソースコードは https://github.com/RabbitBio/RabbitQCPlus で入手できる。

インストール

ビルド依存

gcc 4.8.5 or newer
zlib

make時、システムが利用可能なCPUの命令セットは自動認識される。手動で設定することもでき、拡張命令avx256とavx512にも対応している。詳しくはレポジトリを参照。

Github

git clone https://github.com/RabbitBio/RabbitQCPlus.git
cd RabbitQCPlus
make -j4
make install

> RabbitQCPlus -h

RabbitQCPlus

Usage: RabbitQCPlus [OPTIONS]

Options:

-h,--help Print this help message and exit

-i,--inFile1 TEXT input fastq name 1

-I,--inFile2 TEXT input fastq name 2

-o,--outFile1 TEXT output fastq name 1

-O,--outFile2 TEXT output fastq name 2

--compressLevel INT output file compression level (1 - 9), default is 4

--overWrite overwrite out file if already exists.

--phred64 input is using phred64 scoring, default is phred33

--stdin input from stdin, or -i /dev/stdin, only for se data or interleaved pe data(which means use --interleavedIn)

--stdout output to stdout, or -o /dev/stdout, only for se data or interleaved pe data(which means use --interleavedOut)

--notKeepOrder do not keep order as input when outputing reads (may slightly improve performance if opened), default is off

-a,--noTrimAdapter don't trim adapter, default is off

--decAdaForSe detect adapter for se data, default is on

--decAdaForPe detect adapter for pe data, default is off, tool prefers to use overlap to find adapter

--printWhatTrimmed if print what trimmed to *_trimmed_adapters.txt or not, default is not

--adapterSeq1 TEXT specify adapter sequence for read1

--adapterSeq2 TEXT specify adapter sequence for read2

--adapterFastaFile TEXT specify adapter sequences use fasta file

-c,--correctData correcting low quality bases using information from overlap, default is off

-w,--threadNum INT number of thread used to do QC, including (de)compression for compressed data, default is 8

-5,--trim5End do sliding window from 5end to 3end to trim low quality bases, default is off

-3,--trim3End do sliding window from 5end to 3end to trim low quality bases, default is off

--trimFront1 INT number of bases to trim from front in read1, default is 0

--trimFront2 INT number of bases to trim from front in read2, default is 0

--trimTail1 INT number of bases to trim from tail in read1, default is 0

--trimTail2 INT number of bases to trim from tail in read2, default is 0

-g,--trimPolyg do polyg tail trim, default is off

-x,--trimPolyx do polyx tail trim, default is off

-u,--addUmi do unique molecular identifier (umi) processing, default is off

--umiLen INT umi length if it is in read1/read2, default is 0

--umiLoc TEXT specify the location of umi, can be (index1/index2/read1/read2/per_index/per_read), default is 0

--umiPrefix TEXT identification to be added in front of umi, default is no prefix

--umiSkip INT the number bases to skip if umi exists, default is 0

--TGS process third generation sequencing (TGS) data (only for se data, does not support trimming and will not produce output files), default is off

-p,--doOverrepresentation do over-representation sequence analysis, default is off

-P,--overrepresentationSampling INT

do overrepresentation every [] reads, default is 20

--printORPSeqs if print overrepresentation sequences to *ORP_sequences.txt or not, default is not

--noInsertSize no insert size analysis (only for pe data), default is to do insert size analysis

--interleavedIn use interleaved input (only for pe data), default is off

--interleavedOut use interleaved output (only for pe data), default is off

-V,--version application version

実行方法

ショートリード

#single-end、12スレッド
RabbitQCPlus -w 12 -i in1.fastq.gz -o p1.fastq.gz

#paired-end、20スレッド
RabbitQCPlus -w 40 -i in1.fastq.gz -I in2.fastq.gz -o p1.fastq.gz -O p2.fastq.gz

ロングリード

RabbitQCPlus -w 4 -i in.fastq --TGS

オプションを指定しない場合もhtmlレポートも出力される。レポジトリの例。

ファイルサイズ12GBの生のペアエンドのfastqファイルの前処理にかかった時間は、

64 CPU - 12.5秒

40 CPU - 15.0秒

20 CPU - 31.7秒

という結果でした（TR3990x、256GBメモリ、PCIe4接続SSD環境、n=1）。

このストレージ上で同じファイルをコピーして書き出すとおよそ10.5秒かかりました（ubuntuのcpコマンド使用）。12.5秒というタイムは理論限界値に近い値が出ていると言えるんじゃないかと思います（注；本SSDのシーケンシャル読み書きのスペック値は6GB/s）。avx256やavx512拡張命令に対応した最近のメニーコアCPUとPCIe5以上の複数レーンに対応した高速なストレージを組み合わせると、より時間短縮できるかもしれません。

引用

RabbitQCPlus: More Efficient Quality Control for Sequencing Data

2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

DOI Bookmark: 10.1109/BIBM55620.2022.9995332