Fec: fast error correction based on two-round overlapping and caching

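Third-generation sequencing advances genome analysis with long reads, but the reads have a high error rate, so error correction is required. Error correction is time-consuming, especially when sequencing coverage is high. Existing error-correction methods generally perform a base-level alignment from B to A when correcting read A with an overlapping read B, and a separate base-level alignment from A to B when correcting read B. The authors observe, however, that the base-level alignment information can be reused. The paper introduces Fec, a fast error-correction tool based on two-round overlapping and caching. Fec can be used on its own or as the error-correction step of an assembly pipeline. In the first round, Fec uses a large window size (20) to quickly find enough overlaps to correct most reads. In the second round, it uses a small window size (5) to find more overlaps for the reads that had too few overlaps in the first round. When a base-level alignment is needed, Fec first searches the cache. If the alignment is in the cache, Fec retrieves it and infers the second alignment from it; otherwise, Fec performs the base-level alignment and stores it in the cache. Tested on nine datasets, Fec was 1.24 to 38.56 times faster than MECAT, CANU, and MINICNS on five PacBio datasets, and 1.16 to 27.8 times faster than NECAT and CANU on four nanopore datasets.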
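The cache lookup described above can be illustrated with a toy shell sketch. This is not Fec's actual code: `align`, `get_alignment`, the read names, and the file-per-alignment cache are all hypothetical stand-ins, and the "alignment" is just a string. It only shows the control flow: look for the reverse alignment first, and run the expensive step only on a miss.

```shell
# Toy sketch of the cache-lookup idea; not Fec's actual implementation.
cache_dir=$(mktemp -d)    # one file per cached alignment, named "<from>__<to>"
computed=0                # number of base-level alignments actually run

# Hypothetical stand-in for an expensive base-level alignment of read $1 onto read $2.
align() {
  computed=$((computed + 1))
  result="aln($1->$2)"
}

# Sets $result; aligns and stores only when neither direction is cached.
get_alignment() {
  if [ -f "$cache_dir/${2}__${1}" ]; then
    # The reverse alignment is cached: derive this one from it.
    result="inferred from $(cat "$cache_dir/${2}__${1}")"
  elif [ -f "$cache_dir/${1}__${2}" ]; then
    result=$(cat "$cache_dir/${1}__${2}")
  else
    align "$1" "$2"
    printf '%s\n' "$result" > "$cache_dir/${1}__${2}"
  fi
}

get_alignment readB readA; first=$result    # miss: aligned and cached
get_alignment readA readB; second=$result   # hit: inferred from the cached B->A alignment
echo "$first"
echo "$second"
echo "alignments computed: $computed"
```

Only one alignment is computed for the pair, which is the saving the abstract attributes to caching.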

Installation

I installed it with conda.

#conda(link)
mamba create -n fec -y
conda activate fec
mamba install -c bioconda fec -y

> Fec

USAGE:

Fec [options] input reads output

OPTIONS:

-x <0/1> data type: 0 = PacBio, 1 = Nanopore

-t <Integer> number of threads (CPUs)

-p <Integer> batch size that the reads will be partitioned

-r <Real> minimum mapping ratio

-a <Integer> minimum overlap size

-c <Integer> minimum coverage under consideration

-l <Integer> minimum length of corrected sequence

-k <Integer> number of partition files to open at one time (if < 0, then it will be set to system limit value)

-e <Integer> use cache or not: 0 = not use, 1 = use

-s <Integer> perform second-round overlapping or not: 0 = not perform, 1 = perform

-m <String> filter out top fraction repetitive minimizers of the second-round overlapping

-f <String> minimum overlap ratio used for the second-round overlapping filtering

-K <Integer> k-mer size for the second-round overlapping

-w <Integer> minimizer window size for the second-round overlapping

-H use homopolymer-compressed k-mer for the second-round overlapping

-R reuse long indel

-F full consensus

-h print usage info.

Default Options:

-t 1 -p 100000 -r 0.6 -a 1000 -c 4 -l 2000 -k 100 -e 1 -s 1 -m 0.0002 -f 0.6 -K 15 -w 5 -H

How to run

PacBio

#small datasets
minimap2 -x ava-pb -w 20 -K 2g -t 20 reads.fq reads.fq | awk '{ if($4 - $3 >= 0.5 * $2 || $9 - $8 >= 0.5 * $7) print $0}' > ovlp.paf
Fec -t 20 -r 0.6 -a 1000 -c 4 -l 2000 ovlp.paf reads.fq corrected.fasta

#large datasets (e.g., human)
minimap2 -x ava-pb -w 20 -K 2g -f 0.005 -t 20 reads.fq reads.fq | awk '{ if($4 - $3 >= 0.2 * $2 || $9 - $8 >= 0.2 * $7) print $0}' > ovlp.paf
Fec -t 20 -r 0.6 -a 1000 -c 4 -l 2000 -m 0.005 -f 0.2 ovlp.paf reads.fq corrected.fasta
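The awk step in the commands above filters the PAF output of minimap2. In PAF, column 2 is the query length, columns 3 and 4 the query start and end, column 7 the target length, and columns 8 and 9 the target start and end, so the condition keeps an overlap when it covers at least the given fraction of either read. A small sketch on two made-up records (the read names and coordinates are invented for illustration) shows the effect of the 0.5 threshold:

```shell
# Two synthetic PAF records (tab-separated); the values are made up for illustration.
paf=$(mktemp)
printf 'r1\t10000\t0\t6000\t+\tr2\t12000\t100\t6100\t5500\t6000\t60\n' >  "$paf"
printf 'r3\t10000\t0\t1500\t+\tr4\t12000\t0\t1500\t1400\t1500\t60\n'   >> "$paf"

# Same condition as in the small-dataset command: keep an overlap if it spans
# at least half of the query ($4-$3 vs $2) or half of the target ($9-$8 vs $7).
# Only the query name is printed here to keep the output short.
kept=$(awk '{ if ($4 - $3 >= 0.5 * $2 || $9 - $8 >= 0.5 * $7) print $1 }' "$paf")
echo "$kept"
```

The first record (6000 of 10000 bases overlapped) passes and the second (1500 bases) is dropped. The large-dataset command loosens the fraction to 0.2.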

Note: Fec must be given uncompressed sequencing reads.
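If the reads are stored gzipped, they can be decompressed to a plain file first. The sketch below uses hypothetical file names and creates a tiny gzipped FASTQ only so that it runs on its own. In practice only the `gzip -dc` line is needed.

```shell
# Hypothetical file names; a one-read gzipped FASTQ is created here for demonstration only.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/reads.fq.gz"

# Decompress to a plain FASTQ while keeping the original .gz file.
gzip -dc "$workdir/reads.fq.gz" > "$workdir/reads.fq"

# Sanity check: with 4 lines per record, the read count is lines / 4.
n_reads=$(( $(wc -l < "$workdir/reads.fq") / 4 ))
echo "reads: $n_reads"
```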

Nanopore

#two round correction
minimap2 -x ava-ont -w 20 -K 2g -f 0.005 -t 20 ONT.fq ONT.fq | awk '{ if($4 - $3 >= 0.2 * $2 || $9 - $8 >= 0.2 * $7) print $0}' > ovlp.paf
Fec -x 1 -t 20 -r 0.6 -a 400 -c 0 -l 1000 -m 0.005 -f 0.2 ovlp.paf  ONT.fq corrected1.fasta
minimap2 -x ava-ont -w 20 -K 2g -f 0.005 -t 20 corrected1.fasta corrected1.fasta | awk '{ if($4 - $3 >= 0.2 * $2 || $9 - $8 >= 0.2 * $7) print $0}' > ovlp2.paf
Fec -x 1 -R -t 20 -r 0.6 -a 1000 -c 4 -l 2000 -m 0.005 -f 0.2 ovlp2.paf corrected1.fasta corrected2.fasta
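One simple way to compare the two rounds is to count the reads and bases in each corrected FASTA. The helper below is a hypothetical convenience function, not part of Fec, and the demo file is made up so the example runs on its own. In practice it would be called on corrected1.fasta and corrected2.fasta.

```shell
# Print the number of sequences and total bases in a FASTA file
# (works for both single-line and wrapped sequences).
fasta_stats() {
  awk '/^>/ { n++; next } { bases += length($0) } END { printf "%d reads, %d bases\n", n, bases }' "$1"
}

# Tiny made-up FASTA for demonstration.
demo=$(mktemp)
printf '>read1\nACGTACGT\nACGT\n>read2\nGGCC\n' > "$demo"
stats=$(fasta_stats "$demo")
echo "$stats"
```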

Example output

(input: ONT.fq)