Ratatosk: a hybrid error correction tool for error-prone long reads

2020/7/26: addendum added

2022/06/03: help output updated

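Making whole-genome sequencing routine increasingly requires long-read sequencing (LRS) technologies that complement short-read sequencing (SRS). LRS platforms produce reads of DNA fragments 10³ to 10⁶ bases long, which resolves many of the uncertainties that SRS reads leave in genome reconstruction and analysis. In particular, LRS can characterize long, complex structural variants that the short reads of SRS could not detect. Assemblies built from LRS reads are also considerably more contiguous than those built from SRS reads, because they span telomeric and centromeric regions that were previously inaccessible. The major obstacle to adopting LRS reads is their error rate, which reaches up to 15% and is far higher than that of SRS reads; this hampers downstream analysis pipelines.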

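Ratatosk is a new error correction method for erroneous long reads, based on a compacted and colored de Bruijn graph built from accurate short reads. Short and long reads color paths in the graph, and vertices are annotated with candidate single nucleotide polymorphisms. Long reads are then anchored to the graph using exact or inexact k-mer matches to find the paths corresponding to corrected sequences. The authors demonstrate that Ratatosk reduces the raw error rate of Oxford Nanopore reads 6-fold on average, with a median error rate as low as 0.28%. Ratatosk-corrected data maintain nearly 99% accurate SNP calls and improve indel call accuracy by up to about 40% compared with the raw data. An assembly of the Ashkenazi individual HG002 built from Ratatosk-corrected Oxford Nanopore reads reached a contig N50 of 43.22 Mbp, surpassing a high-quality LRS assembly built from PacBio HiFi reads. Notably, the assembly from Ratatosk-corrected reads contained about 2.5 times fewer errors than the assembly from PacBio HiFi reads.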
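To make the anchoring step concrete, here is a toy sketch in shell and awk. It is not Ratatosk's implementation: it compares a long read against a single short read instead of a colored de Bruijn graph, uses exact matches only, and uses k=3 rather than the defaults of 31 and 63 listed in the help output further down. The function names kmers and anchor_positions and the sequences are made up for this illustration.

```shell
# Print every k-mer of a sequence, one per line.
kmers() {
  printf '%s\n' "$1" |
    awk -v k="$2" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }'
}

# Print the 1-based start positions in the long read whose k-mer
# also occurs in the (accurate) short read. These are the exact anchors.
anchor_positions() {
  long_read=$1
  short_read=$2
  k=$3
  { kmers "$short_read" "$k"; echo "--"; kmers "$long_read" "$k"; } |
    awk '$0 == "--" { second = 1; next }     # separator between the two k-mer lists
         !second    { solid[$0] = 1; next }  # k-mers seen in the short read
                    { pos++; if ($0 in solid) print pos }'
}

# Long read with one substitution (A -> T at position 5) versus the short read.
anchor_positions ACGTTCGT ACGTACGT 3
```

The call prints positions 1, 2 and 6. The k-mers starting at positions 3 to 5 overlap the substituted base and find no match, so the stretch between the anchors is the region a corrector would try to replace with a path supported by the short reads.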

Installation

Build dependencies

C++11 compiler:
GCC >= 4.8.5
Clang >= 3.5
Cmake >= 2.8.12
Zlib

GitHub

git clone --recursive https://github.com/DecodeGenetics/Ratatosk.git
cd Ratatosk
mkdir build && cd build
cmake ..
make -j
make install
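Before building, it can help to confirm that the tools the steps above rely on are on PATH. The following is a minimal sketch; check_tools is a hypothetical helper name, and it only checks presence, not the minimum versions listed under the build dependencies.

```shell
# Report which of the given build tools are missing from PATH.
# Returns non-zero if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Tools used by the build steps; swap g++ for clang++ if that is your compiler.
check_tools git cmake make g++ || echo "Install the missing tools before running cmake."
```

Compiler and CMake versions can then be compared by hand against the minimums above using g++ --version and cmake --version.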

> Ratatosk

Ratatosk 0.7.6

Hybrid error correction of long reads using colored de Bruijn graphs

Usage: Ratatosk [COMMAND] [PARAMETERS]

Usage: Ratatosk --help

Usage: Ratatosk --version

Usage: Ratatosk --cite

[COMMAND]:

correct Correct long reads with short reads

index Prepare a Ratatosk index (advanced)

Use "Ratatosk [COMMAND] --help" to get a specific command help

> Ratatosk correct

Ratatosk 0.7.6

Hybrid error correction of long reads using colored de Bruijn graphs

Usage: Ratatosk [COMMAND] [PARAMETERS]

Usage: Ratatosk --help

Usage: Ratatosk --version

Usage: Ratatosk --cite

[COMMAND]: correct

[PARAMETERS]:

> Mandatory with required argument:

-s, --in-short Input short read file to correct (FASTA/FASTQ possibly gzipped)

List of input short read files to correct (one file per line)

-l, --in-long Input long read file to correct (FASTA/FASTQ possibly gzipped)

List of input long read files to correct (one file per line)

-o, --out-long Output corrected long read file

> Optional with required argument:

-c, --cores Number of cores (default: 1)

-S, --subsampling Rate of short reads subsampling (default: Auto)

-t, --trim-split Trim and split bases with quality score < t (default: no trim/split)

Only sub-read with length >= 63 are output if used

-u, --in-unmapped-short Input read file of the unmapped short reads (FASTA/FASTQ possibly gzipped)

List of input read files of the unmapped short reads (one file per line)

-a, --in-accurate-long Input high quality long read file (FASTA/FASTQ possibly gzipped)

List of input high quality long read files (one file per line)

(Those reads are NOT corrected but assist the correction of reads in input)

-g, --in-graph Load graph file prepared with the index command

-d, --in-unitig-data Load unitig data file prepared with the index command

> Optional with no argument:

-v, --verbose Print information

[ADVANCED PARAMETERS]:

> Optional with required argument:

-m, --min-conf-snp-corr Minimum confidence threshold to correct a SNP (default: 0.9)

-M, --min-conf-color2 Minimum confidence threshold to color vertices for 2nd pass (default: 0)

-C, --min-len-color2 Minimum length of a long read to color vertices for 2nd pass (default: 3000)

-i, --insert-sz Insert size of the input paired-end short reads (default: 500)

-k, --k1 Length of short k-mers for 1st pass (default: 31)

-K, --k2 Length of long k-mers for 2nd pass (default: 63)

-w, --max-len-weak1 Do not correct non-solid regions >= w bases during 1st pass (default: 1000)

-W, --max-len-weak2 Do not correct non-solid regions >= w bases during 2nd pass (default: 5000)

> Optional with no argument:

-1, --1st-pass-only Perform *only* the 1st correction pass (default: false)

-2, --2nd-pass-only Perform *only* the 2nd correction pass (default: false)

[EXPERIMENTAL PARAMETERS]:

> Optional with required argument:

-L, --in-long_raw Input long read file from 1st pass (FASTA/FASTQ possibly gzipped)

List of input long read files to correct (one file per line)

-p, --in-short-phase Input short read phasing file (diploid only)

List of input short read phasing files (one file per line)

-P, --in-long-phase Input long read phasing file (diploid only)

List of input long read phasing files (one file per line)

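Usage

de novo correction

Specify the long reads to correct and the short reads used to correct them. Ratatosk uses CPUs efficiently, so give it a generous number of cores if they are available.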

Ratatosk correct -v -c 40 -s input_short_reads.fq -l input_long_reads.fq -o out_long_reads.fq

#paired-end
Ratatosk correct -v -c 40 -s pair*.fq.gz -l input_long_reads.fq.gz -o out_long_reads.fq

-s Input short read file to correct (FASTA/FASTQ possibly gzipped). List of input short read files to correct (one file per line)
-l Input long read file to correct (FASTA/FASTQ possibly gzipped). List of input long read files to correct (one file per line)
-o Output corrected long read file
-c Number of cores (default: 1)
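-q Output Quality Scores: corrected bases get QS >= t (default: t=0, no output)
-t Trim bases with quality score < t (default: t=0, no trimming)

Note: -q is not listed in the 0.7.6 help above, and -t is described there as trim and split. Check Ratatosk correct --help for the installed version.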
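According to the help text, -s and -l also accept a list file with one input file per line, which is an alternative to a shell glob when there are several short-read files. The sketch below assumes that behaviour as documented in the help; write_read_list is a hypothetical helper, and the file names are placeholders.

```shell
# Write one read-file path per line to a list file (first argument).
write_read_list() {
  list=$1
  shift
  printf '%s\n' "$@" > "$list"
}

# Placeholder file names; replace with your own paired-end files.
write_read_list short_reads.txt pair_1.fq.gz pair_2.fq.gz

# Run the correction only if Ratatosk is on PATH and the long reads exist.
if command -v Ratatosk >/dev/null 2>&1 && [ -f input_long_reads.fq.gz ]; then
  Ratatosk correct -v -c 40 -s short_reads.txt -l input_long_reads.fq.gz -o out_long_reads.fq
fi
```

A list file also keeps a record of exactly which short-read files went into a given correction run.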

Output

> seqkit stats

[Screenshot: seqkit stats output]

The number of reads does not change (*1).
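The read count can also be checked without seqkit. This is a rough sketch that assumes standard 4-line FASTQ records (no wrapped sequence lines); count_fastq_reads and same_read_count are hypothetical helper names. If -t is used, reads are trimmed and split according to the help text, so the counts would presumably differ in that case.

```shell
# Count records in a FASTQ file (plain or .gz), assuming 4 lines per record.
count_fastq_reads() {
  case "$1" in
    *.gz) gzip -cd "$1" ;;
    *)    cat "$1" ;;
  esac | awk 'END { print int(NR / 4) }'
}

# Print both counts and succeed only if they are equal.
same_read_count() {
  before=$(count_fastq_reads "$1")
  after=$(count_fastq_reads "$2")
  echo "before=$before after=$after"
  [ "$before" -eq "$after" ]
}
```

For the run above this would be called as same_read_count input_long_reads.fq out_long_reads.fq.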

Reference-guided correction

See the wiki.

Citation

Ratatosk – Hybrid error correction of long reads enables accurate variant calling and assembly

Guillaume Holley, Doruk Beyter, Helga Ingimundardottir, Snædis Kristmundsdottir, Hannes P. Eggertsson, Bjarni V. Halldorsson

bioRxiv, Posted July 15, 2020

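*1

Using the error-corrected long reads, I ran a quick comparison of assembly performance with Flye.

To keep the conditions identical, both the uncorrected and the corrected reads were assembled in Flye's raw-read mode and evaluated with QUAST; the result is shown below. The right side is the assembly from the corrected reads and the left side is the assembly from the uncorrected reads, each showing mismatch rates against the reference. Indel mismatches in particular dropped markedly. Many error correction tools are sensitive to parameters and awkward to use, but Ratatosk ran without any errors and its runtime was short, so I found it very easy to use (just my personal impression).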

[Screenshot: QUAST mismatch and indel rates; left: uncorrected reads, right: corrected reads]