DeChat: repeat- and haplotype-aware error correction for Nanopore R10 reads

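Error self-correction is a critically important first step in the analysis of long-read sequencing data. However, most existing methods for this purpose are tuned mainly for noisy sequencing data with error rates above 5%, and they often collapse true variants in repeats and haplotypes. Other methods are optimized for PacBio HiFi reads, which leaves a gap: no method is designed specifically for Nanopore R10 reads basecalled with the high-accuracy or super-accuracy models, whose error rates are typically 2% or lower. Here, DeChat is introduced as a new approach designed specifically for Nanopore R10 reads. DeChat enables repeat- and haplotype-aware error correction by combining the strengths of de Bruijn graphs and variant-aware multiple sequence alignment into a synergistic approach. This avoids over-correcting reads: sequencing errors are corrected accurately, while variants in repeats and haplotypes are reliably preserved. Benchmarking experiments show that reads corrected with DeChat have error rates from several-fold to two orders of magnitude lower than those produced by current state-of-the-art approaches. In addition, applying DeChat for error correction substantially improves genome assembly in several respects. DeChat is implemented as a highly efficient, standalone, user-friendly software package and is available at https://github.com/LuoGroup2023/DeChat.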

Installation

I built it on Ubuntu 22.

Build dependencies

gcc 9.5+
cmake 3.2+
zlib
boost 1.67
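
Before building from source, it can help to confirm that the local toolchain meets the minimums listed above. The following is a minimal sketch, not part of DeChat: `version_ge`, `check_tool`, and `check_build_deps` are hypothetical helper names, the comparison assumes GNU `sort -V` is available, and only gcc and cmake are checked (zlib and boost are left to the package manager).

```shell
# Sketch: compare installed tool versions against the minimums listed above.
# All function names here are illustrative helpers, not part of DeChat.

# Succeed when version $1 is greater than or equal to version $2.
# Assumes GNU sort with -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print a one-line verdict for a tool, given its detected and minimum versions.
check_tool() {
    local name="$1" found="$2" minimum="$3"
    if version_ge "$found" "$minimum"; then
        echo "$name $found OK (>= $minimum)"
    else
        echo "$name $found too old (need >= $minimum)"
        return 1
    fi
}

# Check gcc and cmake if they are on PATH.
check_build_deps() {
    if command -v gcc >/dev/null 2>&1; then
        check_tool gcc "$(gcc -dumpfullversion 2>/dev/null || gcc -dumpversion)" 9.5
    else
        echo "gcc not found"
    fi
    if command -v cmake >/dev/null 2>&1; then
        check_tool cmake "$(cmake --version | head -n 1 | awk '{print $3}')" 3.2
    else
        echo "cmake not found"
    fi
}
```

Calling `check_build_deps` prints one line per tool. The version string is passed into `check_tool` as an argument so that the comparison logic can be reused for other tools.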

GitHub

https://github.com/LuoGroup2023/DeChat

#conda
mamba create -n dechat -y
conda activate dechat
mamba install -c bioconda dechat -y

#Install from source code
git clone https://github.com/LuoGroup2023/DeChat.git
mamba create -n dechat boost=1.67.0
conda activate dechat

> dechat -h

Repeat and haplotype aware error correction in nanopore sequencing reads with DeChat

Usage: dechat [options] -o <output> -t <thread> -i <reads> <...>

Options:

Input/Output:

-o STR prefix of output files [(null)]

The output for the first round of correction is "recorrected.fa",

The final corrected file is "file name".ec.fa.;

-t INT number of threads [1]

-h show help information

--version show version number

-i input reads file

-k INT k-mer length (must be <64) [21]

Error correction round 1 (dBg):

-r1 set the maximal abundance threshold for a k-mer in dBG [2]

Error correction round 2 (alignment):

-r round of correction in alignment [3]

-e maximum allowed error rate used for filtering overlaps [0.04]
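
The `-k` and `-r1` options both refer to k-mers and their abundance in the de Bruijn graph stage. As a rough illustration of what "k-mer abundance" means, the sketch below counts every k-mer in a FASTA stream with awk. This is not how DeChat is implemented: `kmer_counts` is a hypothetical helper, reverse complements are not merged, and how DeChat actually applies the `-r1` threshold is not described in the help text above.

```shell
# kmer_counts K < FASTA
# Print "kmer<TAB>count" for every k-mer of length K, sorted by k-mer.
# Illustrative only: forward strand only, everything held in memory.
kmer_counts() {
    awk -v k="$1" '
        # Add all k-mers of the sequence collected so far, then reset it.
        function add_kmers(   i, n) {
            n = length(seq)
            for (i = 1; i + k - 1 <= n; i++) count[substr(seq, i, k)]++
            seq = ""
        }
        /^>/ { add_kmers(); next }        # header line: finish previous read
        { seq = seq toupper($0) }         # sequence line: append (multi-line FASTA)
        END {
            add_kmers()
            for (kmer in count) printf "%s\t%d\n", kmer, count[kmer]
        }
    ' | LC_ALL=C sort
}

# Example: list k-mers seen at most twice in a gzip-compressed read file.
# gzip -dc reads.fa.gz | kmer_counts 21 | awk '$2 <= 2'
```

This is only practical for small inputs such as the bundled example data.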

Usage

Specify the input reads and the prefix for the output files.

git clone https://github.com/LuoGroup2023/DeChat.git
cd DeChat/example/
dechat -i reads.fa.gz -o outprefix -t 20

reads.fa.gz contains roughly 20,600 reads. The gzip-compressed file is 29 MB, and the mean read length is 5.05 kb.
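
When tuning a run, it can be useful to write every parameter out explicitly rather than relying on defaults. The sketch below only prints a command line for inspection. `dechat_cmd` is a hypothetical helper, and the values for `-k`, `-r1`, `-r`, and `-e` are simply the defaults shown in the `dechat -h` output.

```shell
# dechat_cmd READS PREFIX [THREADS]
# Print (do not run) a dechat command with the defaults from `dechat -h`
# spelled out, so individual values can be edited before execution.
dechat_cmd() {
    local reads="$1" prefix="$2" threads="${3:-1}"
    local k=21          # -k  : k-mer length (must be <64)
    local r1=2          # -r1 : maximal k-mer abundance threshold in the dBG
    local rounds=3      # -r  : rounds of alignment-based correction
    local max_err=0.04  # -e  : maximum error rate for filtering overlaps
    echo "dechat -i $reads -o $prefix -t $threads -k $k -r1 $r1 -r $rounds -e $max_err"
}

# Example:
# dechat_cmd reads.fa.gz outprefix 20
```

Printing the command first makes it easy to record exactly which settings were used for a given output file.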

Output

The final output is outprefix.ec.fa. The number of reads was unchanged, and the mean read length was about 20 bases shorter. The runtime was about 2 minutes.
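
Read counts and mean lengths like those quoted here can be checked with a few lines of awk. This is a minimal sketch: `fasta_stats` is a hypothetical helper, it assumes FASTA input (plain or gzip-compressed, detected by the `.gz` suffix), and it is not part of DeChat.

```shell
# fasta_stats FILE
# Print "<read count><TAB><mean read length>" for a FASTA file.
# Files ending in .gz are decompressed on the fly.
fasta_stats() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '
        /^>/ { n++; next }              # header line: one more read
        { total += length($0) }         # sequence line: add its bases
        END {
            if (n > 0) printf "%d\t%.1f\n", n, total / n
            else       printf "0\t0.0\n"
        }
    '
}

# Example: compare the reads before and after correction.
# fasta_stats reads.fa.gz
# fasta_stats outprefix.ec.fa
```

Running it on both the input and the corrected file shows directly whether the read count changed and by how much the mean length shifted.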

Citation

Repeat and haplotype aware error correction in nanopore sequencing reads with DeChat

Yichen Li, Enlian Chen, Jialu Xu, Wenhai Zhang, Xiangxiang Zeng, Yuansheng Liu, Xiao Luo

bioRxiv, posted May 10, 2024.
