ntEdit: a fast draft genome polishing tool that scales to large genomes

2019 5/17 Added paper citation, fixed title

2020 10/9 Fixed command

2021 9/15 Added installation steps

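2022/06/05 Added conda installation

Over the past decade, next-generation sequencing technologies have greatly increased their throughput. For example, 50-fold coverage sequencing of a 20 Gbp conifer genome can now be achieved in a single run of an 8-lane flow cell on an Illumina HiSeq-X instrument. However, this volume of data creates bottlenecks in bioinformatics pipelines. With short-read data, diploid sequence assemblies typically yield pseudo-haploid draft genomes that represent unresolved allelic mixtures. Depending on the method used, these assemblies can contain appreciable errors. Many tools exist for correcting errors in reads from high-throughput sequencing platforms [ref.1], but very few genome assembly polishing tools are available.
Leading tools for assembly polishing include GATK [ref.2], Pilon [ref.3], and Racon [ref.4]. Pilon and GATK are established, comprehensive tools for genome improvement, with the ability to fill short gaps, fix local misassemblies, and identify and report variant bases. Racon, by comparison, is a more recent utility originally designed as a fast nanopore read correction tool. It is used to polish single-molecule sequencing (SMS) genome drafts, such as those assembled from Pacific Biosciences (PacBio) or Oxford Nanopore (Nanopore) reads, using Illumina data [ref.5]. Given its state-of-the-art polishing accuracy, Pilon is a robust genome assembly improvement tool that is routinely used to polish microbial and small eukaryotic genomes (<100 Mbp). It has also been applied to human assemblies [ref.6], but unfortunately its run time scales quadratically.
All of the tools above rely on read alignment. This paradigm gives context to the bases under scrutiny, but at the expense of run time. To address these scalability limits, the authors developed ntEdit, a utility that uses words of length k (k-mers) to correct homozygous errors in very large genome assemblies (>3 Gbp). ntEdit uses a succinct Bloom filter data structure for evaluation and correction. By comparing it with the basic polishing abilities of other tools, namely their ability to fix base substitutions and indels, the authors show that ntEdit produces comparable results and scales linearly to the 3 Gbp human genome and the large 20 Gbp spruce genome.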
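
To make the k-mer idea concrete, here is a toy sketch in POSIX sh and awk. It is not ntEdit's implementation: an exact in-memory set of k-mers stands in for the Bloom filter, and the function name `missing_kmers` and both sequences are made up for illustration. It only shows that a single wrong base in a draft makes every k-mer overlapping that base disappear from the set of k-mers seen in trusted data.

```shell
#!/bin/sh
# Toy illustration only (assumption: an exact awk array replaces the Bloom filter).
set -eu

# Count the k-mers of DRAFT that never occur in TRUSTED.
# usage: missing_kmers DRAFT TRUSTED K
missing_kmers() {
    awk -v draft="$1" -v trusted="$2" -v k="$3" 'BEGIN {
        # Record every k-mer of the trusted sequence.
        for (i = 1; i + k - 1 <= length(trusted); i++)
            seen[substr(trusted, i, k)] = 1
        # Count draft k-mers that are absent from that set.
        n = 0
        for (i = 1; i + k - 1 <= length(draft); i++)
            if (!(substr(draft, i, k) in seen))
                n++
        print n
    }'
}

trusted=ACGTTGCAAGGCTTACGGAT   # hypothetical "true" sequence
draft=ACGTTGCAATGCTTACGGAT     # same sequence with G->T at position 10

missing_kmers "$draft" "$trusted" 5    # prints 5: the five 5-mers covering position 10
missing_kmers "$trusted" "$trusted" 5  # prints 0: nothing is missing
```

A run of absent k-mers therefore marks a candidate error position without aligning any reads, which is the property that lets a k-mer approach avoid the alignment cost described above.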

ntEdit polishing of the Axolotl (mexican salamander) PacBio draft genome, largest yet@32Gbp. After ntHits on avail. Illumina seqs(<8X), https://t.co/Pf9uFsjMp0 took <20m, 94GB RAM & made 59M edits+fixed frame shift errors to recover 106 (3%) xtra complete BUSCO genes. Just sayin
— René Warren (@WarrenRene) April 12, 2019

Installation

Tested on Ubuntu 16.04 (using Docker; host OS macOS 10.14).

Build dependencies

A C++ compiler, zlib, make, autoconf, automake, etc.

Dependencies

ntHits (https://github.com/bcgsc/nthits)
BloomFilter utilities (provided in ./lib)
kseq (provided in ./lib)

#Build ntHits
git clone https://github.com/bcgsc/ntHits.git
cd ntHits/
./autogen.sh
./configure
make
make install

> nthits --help

# nthits --help
Usage: nthits [OPTION]... FILES...
Reports the most frequent k-mers in FILES(>=1).
Accepatble file formats: fastq, fasta, gz, bz, zip.

Options:
-t, --threads=N use N parallel threads [16]
-k, --kmer=N the length of kmer [64]
-c, --cutoff=N the maximum coverage of kmer in output
-p, --pref=STRING the prefix for output file name [repeat]
--outbloom output the most frequent k-mers in a Bloom filter
--solid output the solid k-mers (non-errornous k-mers)
--help display this help and exit
--version output version information and exit

Main program (GitHub)

#nthits
git clone https://github.com/bcgsc/ntHits.git
cd ntHits/
./autogen.sh
./configure
make 
sudo make install 

#ntEdit
git clone https://github.com/bcgsc/ntEdit.git
cd ntEdit
make ntEdit

#conda (untested)
mamba create -n ntedit -y
conda activate ntedit
mamba install -c bioconda ntedit nthits -y

> ./ntedit --help

# ./ntedit --help
ntEdit v1.2.0
Scalable genome sequence polishing.

Options:
-t, number of threads [default=1]
-f, Draft genome assembly (FASTA, Multi-FASTA, and/or gzipped compatible), REQUIRED
-r, Bloom filter file (generated from ntHits), REQUIRED
-b, output file prefix, OPTIONAL
-k, kmer size, REQUIRED
-z, minimum contig length [default=100]
-i, maximum number of insertion bases to try, range 0-5, [default=4]
-d, maximum number of deletions bases to try, range 0-5, [default=5]
-x, k/x ratio for the number of kmers that should be missing, [default=5.000]
-y, k/y ratio for the number of editted kmers that should be present, [default=9.000]
-c, cap for the number of base insertions that can be made at one position, [default=k*1.5]
-m, mode of editing, range 0-2, [default=0]
    0: best substitution, or first good indel
    1: best substitution, or best indel
    2: best edit overall (suggestion that you reduce i and d for performance)
-v, verbose mode (-v 1 = yes, default = 0, no)
--help, display this message and exit
--version, output version information and exit

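Usage

1. Specify the short reads. Paired-end sequencing reads are given here. Data with at least 30x coverage is recommended (-c 2 (>=20X)), but lower-coverage data can also be handled by changing the -c (--cutoff=N) flag (see the GitHub page). The paper uses k=25 and k=20.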

nthits --outbloom -p solidBF -k 25 -t 24 pair_R1.fq.gz pair_R2.fq.gz

-c the maximum coverage of kmer in output
-k the length of kmer [64]
-t use N parallel threads [16]
-p the prefix for output file name [repeat]

This writes solidBF_k25.bf.
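
To judge whether a read set meets the coverage guideline in step 1, fold coverage can be estimated as total read bases divided by genome size. The sketch below is an assumption-laden helper, not part of ntHits or ntEdit: the function name `estimate_coverage` is made up, it expects uncompressed 4-line FASTQ records on standard input, and the genome size has to be supplied by the user.

```shell
#!/bin/sh
set -eu

# Estimate fold coverage = total read bases / genome size (bp).
# Assumes plain 4-line FASTQ records on stdin (sequence on every 2nd line of 4).
# usage: estimate_coverage GENOME_SIZE_BP < reads.fastq
estimate_coverage() {
    awk -v g="$1" 'NR % 4 == 2 { bases += length($0) }
                   END { printf "%.1f\n", bases / g }'
}

# Tiny demo: two 10-base reads over a hypothetical 10 bp genome.
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' |
    estimate_coverage 10    # prints 2.0
```

For the gzipped pair used above, something like `gzip -dc pair_R1.fq.gz pair_R2.fq.gz | estimate_coverage 4000000` would apply, where 4000000 is a placeholder genome size.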

2. Specify the draft assembly and run polishing. 20 threads are used here.

ntedit -f draft_assembly.fa -r solidBF_k25.bf -k 25 -b output -t 20

-t number of threads [default=1]
-f Draft genome assembly (FASTA, Multi-FASTA, and/or gzipped compatible), REQUIRED
-r Bloom filter file (generated from ntHits), REQUIRED
-k kmer size, REQUIRED
-b output file prefix, OPTIONAL

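When the run finishes, it writes output_changes.tsv, which lists the positions of the errors, and the polished assembly output_edited.fa.

The job finishes very quickly. Polishing a 4 Mbp long-read draft assembly, used as a small data set, took about 2 seconds (*1).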
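
The two steps can be chained in a small wrapper so that the same k is passed to both tools. This is only a sketch under assumptions: the `run` helper and the `DRY_RUN` variable are made up here, the file names are the ones used in this post, and the Bloom filter name is assumed to follow the `<prefix>_k<k>.bf` pattern seen in step 1. Only flags shown in the help output above are used, including -k, which the ntedit help lists as REQUIRED.

```shell
#!/bin/sh
set -eu

K=25                      # k-mer size, shared by nthits and ntedit
THREADS=20
R1=pair_R1.fq.gz
R2=pair_R2.fq.gz
DRAFT=draft_assembly.fa
PREFIX=output

# Print the command when DRY_RUN=1 (the default in this sketch); run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: build the Bloom filter from the short reads.
run nthits --outbloom -p solidBF -k "$K" -t "$THREADS" "$R1" "$R2"

# Step 2: polish the draft with that Bloom filter (assumed name: solidBF_k<K>.bf).
run ntedit -f "$DRAFT" -r "solidBF_k${K}.bf" -k "$K" -b "$PREFIX" -t "$THREADS"
```

Saved under a name of your choice, the script prints both commands by default; setting `DRY_RUN=0` in the environment executes them instead.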

Citation

ntEdit: scalable genome assembly polishing
René L Warren, Lauren Coombe, Hamid Mohamadi, Jessica Zhang, Barry Jaquish, Nathalie Isabel, Steven JM Jones, Jean Bousquet, Joerg Bohlmann, Inanç Birol

bioRxiv preprint first posted online Mar. 26, 2019

ntEdit: scalable genome sequence polishing
René L Warren, Lauren Coombe, Hamid Mohamadi, Jessica Zhang, Barry Jaquish, Nathalie Isabel, Steven J M Jones, Jean Bousquet, Joerg Bohlmann, Inanç Birol
Bioinformatics, btz400, Published: 16 May 2019

*1 Run using all CPU resources of an old Mac Pro (2012) with dual X5690 processors. The short reads were paired-end FASTQ files, 1 GB x 2.