高速なロングリードのアセンブリツール wtdbg2 (Redbean)

2019 4/15 Githubリンクの誤り修正、7/3 名前修正、12/11 論文引用追加、タイトル修正

2022/06/07 インストール手順修正

　デノボシーケンスアセンブリは、比較的短いシーケンシングリードからサンプルゲノムを再構築する。リファレンスゲノムは関心のある領域を欠いている可能性があるため、マッピングベースの分析に失敗することが多い、新種および構造ゲノムの変化の研究に不可欠である。 Pacific Biosciences（PacBio）およびOxford Nanopore Technologies（ONT）による一分子シークエンシング技術の急速な進歩により、10〜100キロベース（kb）のリードを手頃なコストでシークエンスすることができる。このようなロングリードは霊長類の主要なrepetitiveクラスを解決し、アセンブリの連続性を改善するのに役立つ。今日では、いくつかの高品質のアセンブラが開発されたおかげで、ロングリードアセンブリはバクテリアや小さなゲノムにとっては日常的な作業となっている（ref.1-5）。しかし、哺乳類のゲノムの場合、既存のアセンブラには数CPU年かかる場合がある。GoogleまたはAmazonのクラウドをコンピューティングに使用する場合、最大1000ドルかかることがある。（一部略）この問題に対処するために、著者らはwtdbg2を開発した。これは大きなゲノムのためのアセンブリの品質をほとんど妥協することなく何十倍も速い、新しいロングリードアセンブラである。

　Wtdbg2は、概して、重複レイアウトコンセンサスパラダイムに従っている。これは、高速の全対全リードアライメント実装と、フfuzzy-Bruijn graph (FBG) に基づく新しいレイアウトアルゴリズムによって、既存のアセンブラを進歩させる。哺乳類のゲノムの場合、現在のリードオーバーラッパー（ref.7〜9）は、入力のリードを多数の小さなバッチに分割し、バッチ間の全対全アライメントを実行する。この方法では、ファイルのI / Oが繰り返されたり、情報が得られないk-merのインデックス付けやクエリが行われたりすると、計算時間が無駄になる。これらのオーバーラッパーは、大量のメモリを消費する可能性があるので、単一のハッシュテーブルを作成しない。興味深いことに、これは大きな問題ではないはずである。 Wtdbg2は最初にすべてのリードをメモリにロードし、k-merの発生をカウントする。次に、リードを256bpごとに1つのユニットにビンし、k-merが2回以上発生していて値が関連するビンの識別子になっているハッシュテーブルを作成する。例えば、60倍のカバレッジでシーケンシングされたCHM1ヒトゲノムの場合、デフォルト設定ではわずか15億の非ユニークなk-merしかない。rawリード配列をメモリにステージングしてハッシュテーブルを構築すると、ピーク時に250GBかかる。これは、ショートリードのアセンブラのメモリ使用量に相当する。
　Wtdbg2ビンは、配列の次のステップである動的計画法（DP）を高速化するためにシーケンスを読み取る。 256bpビニングでは、DP行列は、１ベース当たりのDP行列よりも65536６（＝ 256×256）倍小さい。これにより、k-merベースまたはベースレベルのDPと比較して、DPが大幅に縮小される。

（一段落省略）
　CANU-1.8、FALCON-180831、Flye-2.3.6、MECAT-180314とともに、5つのデータセットでwtdbg2 v2.3を評価した（Preprint表1）。minimap2を使用して、アセンブリされたコンティグをリファレンスゲノムにアライメントさせ、そして測定基準を収集した。すべてのデータセットで、wtdbg2は最も近い競合の少なくとも4倍の速さだった。 Wtdbg2アセンブリは、リファレンスゲノムをややカバーしないことがあり、それがwtdbg2の弱点だが、その理由はコンティグはduplicationsが少ない傾向があるためである。 1つのコンティグによってカバーされるゲノム領域の割合は、他のアセンブラと比較して2、3パーセントだけ低い。

wtdbg2に関するツイート

Interested in a long-read assembler with the speed of miniasm and comparable consensus accuracy of canu? Try wtdbg2 by my friend Jue Ruan. A human genome in one day. Also available from bioconda. https://t.co/H3CIshjLGY
— Heng Li (@lh3lh3) October 23, 2018

追記

Jue Ruan detailing his brilliant algorithm wtdbg2 (now entitled "redbean"). Uses fuzzy 'k-bins' instead of k-mers for DeBruijn graph assembly #UQwinterschool pic.twitter.com/TpYDGmPSdA
— Martin A. Smith (@martinalexsmith) July 2, 2019

インストール

ubuntu16.04でテストした（ホストOS macos10.12）。

依存

Wtdbg2 only works on 64-bit Linux. To compile, please type make in the source code directory. You can then copy wtdbg2 and wtpoa-cns to your PATH.

本体 Github

git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make

#Anacondaを使っているならcondaで導入できる(linuxのみ)
mamba install -y -c bioconda wtdbg

> ./wtdbg2

WTDBG: De novo assembler for long noisy sequences

Author: Jue Ruan <ruanjue@gmail.com>

Version: 2.3 (20181206)

Usage: wtdbg2 [options] -i <reads.fa> -o <prefix> [reads.fa ...]

Options:

-i <string> Long reads sequences file (REQUIRED; can be multiple),

-o <string> Prefix of output files (REQUIRED),

-t <int> Number of threads, 0 for all cores, [4]

-f Force to overwrite output files

-x <string> Presets, comma delimited,

rsII/rs: -p 21 -S 4 -s 0.05 -L 5000

sequel/sq

nanopore/ont:

(genome size < 1G) -p 0 -k 15 -AS 2 -s 0.05 -L 5000

(genome size >= 1G) -p 19 -AS 2 -s 0.05 -L 5000

corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5

Example: '-e 3 -x ont -S 1' in parsing order, -e will be 3, -S will be 1

-g <number> Approximate genome size (k/m/g suffix allowed) [0]

-X <float> Choose the best <float> depth from input reads(effective with -g) [50]

-L <int> Choose the longest subread and drop reads shorter than <int> (5000 recommended for PacBio) [0]

Negative integer indicate keeping read names, e.g. -5000.

-k <int> Kmer fsize, 0 <= k <= 25, [0]

-p <int> Kmer psize, 0 <= p <= 25, [21]

k + p <= 25, seed is <k-mer>+<p-homopolymer-compressed>

-K <float> Filter high frequency kmers, maybe repetitive, [1000.05]

>= 1000 and indexing >= (1 - 0.05) * total_kmers_count

-E <int> Min kmer frequency, [2]

-S <float> Subsampling kmers, 1/(<-S>) kmers are indexed, [4.00]

-S is very useful in saving memeory and speeding up

please note that subsampling kmers will have less matched length

-l <float> Min length of alignment, [2048]

-m <float> Min matched length by kmer matching, [200]

-A Keep contained reads during alignment

-s <float> Min similarity, calculated by kmer matched length / aligned length, [0.05]

-e <int> Min read depth of a valid edge, [3]

-q Quiet

-v Verbose (can be multiple)

-V Print version information and then exit

--help Show more options

> ./wtpoa-cns -h

WTPOA-CNS: Consensuser for wtdbg using PO-MSA

Author: Jue Ruan <ruanjue@gmail.com>

Version: 2.3

Usage: wtpoa-cns [options]

Options:

-t <int> Number of threads, [4]

-d <string> Reference sequences for SAM input, will invoke sorted-SAM input mode

-u Only process reference regions present in/between SAM alignments

-r Force to use reference mode

-p <string> Similar with -d, but translate SAM into wtdbg layout file

-i <string> Input file(s) *.ctg.lay from wtdbg, +, [STDIN]

Or sorted SAM files when having -d

-o <string> Output files, [STDOUT]

-f Force overwrite

-j <int> Expected max length of node, or say the overlap length of two adjacent units in layout file, [1500] bp

-b <int> Bonus for tri-bases match, [0]

-M <int> Match score, [2]

-X <int> Mismatch score, [-5]

-I <int> Insertion score, [-2]

-D <int> Deletion score, [-4]

-H <float> Homopolymer merge score used in dp-call-cns mode, [-3]

-B <int> Bandwidth, [96]

-W <int> Window size in the middle of the first read for fast align remaining reads, [200]

If $W is negative, will disable fast align, but use the abs($W) as Band align score cutoff

-w <int> Min size of aligned size in window, [$W * 0.5]

In sorted-SAM input mode, -w is the sliding window size [2000]

-A Abort TriPOA when any read cannot be fast aligned, then try POA

-S <int> Shuffle mode, 0: don't shuffle reads, 1: by shared kmers, 2: subsampling. [1]

-R <int> Realignment bandwidth, 0: disable, [16]

-c <int> Consensus mode: 0, run-length; 1, dp-call-cns, [0]

-C <int> Min count of bases to call a consensus base, [3]

-F <float> Min frequency of non-gap bases to call a consensus base, [0.5]

-N <int> Max number of reads in PO-MSA [20]

Keep in mind that I am not going to generate high accurate consensus sequences here

-x <string> Presets,

sam-sr: polishs contigs from short reads mapping, accepts sorted SAM files

shorted for '-w 200 -j 150 -R 0 -b 1 -c 1 -N 50 -rS 2'

-v Verbose

-V Print version information and then exit

> > wtdbg2 -V

$ wtdbg2 -V

wtdbg2 2.3

2019年3月16日現在v2.3が入る。

実行方法

１、ドラフトアセンブリ

wtdbg2 -t 16 -i reads.fa.gz -fo prefix

-i Long reads sequences file (REQUIRED; there can be multiple -i),
-t Number of threads, 0 for all cores, [4]
-f Force to overwrite output files
-o Prefix of output files (REQUIRED),

複数のファイルができる。

f:id:kazumaxneo:20181203123621j:plain

7GBのONTリードを使ったランタイムはジャスト６分だった（*1）。

２、コンセンサスコールを行いfastaファイル出力

wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.lay.fa

7GBのONTリードを使ったランタイムは10分ほどだった（*1）。

ランが終わると"output.ctg.lay.fa"ができる。

引用

Fast and accurate long-read assembly with wtdbg2
Jue Ruan, Heng Li

bioRxiv preprint first posted online Jan. 26, 2019

Fast and accurate long-read assembly with wtdbg2

Jue Ruan, Heng Li
Nature Methods (2019)

mac pro 2012でdockerを使用してラン。thread==16

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

高速なロングリードのアセンブリツール wtdbg2 (Redbean)