rust-mdbg: building minimizer-space de Bruijn graphs for ultra-fast, low-memory assembly

2021 9/17 Paper citation added

2023/08/03 Updated (moved content that had mistakenly been posted under metaMDBG)

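DNA sequencing data continues to move toward longer reads with increasingly low sequencing error rates. The paper focuses on the problem of assembling such reads into genomes. With cutting-edge assembly approaches, for example those based on overlapping reads using minimizer sketches, this poses challenges in both accuracy and computational resources. The paper introduces the concept of minimizer-space sequencing data analysis, in which minimizers, rather than DNA nucleotides, are the atomic tokens of the alphabet. The key idea is to project DNA sequences into ordered lists of minimizers and then enumerate k-min-mers, which are k-mers over a larger alphabet made up of minimizer tokens. The approach, mdBG (minimizer-dBG), achieves orders-of-magnitude improvements in both speed and memory usage over existing methods without much loss of accuracy. mdBG is demonstrated in three use cases: human genome assembly, metagenome assembly, and the representation of large pangenomes. For assembly, mdBG is implemented in software called rust-mdbg, which delivers ultra-fast, low-memory, highly contiguous assembly of PacBio HiFi reads. A human genome was assembled in under 10 minutes using 8 cores and 10 GB of RAM, and 60 Gbp of metagenome reads were assembled in 4 minutes using 1 GB of RAM. For pangenome graphs, the authors represent a collection of 661,405 bacterial genomes as an mdBG, which was not previously possible, and successfully search it (in minimizer-space) for antimicrobial resistance (AMR) genes. Given the rise of long-read sequencing in genomics, metagenomics, and pangenomics, the authors expect these advances to become essential to sequence analysis.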
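To make the k-min-mer idea concrete, here is a minimal Python sketch. It is not rust-mdbg's implementation: the hash function (`blake2b`), the function names, and the decision to ignore reverse complements are all assumptions made for illustration. It selects every l-mer whose hash falls below a density threshold, keeps those minimizers in read order, and then slides a window of k minimizers over that list.

```python
import hashlib


def lmer_hash(lmer: str) -> float:
    """Map an l-mer to a pseudo-random value in [0, 1).

    Illustrative hash only; rust-mdbg uses its own hashing scheme.
    """
    digest = hashlib.blake2b(lmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def minimizers(seq: str, l: int, density: float) -> list[str]:
    """Return, in sequence order, the l-mers whose hash is below `density`."""
    return [
        seq[i:i + l]
        for i in range(len(seq) - l + 1)
        if lmer_hash(seq[i:i + l]) < density
    ]


def k_min_mers(seq: str, k: int, l: int, density: float) -> list[tuple[str, ...]]:
    """Enumerate k-min-mers: every window of k consecutive minimizers."""
    mins = minimizers(seq, l, density)
    # If fewer than k minimizers were selected, the range is empty.
    return [tuple(mins[i:i + k]) for i in range(len(mins) - k + 1)]


# Toy usage with an artificially high density so that a short sequence
# still yields some minimizers (real runs use values such as 0.0008).
toy = "ACGTTGCATGCCGATAGCTAGGCTAACGT"
for kmm in k_min_mers(toy, k=3, l=4, density=0.5):
    print(kmm)
```

Each tuple plays the role of a node in a de Bruijn graph, with an edge wherever two k-min-mers overlap by k-1 minimizers. Because only a small fraction of l-mers is kept, the graph is far smaller than its base-space counterpart.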

Installation

GitHub

git clone https://github.com/ekimb/rust-mdbg.git
cd rust-mdbg/
cargo build --release

#conda
mamba install -c bioconda rust-mdbg -y

Graph simplification also requires gfatools.

> rust-mdbg

rust-mdbg 0.1.0
Original implementation of minimizer-space de Bruijn graphs (mdBG) for genome assembly.

rust-mdbg is an ultra-fast minimizer-space de Bruijn graph (mdBG) implementation, geared towards the assembly of long
and accurate reads such as PacBio HiFi. rust-mdbg is fast because it operates in minimizer-space, meaning that the
reads, the assembly graph, and the final assembly, are all represented as ordered lists of minimizers, instead of
strings of nucleotides. A conversion step then yields a classical base-space representation.

USAGE:
    rust-mdbg [FLAGS] [OPTIONS] <reads>

FLAGS:
        --bf                      Enable Bloom filters
        --debug                   Activate debug mode
        --error-correct           Enable error correction with minimizer-space POA
    -h, --help                    Prints help information
        --hpc                     Homopolymer-compressed (HPC) input
        --reference               Reference genome input
        --restart-from-postcor    Assemble error-corrected reads
    -V, --version                 Prints version information

OPTIONS:
        --correction-threshold <correction-threshold>    POA correction threshold
    -d, --density <density>                              Density threshold for density-based selection scheme
        --distance <distance>                            Distance metric (0: Jaccard, 1: containment, 2: Mash)
    -k, --k <k>                                          k-min-mer length
    -l, --l <l>                                          l-mer (minimizer) length
        --lcp <lcp>                                      Core substring file (enables locally consistent parsing (LCP))
        --lmer-counts <lmer-counts>                      l-mer counts (enables downweighting of frequent l-mers)
        --lmer-counts-max <lmer-counts-max>              Maximum l-mer count threshold
        --lmer-counts-min <lmer-counts-min>              Minimum l-mer count threshold
        --minabund <minabund>                            Minimum k-min-mer abundance
    -n, --n <n>                                          Tuple length for bucketing similar reads
    -p, --prefix <prefix>                                Output prefix for GFA and .sequences files
        --presimp <presimp>                              Pre-simplification (pre-simp) threshold
    -t, --t <t>                                          POA path weight threshold
        --threads <threads>                              Number of threads
        --uhs <uhs>                                      Universal k-mer file (enables universal hitting sets (UHS))

ARGS:
    <reads>    Input file (raw or gzip-/lz4-compressed FASTX)

Test run

#1 assembly
rust-mdbg example/reads-0.00.fa.gz -k 7 --density 0.0008 -l 10 --minabund 2 --prefix example

#2 graph simplification
utils/magic_simplify example
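As a rough guide to what the -k, -l and --density values in step #1 mean, the following back-of-the-envelope Python sketch may help. It assumes that a density d means each l-mer is selected independently with probability d (an idealized uniform-hash model). The 20,000 bp read length is a hypothetical figure chosen for illustration, not a property of the example data.

```python
def expected_minimizers(read_length: int, l: int, density: float) -> float:
    """Expected number of selected l-mers in one read under a uniform-hash model."""
    return density * (read_length - l + 1)


def approx_k_min_mer_span(k: int, l: int, density: float) -> float:
    """Approximate number of bases spanned by k consecutive minimizers.

    Consecutive minimizers are about 1/density bases apart on average, so k of
    them cover roughly (k - 1) gaps plus the length of the last l-mer.
    """
    return (k - 1) / density + l


# Parameters from the test run; the read length is an assumed value.
print(expected_minimizers(read_length=20_000, l=10, density=0.0008))
print(approx_k_min_mer_span(k=7, l=10, density=0.0008))
```

Under these assumptions a 20 kb read carries about 16 minimizers, and one k-min-mer with k = 7 spans about 7,500 bases. This suggests why reads need to be both long and accurate: a single error inside a selected l-mer changes every k-min-mer that contains it.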

Output

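The final output consists of a .gfa file containing the minimizer-space de Bruijn graph (without sequences) and
several .sequences files containing the sequences of the graph nodes.
The to_basespace executable combines both outputs into a .gfa file that includes the sequences (when installed via conda, it is already on the PATH).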
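For a quick sanity check of the graph file, a short Python sketch like the one below can tally the record types in a GFA file. It relies only on the general GFA1 convention that each line is tab-separated and begins with a record type (`S` for segments, `L` for links, `H` for headers). The file name `example.gfa` in the usage comment is an assumption based on the `--prefix example` option, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def summarize_gfa(lines: Iterable[str]) -> Counter:
    """Count GFA records by their first tab-separated field (H, S, L, ...)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        record_type = line.split("\t", 1)[0]
        counts[record_type] += 1
    return counts


# Small in-memory demo: two segments without sequence ("*") and one link.
demo = [
    "H\tVN:Z:1.0\n",
    "S\t0\t*\n",
    "S\t1\t*\n",
    "L\t0\t+\t1\t+\t0M\n",
]
print(summarize_gfa(demo))

# On real output (assumed file name):
# with open("example.gfa") as handle:
#     print(summarize_gfa(handle))
```

Comparing the segment and link counts before and after simplification gives a rough sense of how much the graph was reduced.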

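For better contiguity, use the provided multi-k assembly script, multik (it is a wrapper script, so it is not on the PATH). The script assembles iteratively, starting at k = 10 and continuing up to an automatically determined maximum k. In the command below, the number of threads is set to 20.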

utils/multik reads.fq.gz output_prefix 20

Example output

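rust-mdbg is a modular assembler made up of three components (per the repository):

1. rust-mdbg: performs assembly in minimizer-space.
2. gfatools (external component): performs graph simplification.
3. to_basespace: converts the minimizer-space assembly to base-space.

For convenience, components 2 and 3 are wrapped in the magic_simplify script (step #2 of the test run).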
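The two-step flow from the test run can also be scripted. The Python sketch below only builds and prints the command lines; it does not run them. The function name `build_pipeline` is hypothetical, and the only flags it uses are those shown in the test run and the help output above.

```python
import shlex


def build_pipeline(reads: str, prefix: str, k: int, l: int,
                   density: float, minabund: int) -> list[list[str]]:
    """Return the argv lists for assembly followed by graph simplification."""
    assemble = [
        "rust-mdbg", reads,
        "-k", str(k),
        "--density", str(density),
        "-l", str(l),
        "--minabund", str(minabund),
        "--prefix", prefix,
    ]
    # magic_simplify wraps gfatools and to_basespace; it takes the same prefix.
    simplify = ["utils/magic_simplify", prefix]
    return [assemble, simplify]


for cmd in build_pipeline("example/reads-0.00.fa.gz", "example",
                          k=7, l=10, density=0.0008, minabund=2):
    print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root, assuming rust-mdbg and gfatools are installed.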

Citation

Minimizer-space de Bruijn graphs
Barış Ekim, Bonnie Berger, Rayan Chikhi

bioRxiv, Posted June 10, 2021.

2021 9/17

Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer

Barış Ekim, Bonnie Berger, Rayan Chikhi

Cell Syst. 2021 Sep 14;S2405-4712(21)00332-X

