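MEGAHIT: a fast, memory-efficient metagenome assembly tool

2019 5/6 installation instructions revised; 5/7 parameters added; 5/15 title revised; 5/19 link added; 5/17 title revised again, parameters revised; 6/3 comment added; 7/27 conda installation added

2021 5/7 help and installation steps updated; 7/21 link added

2022 2/19 note added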

2024/03/21 v1.2.2 0 => v1.2.9

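Next-generation sequencing technologies have opened new opportunities to study metagenomics and to understand diverse microbial communities such as those in the human gut, the animal rumen, and soil. Because reference genomes are lacking, de novo assembly of metagenomics data is a valuable and almost unavoidable step in metagenomic analysis (Qin et al., 2010). However, this step is constrained by heavy computational resource requirements, particularly for the large and complex datasets encountered in environmental metagenomics (Howe et al., 2014). The soil metagenomics dataset recently published by Howe et al. contains 252 Gbp even after low-quality bases are trimmed. That dataset was assembled successfully with preprocessing steps that included partitioning and digital normalization (introduction). At present, no de novo assembler can assemble the dataset as a whole in computer memory. The estimated memory requirement of SOAPdenovo2 (Luo et al., 2012) and IDBA-UD (Peng et al., 2012) for assembling the soil metagenome data is at least 4 TB. As the volume of metagenomics data keeps growing, the authors developed MEGAHIT, an assembler that can assemble large and complex metagenomics data in a time- and cost-efficient way, especially on a single-node server (at the time the paper was written, a current 2-socket server held at most 768 GB of memory).

MEGAHIT workflow. Reproduced from the paper.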

wiki

https://github.com/voutcn/megahit/wiki

Installation

Tested on macOS and Ubuntu 18.04.

Dependencies

zlib
python 2.6 or greater
G++ 4.4 or greater

Main program: GitHub

#linux binary v1.2.9
wget https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
cd MEGAHIT-1.2.9-Linux-x86_64-static/bin/

#test run
./megahit --test # run on a toy dataset

#docker image
docker run -v $(pwd):/workspace -w /workspace --user $(id -u):$(id -g) vout/megahit \
megahit -1 YOUR_PE_READ_1.gz -2 YOUR_PE_READ_2.fq.gz -o YOUR_OUTPUT_DIR


#from source
git clone https://github.com/voutcn/megahit.git
cd megahit
git submodule update --init
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release # add -DCMAKE_INSTALL_PREFIX=YOUR_PREFIX if needed
make -j4
make simple_test # will test MEGAHIT with a toy dataset
# make install if needed

#bioconda (link)
mamba create -n megahit -y
conda activate megahit
mamba install -c bioconda -y megahit


$ megahit

megahit: MEGAHIT v1.2.9

contact: Dinghua Li <voutcn@gmail.com>

Usage:

megahit [options] {-1 <pe1> -2 <pe2> | --12 <pe12> | -r <se>} [-o <out_dir>]

Input options that can be specified for multiple times (supporting plain text and gz/bz2 extensions)

-1 <pe1> comma-separated list of fasta/q paired-end #1 files, paired with files in <pe2>

-2 <pe2> comma-separated list of fasta/q paired-end #2 files, paired with files in <pe1>

--12 <pe12> comma-separated list of interleaved fasta/q paired-end files

-r/--read <se> comma-separated list of fasta/q single-end files

Optional Arguments:

Basic assembly options:

--min-count <int> minimum multiplicity for filtering (k_min+1)-mers [2]

--k-list <int,int,..> comma-separated list of kmer size

all must be odd, in the range 15-255, increment <= 28)

[21,29,39,59,79,99,119,141]

Another way to set --k-list (overrides --k-list if one of them set):

--k-min <int> minimum kmer size (<= 255), must be odd number [21]

--k-max <int> maximum kmer size (<= 255), must be odd number [141]

--k-step <int> increment of kmer size of each iteration (<= 28), must be even number [12]

Advanced assembly options:

--no-mercy do not add mercy kmers

--bubble-level <int> intensity of bubble merging (0-2), 0 to disable [2]

--merge-level <l,s> merge complex bubbles of length <= l*kmer_size and similarity >= s [20,0.95]

--prune-level <int> strength of low depth pruning (0-3) [2]

--prune-depth <int> remove unitigs with avg kmer depth less than this value [2]

--disconnect-ratio <float> disconnect unitigs if its depth is less than this ratio times

the total depth of itself and its siblings [0.1]

--low-local-ratio <float> remove unitigs if its depth is less than this ratio times

the average depth of the neighborhoods [0.2]

--max-tip-len <int> remove tips less than this value [2*k]

--cleaning-rounds <int> number of rounds for graph cleanning [5]

--no-local disable local assembly

--kmin-1pass use 1pass mode to build SdBG of k_min

Presets parameters:

--presets <str> override a group of parameters; possible values:

meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141'

meta-large: '--k-min 27 --k-max 127 --k-step 10'

(large & complex metagenomes, like soil)

Hardware options:

-m/--memory <float> max memory in byte to be used in SdBG construction

(if set between 0-1, fraction of the machine's total memory) [0.9]

--mem-flag <int> SdBG builder memory mode. 0: minimum; 1: moderate;

others: use all memory specified by '-m/--memory' [1]

-t/--num-cpu-threads <int> number of CPU threads [# of logical processors]

--no-hw-accel run MEGAHIT without BMI2 and POPCNT hardware instructions

Output options:

-o/--out-dir <string> output directory [./megahit_out]

--out-prefix <string> output prefix (the contig file will be OUT_DIR/OUT_PREFIX.contigs.fa)

--min-contig-len <int> minimum length of contigs to output [200]

--keep-tmp-files keep all temporary files

--tmp-dir <string> set temp directory

Other Arguments:

--continue continue a MEGAHIT run from its last available check point.

please set the output directory correctly when using this option.

--test run MEGAHIT on a toy test dataset

-h/--help print the usage message

-v/--version print version

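To copy MEGAHIT into a directory on your PATH, move all of the following files. Binary names differ between releases, so check the contents of bin/ for your version first.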

sudo cp megahit megahit_asm_core megahit_toolkit megahit_sdbg_build /usr/local/bin/

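Several Docker containers are also available (link). Binaries can also be downloaded from the releases page (link).

MEGAHIT also supports OpenMP and, depending on the version, CUDA (the v1.2.9 help above lists no GPU option). See GitHub for details.

Usage

Run with paired-end FASTQ files. The later parameters are optional.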

megahit -1 pair1.fq -2 pair2.fq -o output --k-min 21 --k-max 141 --k-step 12 -t 40

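Interleaved paired-end reads are given with "--12". k must be odd, and the k step must be even. For very complex metagenomes such as soil, a larger minimum k such as 27 is recommended to simplify the de Bruijn graph (the same applies to high-coverage data). --min-count is also important. When the target is a genome with sufficient coverage (>40x), quality trimming and a higher --k-min are recommended. On the other hand, even when coverage is insufficient, a cut-off below 2 is discouraged because it picks up sequencing errors at high frequency.

Single-end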
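The k constraints above can be checked before a long run. The following is a minimal sketch, not MEGAHIT's own code: `klist` is a hypothetical helper, and it assumes that --k-min/--k-max/--k-step expand to k-min, k-min+step, ... with k-max appended as the last value. The limits (odd k, 15-255, even step of at most 28) are taken from the help output.

```shell
# klist KMIN KMAX KSTEP
# Hypothetical helper (not part of MEGAHIT): validates the three values
# against the limits printed by `megahit --help` and prints the k list.
# The ( ) body runs in a subshell, so variables stay local and `exit`
# only leaves the function.
klist() (
  kmin=$1; kmax=$2; kstep=$3
  if [ $((kmin % 2)) -eq 0 ] || [ $((kmax % 2)) -eq 0 ]; then
    echo "error: k-min and k-max must be odd" >&2; exit 1
  fi
  if [ "$kmin" -lt 15 ] || [ "$kmax" -gt 255 ] || [ "$kmin" -gt "$kmax" ]; then
    echo "error: need 15 <= k-min <= k-max <= 255" >&2; exit 1
  fi
  if [ $((kstep % 2)) -ne 0 ] || [ "$kstep" -lt 2 ] || [ "$kstep" -gt 28 ]; then
    echo "error: k-step must be even and in the range 2-28" >&2; exit 1
  fi
  # Assumption: the list steps up from k-min and always ends with k-max.
  out=""
  k=$kmin
  while [ "$k" -lt "$kmax" ]; do
    out="${out}${k},"
    k=$((k + kstep))
  done
  echo "${out}${kmax}"
)
```

For example, `klist 21 141 12` prints the list for the default settings, and the result can be passed on as `--k-list "$(klist 27 127 10)"`.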

megahit -r single.fq -o output --k-min 21 --k-max 141 --k-step 12 -t 40

Using merged.fq.

https://www.biostars.org/p/328994/

If a run that includes merged.fq fails with a memory error at a low k-mer size, raising the value slightly works as a stopgap.

megahit -1 pair1.fq -2 pair2.fq -o output --k-min 31 --k-max 121 --k-step 10 -t 40

Sensitive mode

#meta-sensitive
megahit -1 pair1.fq -2 pair2.fq -o output -t 40 --presets meta-sensitive

#meta-large (large & complex metagenomes, like soil)
megahit -1 pair1.fq -2 pair2.fq -o output -t 40 --presets meta-large

--presets <str> override a group of parameters; possible values:
meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141'
meta-large: '--k-min 27 --k-max 127 --k-step 10' (large & complex metagenomes, like soil)

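To resume an interrupted job, set the --continue flag and specify the output directory of that run. As long as the program version is the same, this should also work after moving the run to another machine (for example, when memory is insufficient).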

megahit -o output --continue
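A small wrapper can choose between a fresh start and a resume. This is only a sketch: `build_megahit_cmd` is a hypothetical helper that prints the command instead of running it, and it assumes that an existing output directory always belongs to an earlier, interrupted run of the same job.

```shell
# build_megahit_cmd OUTDIR READ1 READ2
# Hypothetical helper: prints the megahit command to run next.
# Assumption: if OUTDIR already exists, it holds checkpoints from an
# earlier run of this job, so the run is resumed with --continue.
build_megahit_cmd() (
  outdir=$1; r1=$2; r2=$3
  if [ -d "$outdir" ]; then
    # Resume from the last available checkpoint in OUTDIR.
    echo "megahit -o $outdir --continue"
  else
    # Fresh run: OUTDIR does not exist yet.
    echo "megahit -1 $r1 -2 $r2 -o $outdir"
  fi
)
```

Printing the command first makes it easy to confirm that the directory is the right one before a multi-day run is resumed.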

--continue　 continue a MEGAHIT run from its last available check point. please set the output directory correctly when using this option.

The de Bruijn graph can be visualized with Bandage (details).

megahit_toolkit contig2fastg 99 k99.contigs.fa > k99.fastg
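To convert the contigs of several k values at once, the k value can be read from the file name. The sketch below only prints the commands. `fastg_cmds` is a hypothetical helper, and it assumes the per-k contig files follow the `k<k>.contigs.fa` naming seen in the command above and sit together in one directory (for example an intermediate contig directory inside the output directory).

```shell
# fastg_cmds DIR
# Hypothetical helper: prints one contig2fastg command for every
# k<k>.contigs.fa file in DIR.
fastg_cmds() (
  dir=$1
  for f in "$dir"/k*.contigs.fa; do
    [ -e "$f" ] || continue             # the glob matched nothing
    base=$(basename "$f" .contigs.fa)   # e.g. k99
    k=${base#k}                         # e.g. 99
    # Skip names whose middle part is not a plain number.
    case $k in
      ''|*[!0-9]*) continue ;;
    esac
    echo "megahit_toolkit contig2fastg $k $f > $dir/$base.fastg"
  done
)
```

After checking the printed lines, they can be piped to `sh` to run them.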

There is also a real-world workflow using public SRA data (de novo assembly followed by coverage calculation).

https://github.com/voutcn/megahit/wiki/An-example-of-real-assembly

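It runs very fast even without a GPU.

Addendum

In general, iterating the assembly from short to long k-mers with a small k-mer step increases the total contig size. If merged.fastq is also included, raising the maximum k-mer beyond 200 increases the total contig size further. However, contigs extended to the limit in this way are not necessarily the best. Assemble subsampled data with several parameter settings and evaluate the results with MetaQUAST or a similar tool (most genomes are far from any reference, so this only evaluates a few strains' worth of genomes, but it is still far better than nothing).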
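The comparison described above can be laid out as a dry run. This sketch only prints commands. `sweep_cmds` is a hypothetical helper, the three parameter sets are arbitrary examples, the subsampled read files are assumed to exist already, and each run is assumed to write its contigs to `final.contigs.fa` in its output directory (the help only states OUT_DIR/OUT_PREFIX.contigs.fa).

```shell
# sweep_cmds READ1 READ2
# Hypothetical helper: prints one megahit command per parameter set,
# followed by a single metaquast.py command that compares the contigs.
sweep_cmds() (
  r1=$1; r2=$2
  contigs=""
  # Each entry is "label:extra megahit options" (example values only).
  for entry in \
    "default:" \
    "k27:--k-min 27 --k-max 127 --k-step 10" \
    "sensitive:--presets meta-sensitive"
  do
    label=${entry%%:*}
    opts=${entry#*:}
    # ${opts:+ $opts} adds the options only when they are non-empty.
    echo "megahit -1 $r1 -2 $r2 -o asm_$label${opts:+ $opts}"
    # Assumption: default output prefix, i.e. final.contigs.fa.
    contigs="$contigs asm_$label/final.contigs.fa"
  done
  echo "metaquast.py$contigs -o metaquast_out"
)
```

Calling `sweep_cmds sub_1.fq sub_2.fq` prints three assembly commands and one evaluation command, which can be reviewed and then run one by one.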

2022/02/19

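In a run on 250 GB x 2 gzipped FASTQ (about 1.5 TB uncompressed), k=21 finished in 3.8 days, but the following k=33 assembly ran out of memory and crashed (peak memory 523 GB; the data came from a highly diverse sample). => Afterwards, removing low-abundance sequences with bbnorm.sh from BBTools reduced the input to 200 GB x 2 gzipped FASTQ, and the run finished with a peak memory of 475 GB and a runtime of 7 days and 1 hour (53155 min user CPU time).

References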
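A bbnorm.sh call of the kind mentioned above might look like the following sketch. The settings actually used for this run are not recorded here, so `target=100` and `min=5` are placeholder values, and `bbnorm_cmd` is a hypothetical helper that prints the command instead of running it.

```shell
# bbnorm_cmd READ1 READ2 [TARGET] [MIN]
# Hypothetical helper: prints a bbnorm.sh command for a read pair.
# TARGET (normalization depth) and MIN (reads below this depth are
# discarded) default to placeholder values, not the values used above.
bbnorm_cmd() (
  r1=$1; r2=$2
  target=${3:-100}
  min=${4:-5}
  # Write the outputs to the current directory with a norm_ prefix.
  out1="norm_$(basename "$r1")"
  out2="norm_$(basename "$r2")"
  echo "bbnorm.sh in=$r1 in2=$r2 out=$out1 out2=$out2 target=$target min=$min"
)
```

The normalized files then replace the original reads in the `-1` and `-2` arguments of megahit.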

MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

Li D, Luo R, Liu CM, Leung CM, Ting HF, Sadakane K, Yamashita H, Lam TW

Methods. 2016 Jun 1;102:3-11.

MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.

Li D, Liu CM, Luo R, Sadakane K, Lam TW

Bioinformatics. 2015 May 15;31(10):1674-6.