MetaBAT - macでインフォマティクス

Updated 2019-08-28

2019-09-30: added a link to an introduction to metabat2

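High-throughput metagenomic shotgun sequencing is a powerful tool for directly studying microbial communities sampled from the environment. It removes the need for cultivation and avoids the biases that cultivation can introduce. When short metagenomic shotgun reads are assembled, short-read assemblers build large genome fragments (contigs) but often fail to produce full-length genomes (Pevzner & Tang, 2001; Pevzner, Tang & Waterman, 2001). Predicting draft genomes by binning metagenomic contigs offers an alternative to full-length genomes (Mande, Mohammed & Ghosh, 2012; Mavromatis et al., 2007). Although fragmented, these bins often carry a nearly complete gene set of an individual species (or of a "population genome" that represents the consensus sequence of different strains (Imelfort et al., 2014)), and so they approximate a draft genome.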

Two kinds of metagenome binning approaches have been developed (Mande, Mohammed & Ghosh, 2012). Supervised binning uses known genomes as references and relies on sequence homology or sequence composition similarity for binning (Krause et al., 2008; Wu & Eisen, 2008). This approach works poorly on environmental samples in which many microbes have no closely related species with a known genome. In contrast, unsupervised approaches bin by distinguishable sequence composition (Teeling et al., 2004b; Yang et al., 2010), by co-abundance of species (or genome fragments) (Cotillard et al., 2013; Le Chatelier et al., 2013; Nielsen et al., 2014; Qin et al., 2012; Wu & Ye, 2011), or by both (Albertsen et al., 2013 (introduced in a separate post); Alneberg et al., 2014; Imelfort et al., 2014; Sharon et al., 2013; Wrighton et al., 2012; Wu et al., 2014). Recent studies have shown that, when many samples are available, species co-abundance features are very effective for deconvoluting complex communities (Albertsen et al., 2013; Alneberg et al., 2014; Cotillard et al., 2013; Imelfort et al., 2014; Karlsson et al., 2013; Le Chatelier et al., 2013; Nielsen et al., 2014; Sharon et al., 2013). More recently, a few fully automated binning methods such as CONCOCT (Alneberg et al., 2014) and GroopM (Imelfort et al., 2014) have also been reported.

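Many of the tools above are not suited to large metagenomic datasets. In this study, MetaBAT (Metagenome Binning with Abundance and Tetra-Nucleotide Frequency) was developed: an efficient, fully automated software tool that can bin millions of contigs from thousands of samples. By using a new statistical framework that combines tetranucleotide frequency (TNF) with contig abundance probabilities, MetaBAT was shown to produce high-quality genome bins.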

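As a prerequisite for binning, the user must create BAM files by mapping each sample's reads separately to the assembled metagenome (steps 1-3 of Figure 1 in the paper; link). MetaBAT accepts an assembly file (FASTA format, required) and sorted BAM files (one per sample, optional) as input. For each pair of contigs in the metagenome assembly, MetaBAT computes probabilistic distances based on tetranucleotide frequency (TNF) and on abundance (that is, mean base coverage), then integrates the two distances into a single composite distance. All pairwise distances form a matrix, which is fed to a modified k-medoid clustering algorithm that bins the contigs iteratively and exhaustively into genome bins (Figure 1 in the paper).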
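To make the TNF feature more concrete, here is a minimal sketch that counts overlapping tetranucleotides per contig with awk. It is only an illustration of what a tetranucleotide count is. It is not MetaBAT's implementation, which turns such frequencies into probabilistic distances; `tnf_counts` and the demo sequence are hypothetical.

```shell
#!/bin/sh
# Illustrative only: raw overlapping 4-mer counts per contig.
# NOT MetaBAT's implementation; tnf_counts is a hypothetical helper.
set -eu

tnf_counts() {
  # $1: FASTA file. Output: contig <TAB> tetramer <TAB> count, sorted.
  awk '
    function emit(   i, n, k) {
      if (name == "") return
      n = length(seq)
      for (i = 1; i + 3 <= n; i++) {
        k = substr(seq, i, 4)
        # count only windows made of A, C, G, T (skips N and other symbols)
        if (k ~ /^[ACGT][ACGT][ACGT][ACGT]$/) cnt[k]++
      }
      for (k in cnt) printf "%s\t%s\t%d\n", name, k, cnt[k]
      split("", cnt)   # portable way to clear the array
      seq = ""
    }
    /^>/ { emit(); name = substr($1, 2); next }
    { seq = seq toupper($0) }
    END { emit() }
  ' "$1" | sort
}

# Tiny demo input (made-up sequence)
demo=$(mktemp)
printf '>contig_1\nACGTACGT\n' > "$demo"
tnf_counts "$demo"
rm -f "$demo"
```

Contigs with similar tetranucleotide profiles are more likely to come from the same genome, which is why TNF is combined with coverage when distances are computed.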

MetaBAT overview. Reproduced from Figure 1 of the paper.

Installation

Tested on macOS 10.13.

Dependencies

boost >= 1.55.0
python >= 2.7
scons >= 2.1.0
g++ >= 4.9
zlib >= 1.2.4
binutils >= 2.2.2

Source: BitBucket

#In an Anaconda environment it can be installed with conda. Version 2 is installed (paper).
conda install -c ursky metabat2

#It can also be run with docker.
docker pull metabat/metabat
docker run metabat/metabat:latest runMetaBat.sh

> runMetaBat.sh

$ runMetaBat.sh

/Users/user/.pyenv/versions/anaconda2-4.2.0/bin/runMetaBat.sh <select metabat options> assembly.fa sample1.bam [ sample2.bam ...]

You can specify any metabat options EXCEPT:

-i --inFile

-o --outFile

-a --abdFile

For full metabat options: metabat2 -h

——

> metabat2

$ metabat2

MetaBAT: Metagenome Binning based on Abundance and Tetranucleotide frequency (version 2.12.1; Aug 31 2017 21:02:54)

by Don Kang (ddkang@lbl.gov), Feng Li, Jeff Froula, Rob Egan, and Zhong Wang (zhongwang@lbl.gov)

Allowed options:

-h [ --help ] produce help message

-i [ --inFile ] arg Contigs in (gzipped) fasta file format [Mandatory]

-o [ --outFile ] arg Base file name and path for each bin. The default output is fasta format.

Use -l option to output only contig names [Mandatory].

-a [ --abdFile ] arg A file having mean and variance of base coverage depth (tab delimited;

the first column should be contig names, and the first row will be

considered as the header and be skipped) [Optional].

-m [ --minContig ] arg (=2500) Minimum size of a contig for binning (should be >=1500).

--maxP arg (=95) Percentage of 'good' contigs considered for binning decided by connection

among contigs. The greater, the more sensitive.

--minS arg (=60) Minimum score of a edge for binning (should be between 1 and 99). The

greater, the more specific.

--maxEdges arg (=200) Maximum number of edges per node. The greater, the more sensitive.

--pTNF arg (=0) TNF probability cutoff for building TNF graph. Use it to skip the

preparation step. (0: auto).

--noAdd Turning off additional binning for lost or small contigs.

--cvExt When a coverage file without variance (from third party tools) is used

instead of abdFile from jgi_summarize_bam_contig_depths.

-x [ --minCV ] arg (=1) Minimum mean coverage of a contig in each library for binning.

--minCVSum arg (=1) Minimum total effective mean coverage of a contig (sum of depth over

minCV) for binning.

-s [ --minClsSize ] arg (=200000) Minimum size of a bin as the output.

-t [ --numThreads ] arg (=0) Number of threads to use (0: use all cores).

-l [ --onlyLabel ] Output only sequence labels as a list in a column without sequences.

--saveCls Save cluster memberships as a matrix format

--unbinned Generate [outFile].unbinned.fa file for unbinned contigs

--noBinOut No bin output. Usually combined with --saveCls to check only contig

memberships

--seed arg (=0) For exact reproducibility. (0: use random seed)

-d [ --debug ] Debug output

-v [ --verbose ] Verbose output

[Error!] There was no --inFile specified

[Error!] There was no --outFile specified

> metabat1

$ metabat1

MetaBAT: Metagenome Binning based on Abundance and Tetranucleotide frequency (version 0.32.5; Aug 31 2017 21:02:53)

by Don Kang (ddkang@lbl.gov), Jeff Froula, Rob Egan, and Zhong Wang (zhongwang@lbl.gov)

Allowed options:

-h [ --help ] produce help message

-i [ --inFile ] arg Contigs in (gzipped) fasta file format [Mandatory]

-o [ --outFile ] arg Base file name for each bin. The default output is fasta format. Use -l

option to output only contig names [Mandatory]

-a [ --abdFile ] arg A file having mean and variance of base coverage depth (tab delimited;

the first column should be contig names, and the first row will be

considered as the header and be skipped) [Optional]

--cvExt When a coverage file without variance (from third party tools) is used

instead of abdFile from jgi_summarize_bam_contig_depths

-p [ --pairFile ] arg A file having paired reads mapping information. Use it to increase

sensitivity. (tab delimited; should have 3 columns of contig index

(ordered by), its mate contig index, and supporting mean read coverage.

The first row will be considered as the header and be skipped) [Optional]

--p1 arg (=0) Probability cutoff for bin seeding. It mainly controls the number of

potential bins and their specificity. The higher, the more (specific)

bins would be. (Percentage; Should be between 0 and 100)

--p2 arg (=0) Probability cutoff for secondary neighbors. It supports p1 and better be

close to p1. (Percentage; Should be between 0 and 100)

--minProb arg (=0) Minimum probability for binning consideration. It controls sensitivity.

Usually it should be >= 75. (Percentage; Should be between 0 and 100)

--minBinned arg (=0) Minimum proportion of already binned neighbors for one's membership

inference. It contorls specificity. Usually it would be <= 50

(Percentage; Should be between 0 and 100)

--verysensitive For greater sensitivity, especially in a simple community. It is the

shortcut for --p1 90 --p2 85 --pB 20 --minProb 75 --minBinned 20

--minCorr 90

--sensitive For better sensitivity [default]. It is the shortcut for --p1 90 --p2 90

--pB 20 --minProb 80 --minBinned 40 --minCorr 92

--specific For better specificity. Different from --sensitive when using correlation

binning or ensemble binning. It is the shortcut for --p1 90 --p2 90 --pB

30 --minProb 80 --minBinned 40 --minCorr 96

--veryspecific For greater specificity. No correlation binning for short contig

recruiting. It is the shortcut for --p1 90 --p2 90 --pB 40 --minProb 80

--minBinned 40

--superspecific For the best specificity. It is the shortcut for --p1 95 --p2 90 --pB 50

--minProb 80 --minBinned 20

--minCorr arg (=0) Minimum pearson correlation coefficient for binning missed contigs to

increase sensitivity (Helpful when there are many samples). Should be

very high (>=90) to reduce contamination. (Percentage; Should be between

0 and 100; 0 disables)

--minSamples arg (=10) Minimum number of sample sizes for considering correlation based

recruiting

-x [ --minCV ] arg (=1) Minimum mean coverage of a contig to consider for abundance distance

calculation in each library

--minCVSum arg (=2) Minimum total mean coverage of a contig (sum of all libraries) to

consider for abundance distance calculation

-s [ --minClsSize ] arg (=200000) Minimum size of a bin to be considered as the output

-m [ --minContig ] arg (=2500) Minimum size of a contig to be considered for binning (should be >=1500;

ideally >=2500). If # of samples >= minSamples, small contigs (>=1000)

will be given a chance to be recruited to existing bins by default.

--minContigByCorr arg (=1000) Minimum size of a contig to be considered for recruiting by pearson

correlation coefficients (activated only if # of samples >= minSamples;

disabled when minContigByCorr > minContig)

-t [ --numThreads ] arg (=0) Number of threads to use (0: use all cores)

--minShared arg (=50) Percentage cutoff for merging fuzzy contigs

--fuzzy Binning with fuzziness which assigns multiple memberships of a contig to

bins (activated only with --pairFile at the moment)

-l [ --onlyLabel ] Output only sequence labels as a list in a column without sequences

-S [ --sumLowCV ] If set, then every sample that falls below the minCV will be used in an

aggregate sample

-V [ --maxVarRatio ] arg (=0) Ignore any contigs where variance / mean exceeds this ratio (0 disables)

--saveTNF arg File to save (or load if exists) TNF matrix for each contig in input

--saveDistance arg File to save (or load if exists) distance graph at lowest probability

cutoff

--saveCls Save cluster memberships as a matrix format

--unbinned Generate [outFile].unbinned.fa file for unbinned contigs

--noBinOut No bin output. Usually combined with --saveCls to check only contig

memberships

-B [ --B ] arg (=20) Number of bootstrapping for ensemble binning (Recommended to be >=20)

--pB arg (=50) Proportion of shared membership in bootstrapping. Major control for

sensitivity/specificity. The higher, the specific. (Percentage; Should be

between 0 and 100)

--seed arg (=0) For reproducibility in ensemble binning, though it might produce slightly

different results. (0: use random seed)

--keep Keep the intermediate files for later usage

-d [ --debug ] Debug output

-v [ --verbose ] Verbose output

[Error!] There was no --inFile specified

[Error!] There was no --outFile specified

——

Usage

Specify the assembled FASTA file and the BAM files produced by mapping reads to that FASTA.

runMetaBat.sh assembly.fasta sample1.bam [sample2.bam ...]

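A directory named assembly.fasta.metabat-bins is created, and the binned FASTA files are written into it. In addition, a text file assembly.fasta.depth.txt, which records the read depth of each contig, is written to the current directory.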
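The depth table can be inspected with standard tools before binning. The sketch below assumes the tab-delimited layout written by jgi_summarize_bam_contig_depths (contigName, contigLen, totalAvgDepth, then a mean and a variance column per BAM). The contig names and values are made up, and `eligible_contigs` is a hypothetical helper. The thresholds loosely echo the `-m` and `--minCVSum` defaults from the help text and do not reproduce MetaBAT's internal filtering.

```shell
#!/bin/sh
set -eu

# Mock depth table with made-up values (assumed column layout, see above)
depth=$(mktemp)
printf 'contigName\tcontigLen\ttotalAvgDepth\tsample1.bam\tsample1.bam-var\n' > "$depth"
printf 'k141_1\t5200\t12.4\t12.4\t3.1\n' >> "$depth"
printf 'k141_2\t1800\t30.0\t30.0\t9.8\n' >> "$depth"
printf 'k141_3\t9100\t0.6\t0.6\t0.2\n' >> "$depth"

eligible_contigs() {
  # $1: depth file, $2: minimum contig length, $3: minimum total average depth
  # Skips the header row and prints the names of contigs passing both cutoffs.
  awk -F '\t' -v minlen="$2" -v mindepth="$3" \
    'NR > 1 && $2 >= minlen && $3 >= mindepth { print $1 }' "$1"
}

# Contigs that are at least 2500 bp long with total average depth >= 1
eligible_contigs "$depth" 2500 1
```

A quick check like this gives a rough idea of how many contigs are long enough and covered well enough to be worth binning.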

You can also supply the depth file yourself and run with more detailed parameters. Use "-a" to specify the read-depth file produced in the previous step.

metabat2 -i input.fasta -a depth.txt -o output -t 12 -m 1500 -v --unbinned

-i Contigs in (gzipped) fasta file format [Mandatory]
-o Base file name and path for each bin. The default output is fasta format.
Use -l option to output only contig names [Mandatory]
-t (=0) Number of threads to use (0: use all cores).
-m (=2500) Minimum size of a contig for binning (should be >=1500).
-v Verbose output
--unbinned Generate [outFile].unbinned.fa file for unbinned contigs
--saveCls Save cluster memberships as a matrix format
--noBinOut No bin output. Usually combined with --saveCls to check only contig memberships
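Putting the steps together, the whole route from reads to bins might look like the sketch below. It assumes Bowtie 2 and samtools are used for mapping (any mapper that yields sorted BAM files should do), that reads are named `<prefix>_R1.fastq.gz` and `<prefix>_R2.fastq.gz`, and that the index is called `asm_index`; all of these names are hypothetical. By default the script only prints the commands (DRY_RUN=1), so it can be reviewed before anything is executed.

```shell
#!/bin/sh
# Sketch: reads -> sorted BAM -> depth table -> bins.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
set -eu

DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

binning_pipeline() {
  # $1: assembly FASTA, $2: threads, remaining args: sample prefixes
  # (read file naming and the index name are assumptions for illustration)
  asm=$1; threads=$2; shift 2
  run bowtie2-build "$asm" asm_index
  bams=""
  for s in "$@"; do
    run bowtie2 -p "$threads" -x asm_index \
      -1 "${s}_R1.fastq.gz" -2 "${s}_R2.fastq.gz" -S "${s}.sam"
    run samtools sort -@ "$threads" -o "${s}.bam" "${s}.sam"
    bams="$bams ${s}.bam"
  done
  # Word splitting of $bams is intentional (prefixes must not contain spaces)
  run jgi_summarize_bam_contig_depths --outputDepth depth.txt $bams
  run metabat2 -i "$asm" -a depth.txt -o output -t "$threads" -m 1500 -v --unbinned
}

binning_pipeline assembly.fasta 12 sample1 sample2
```

Running it with DRY_RUN=0 requires all four tools on the PATH. When only the default parameters are needed, runMetaBat.sh is the simpler route, since it writes the depth file itself.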