DAS_Tool: flexibly combining multiple binning methods to output a non-redundant set of bins from a metagenome assembly

2019 10/26: installation notes added

2021 4/29: conversion command added; 5/25, 7/2: commands added

2022/10/23: Fasta_to_Scaffolds2Bin.sh added

2024/05/0: commands updated

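Genome-resolved metagenomics targets the reconstruction of genomes from environmental shotgun DNA sequencing data. Based on genome sequences, the metabolic pathways of individual organisms can be inferred and their lifestyles within microbial communities predicted. The challenge of recovering genomes from complex mixtures of sequence fragments is comparable to assembling jigsaw puzzles from a mixture of many puzzles without knowing how many puzzles there are or what they look like. Not surprisingly, powerful bioinformatics methods are required to achieve the desired result.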

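Early approaches mainly made use of shared GC content and coverage (from the paper, ref.1), but binning contigs from more complex ecosystems required more advanced methods that take sequence composition, such as tetranucleotide frequency, into account (ref.2,3). Sequence composition analysis was implemented with emergent self-organizing maps (ESOM) and successfully extracted genomes from metagenomes. ESOM-based approaches involving user-defined clustering have been widely used to recover draft genomes from many different environments, but they have limitations on complex datasets such as soil and sediment (ref.5,6). A major advance in binning methodology was the recognition that patterns of organism abundance across a sample series serve as a binning signature (ref.7,8).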

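Because the number of microbial reference genomes was very small, phylogenetic profile information was minimal in the early days of the metagenomic era. However, the phylogenetic signal continues to increase as the number of reference genome sequences grows.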

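Current state-of-the-art binners integrate sequence abundance and composition into a single model (ref.9,10,11,12), and some of them additionally use marker genes from reference databases (ref.13,14). Quality assessment of predicted bins in terms of completeness and contamination is essential, and both can be estimated from the frequency of single-copy marker genes (ref.15,16).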

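Existing binning tools are based on widely accepted features and clustering algorithms, and are benchmarked using the datasets analyzed in their respective publications. In fact, most binning methods have been demonstrated on relatively simple communities, for example a premature infant gut dataset (ref.7). When these methods are applied to other samples, however, the value of the resulting bins is uncertain. Here, the authors tested the performance of a set of established binning methods by applying them to data from a range of ecosystems that vary dramatically in complexity. They found that no single approach performed well across all ecosystems (see the figures in the paper). In addition, many incomplete bins and multi-genome megabins were predicted. The variable binning performance, together with the fact that different tools reconstruct different genomes at varying levels of completeness, motivated the development of a strategy that integrates the predictions of multiple binning algorithms.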

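Probst et al. combined and curated the results of three binning methods and increased the total number of near-complete genomes reconstructed from a subsurface aquifer environment compared with any single method (ref.17). An automated binning combination approach was able to reduce bin contamination overall, but it also lowered overall completeness (ref.18). These findings motivated the development of the dereplication, aggregation and scoring tool (DAS Tool). DAS Tool is an automated method that integrates a flexible number of binning algorithms to calculate an optimized, non-redundant set of bins from a single assembly. The authors show that this approach produces more high-quality genomes than any single tool.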

Installation

Dependencies

R (>= 3.2.3): https://www.r-project.org
R-packages: data.table (>= 1.9.6), doMC (>= 1.3.4), ggplot2 (>= 2.1.0)
ruby (>= v2.3.1): https://www.ruby-lang.org
Pullseq (>= 1.0.2): https://github.com/bcthomas/pullseq
Prodigal (>= 2.6.3): https://github.com/hyattpd/Prodigal
coreutils (only macOS/ OS X): https://www.gnu.org/software/coreutils
One of the following search engines:
- USEARCH (>= 8.1): http://www.drive5.com/usearch/download.html
- DIAMOND (>= 0.9.14): https://ab.inf.uni-tuebingen.de/software/diamond
- BLAST+ (>= 2.5.0): https://blast.ncbi.nlm.nih.gov/Blast.cgi
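
A quick way to see which of the dependencies listed above are already on the PATH is sketched below. This is only a rough check and not part of DAS_Tool: the executable names (Rscript, ruby, pullseq, prodigal, usearch, diamond, blastp) are assumptions about how each tool is installed on a typical system, and the script does not verify version numbers or R packages.

```shell
#!/bin/sh
# Return success if a command is available on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line for a command.
report() {
    if have "$1"; then
        echo "found:   $1"
    else
        echo "missing: $1"
    fi
}

# Required tools (executable names are assumptions; adjust to your install).
for cmd in Rscript ruby pullseq prodigal; do
    report "$cmd"
done

# At least one of the search engines is needed.
engine=none
for cmd in usearch diamond blastp; do
    if have "$cmd"; then
        engine=$cmd
        break
    fi
done
echo "search engine: $engine"
```

If the last line prints "search engine: none", one of USEARCH, DIAMOND or BLAST+ still has to be installed (or is installed under a different executable name).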

GitHub

# bioconda
conda install -c bioconda -y das_tool

git clone https://github.com/cmks/DAS_Tool.git
cd DAS_Tool/

# Install R-packages:
R CMD INSTALL ./package/DASTool_1.1.1.tar.gz

# Unzip SCG database:
unzip ./db.zip -d db

> ./DAS_Tool

$ ./DAS_Tool

DAS Tool version 1.1.1

Usage: DAS_Tool -i methodA.scaffolds2bin,...,methodN.scaffolds2bin
                -l methodA,...,methodN -c contigs.fa -o myOutput

   -i, --bins                 Comma separated list of tab separated scaffolds to bin tables.
   -c, --contigs              Contigs in fasta format.
   -o, --outputbasename       Basename of output files.
   -l, --labels               Comma separated list of binning prediction names. (optional)
   --search_engine            Engine used for single copy gene identification [blast/diamond/usearch].
                              (default: usearch)
   --write_bin_evals          Write evaluation for each input bin set [0/1]. (default: 1)
   --create_plots             Create binning performance plots [0/1]. (default: 1)
   --write_bins               Export bins as fasta files [0/1]. (default: 0)
   --write_unbinned           Export unbinned contigs as fasta file. Only has an effect when write_bins==1 [0/1]. (default: 0)
   --proteins                 Predicted proteins in prodigal fasta format (>scaffoldID_geneNo).
                              Gene prediction step will be skipped if given. (optional)
   -t, --threads              Number of threads to use. (default: 1)
   --score_threshold          Score threshold until selection algorithm will keep selecting bins [0..1].
                              (default: 0.5)
   --duplicate_penalty        Penalty for duplicate single copy genes per bin (weight b).
                              Only change if you know what you're doing. [0..3]
                              (default: 0.6)
   --megabin_penalty          Penalty for megabins (weight c). Only change if you know what you're doing. [0..3]
                              (default: 0.5)
   --db_directory             Directory of single copy gene database. (default: install_dir/db)
   --resume                   Use existing predicted single copy gene files from a previous run [0/1]. (default: 0)
   --debug                    Write debug information to log file.
   -v, --version              Print version number and exit.
   -h, --help                 Show this message.

Example 1: Run DAS Tool on binning predictions of MetaBAT, MaxBin, CONCOCT and tetraESOMs. Output files will start with the prefix DASToolRun1:

DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv -l concoct,maxbin,metabat,tetraESOM -c sample_data/sample.human.gut_contigs.fa -o sample_output/DASToolRun1

Example 2: Run DAS Tool again with different parameters. Use the proteins predicted in Example 1 to skip the gene prediction step. Set the number of threads to 2 and score threshold to 0.1. Output files will start with the prefix DASToolRun2:

Please cite: Sieber et al., 2018, Nature Microbiology (https://doi.org/10.1038/s41564-018-0171-1).

Alternatively, use the Docker image.

docker pull shengwei/das_tool
docker run -it shengwei/das_tool

Test run

This uses the MetaBAT, MaxBin, CONCOCT and tetraESOM sample data, with 20 threads specified.

DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,\
sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,\
sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,\
sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv \
-l concoct,maxbin,metabat,tetraESOM \
-c sample_data/sample.human.gut_contigs.fa \
-t 20 \
-o sample_output/DASToolRun1 \
--write_bins 1

Several files are written.

[Image: output files]

[Image: DASToolRun1_DASTool_hqBins.pdf]

[Image: DASToolRun1_DASTool_scores.pdf]

[Image: summary (TSV file displayed in Excel)]

Output files

Summary of output bins including quality and completeness estimates (DASTool_summary.txt).
Scaffolds to bin file of output bins (DASTool_scaffolds2bin.txt).
Quality and completeness estimates of input bin sets, if --write_bin_evals 1 is set ([method].eval).
Plots showing the amount of high quality bins and score distribution of bins per method, if --create_plots 1 is set (DASTool_hqBins.pdf, DASTool_scores.pdf).
Bins in fasta format if --write_bins 1 is set (DASTool_bins).

From the GitHub repository.
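
As a small illustration of working with the scaffolds-to-bin output, the sketch below counts how many contigs ended up in each output bin. It assumes the file is a two-column, tab-separated table with the contig ID in the first column and the bin ID in the second (the same layout as the input tables), and that the real file name carries the output basename as a prefix. The function name contigs_per_bin is made up for this example.

```shell
#!/bin/sh
# Count contigs per bin in a two-column (contig<TAB>bin) table.
# Prints "bin<TAB>count", sorted by bin name.
contigs_per_bin() {
    awk -F'\t' '
        NF >= 2 { n[$2]++ }                      # tally each bin ID
        END { for (b in n) print b "\t" n[b] }
    ' "$1" | LC_ALL=C sort
}

# Usage (file name is an assumption based on the output prefix used above):
# contigs_per_bin sample_output/DASToolRun1_DASTool_scaffolds2bin.txt
```

Bins with very few contigs or an unexpectedly large number of contigs are a reasonable place to start when inspecting the summary table by hand.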

Addendum

The tab-separated TSV file can be created with Fasta_to_Contig2Bin.sh (formerly Fasta_to_Scaffolds2Bin.sh).

git clone https://github.com/cmks/DAS_Tool.git
bash DAS_Tool/src/Fasta_to_Contig2Bin.sh -i /maxbin2_outdir -e fa > maxbin2.tsv
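
To make the table format concrete, here is a minimal sketch of the same kind of conversion: for every FASTA file in a directory, emit one line per contig with the contig ID and the bin name separated by a tab. This is not the code of the bundled script, only an illustration of the idea. The function name fasta_to_contig2bin is hypothetical, and using the file name without its extension as the bin ID is an assumption.

```shell
#!/bin/sh
# Print "contigID<TAB>binID" for every sequence in every <dir>/*.<ext> file.
# The bin ID is taken from the file name without its extension (assumption).
fasta_to_contig2bin() {
    dir=$1
    ext=$2
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue                  # skip if the glob matched nothing
        bin=$(basename "$f" ".$ext")
        # Header lines start with '>'; keep the ID up to the first whitespace.
        awk -v bin="$bin" '/^>/ { print substr($1, 2) "\t" bin }' "$f"
    done
}

# Usage, mirroring the command above:
# fasta_to_contig2bin /maxbin2_outdir fa > maxbin2.tsv
```

Trimming the header at the first whitespace matters because the contig IDs in the table need to match the IDs in the contigs FASTA given with -c.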

2021 5/26

Using MaxBin2 and MetaBAT2, with the penalty values lowered. For large datasets, use DIAMOND.

DAS_Tool -i maxbin2.tsv,metabat2.tsv \
-l maxbin2,metabat2 \
-c contigs.fa \
-t 20 \
-o DASTool \
--write_bins 1 \
--create_plots 1 \
--score_threshold 0.3 \
--duplicate_penalty 0.3 \
--megabin_penalty 0.3 \
--search_engine diamond

# Added 2024/5: the options have changed slightly in newer versions of DAS_Tool
DAS_Tool -i maxbin2.tsv,metabat2.tsv \
-l maxbin2,metabat2 \
-c contigs.fa \
-t 20 \
-o DASTool \
--search_engine diamond \
--write_bins \
--score_threshold 0.3 \
--duplicate_penalty 0.3 \
--megabin_penalty 0.3
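
Before starting a run such as the ones above, it can be worth sanity-checking each input table against the contigs file. The sketch below is an optional pre-check, not something DAS_Tool itself provides. The function name check_contig2bin is hypothetical, and the three rules it applies (exactly two tab-separated columns, every contig ID present in the FASTA, no contig listed twice in one table) are assumptions about what a well-formed table looks like.

```shell
#!/bin/sh
# Check a contig<TAB>bin table against a contigs FASTA file.
# Prints one message per problem and returns non-zero if any were found.
check_contig2bin() {
    table=$1
    contigs=$2
    awk -F'\t' '
        BEGIN { bad = 0 }
        # First file: collect contig IDs (header text up to the first whitespace).
        NR == FNR {
            if (/^>/) { split(substr($0, 2), a, /[ \t]/); known[a[1]] = 1 }
            next
        }
        # Second file: validate each table line.
        NF != 2        { printf "line %d: expected 2 columns\n", FNR; bad = 1; next }
        !($1 in known) { printf "line %d: %s not in contigs\n", FNR, $1; bad = 1 }
        seen[$1]++     { printf "line %d: %s assigned more than once\n", FNR, $1; bad = 1 }
        END { exit bad }
    ' "$contigs" "$table"
}

# Usage with the file names from the commands above:
# check_contig2bin maxbin2.tsv contigs.fa && check_contig2bin metabat2.tsv contigs.fa
```

No output and a zero exit status mean the table passed all three checks.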

Citation

Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy
Christian M. K. Sieber, Alexander J. Probst, Allison Sharrar, Brian C. Thomas, Matthias Hess, Susannah G. Tringe, Jillian F. Banfield
Nat Microbiol. 2018 Jul;3(7):836-843