AMBER, a tool for evaluating metagenome binners - macでインフォマティクス

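　Shotgun metagenomics makes it possible to study microbial communities and their members. These members vary widely in evolutionary divergence and abundance: some are very closely related at the strain level, others are evolutionarily distant, and abundances can differ by several orders of magnitude. Genome binning software deconvolutes metagenomic reads or assembled sequences into bins that represent the genomes of community members. A common and practical approach to genome binning exploits the covariation of coverage and short k-mer composition across one or more co-assemblies, but bin quality drops substantially when strain-level diversity is present [ref. 1].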
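To make "short k-mer composition" concrete, here is a minimal sketch of a normalized tetranucleotide profile and a distance between two profiles. It is only an illustration: the function names `kmer_profile` and `profile_distance` are made up for this post, reverse complements are not merged, and real binners combine such a signal with coverage in their own ways.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalized frequencies of all ACGT k-mers in seq (hypothetical helper)."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing N or other symbols are ignored in the total.
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def profile_distance(p, q):
    """Euclidean distance between two k-mer profiles with the same keys."""
    return sum((p[km] - q[km]) ** 2 for km in p) ** 0.5


profile = kmer_profile("ACGTACGT")
print(profile["ACGT"])
```

Contigs from the same genome tend to have similar profiles, which is why a small distance is used as evidence that two contigs belong in the same bin.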

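　Benchmarking metagenomics methods such as assembly and profiling is important for both users and method developers. Users need to determine the most suitable program and parameterization for a particular application and dataset, while developers need to compare new or improved methods against existing ones. Without evaluation software and standardized metrics, both have to invest considerable effort in assessing methods individually. The Critical Assessment of Metagenome Interpretation (CAMI) is a community-driven initiative that aims to tackle this problem by establishing evaluation standards and best practices, including the design of benchmark datasets and performance metrics [ref. 1, 2]. Following community requirements and suggestions, the first CAMI challenge provided metagenome datasets of microbial communities with differing biological complexity, for which participants could submit results for assembly, taxonomic and genome binning, and taxonomic profiling. These submissions were then evaluated using metrics selected by the community [ref. 1].

　Described here is the Assessment of Metagenome BinnERs (AMBER) software package for the comparative assessment of genome binning reconstructions from metagenome benchmark datasets. It implements all the metrics judged most relevant for assessing the quality of genome reconstructions in the first CAMI challenge, and it can be applied to arbitrary benchmark datasets. AMBER automatically generates binning quality assessment outputs as flat files, summary tables, rankings, visualizations as images, and an interactive HTML page. AMBER complements the popular CheckM software, which assesses bin quality for metagenome samples based on sets of single-copy marker genes [ref. 3].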

Installation

Tested on Ubuntu 16.04 with Python 3.5.0.

Dependencies

numpy==1.13.0
biopython==1.69.0
matplotlib==2.0.2
bokeh==0.12.9
pandas==0.20.3
seaborn==0.8.1

Optional

tox, for automatic tests
LaTeX, for combining plots into a PDF file with tool src/create_summary_pdf.py

GitHub

pip3 install cami-amber

usage: AMBER [-h] -g GOLD_STANDARD_FILE [-f FASTA_FILE] [-l LABELS]
             [-p FILTER] [-r REMOVE_GENOMES] [-k KEYWORD] -o OUTPUT_DIR [-m]
             [-x MIN_COMPLETENESS] [-y MAX_CONTAMINATION] [-v]
             bin_files [bin_files ...]

Compute all metrics and figures for one or more binning files; output summary
to screen and results per binning file to chosen directory

positional arguments:
  bin_files             Binning files

optional arguments:
  -h, --help            show this help message and exit
  -g GOLD_STANDARD_FILE, --gold_standard_file GOLD_STANDARD_FILE
                        Gold standard - ground truth - file
  -f FASTA_FILE, --fasta_file FASTA_FILE
                        FASTA or FASTQ file with sequences of gold standard
                        (required if gold standard file misses column _LENGTH)
  -l LABELS, --labels LABELS
                        Comma-separated binning names
  -p FILTER, --filter FILTER
                        Filter out [FILTER]% smallest bins (default: 0)
  -r REMOVE_GENOMES, --remove_genomes REMOVE_GENOMES
                        File with list of genomes to be removed
  -k KEYWORD, --keyword KEYWORD
                        Keyword in the second column of file with list of
                        genomes to be removed (no keyword=remove all genomes
                        in list)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Directory to write the results to
  -m, --map_by_completeness
                        Map genomes to bins by maximizing completeness
  -x MIN_COMPLETENESS, --min_completeness MIN_COMPLETENESS
                        Comma-separated list of min. completeness thresholds
                        (default %: 50,70,90)
  -y MAX_CONTAMINATION, --max_contamination MAX_CONTAMINATION
                        Comma-separated list of max. contamination thresholds
                        (default %: 10,5)
  -v, --version         show program's version number and exit

——

Alternatively, use the Docker image provided by the authors.

git clone https://github.com/CAMI-challenge/AMBER.git
cd AMBER/
docker build -t cami/amber:latest .


docker run -v \
$(pwd)/input/gold_standard.fasta:/bbx/input/gold_standard.fasta \
-v $(pwd)/input/gsa_mapping.binning:/bbx/input/gsa_mapping.binning \
-v $(pwd)/input/test_query.binning:/bbx/input/test_query.binning \
-v $(pwd)/output:/bbx/output \
-v $(pwd)/input/biobox.yaml:/bbx/input/biobox.yaml \
cami/amber:latest default

Test run

cd AMBER/

python3 amber.py -g test/gsa_mapping.binning \
-l "MaxBin 2.0, CONCOCT, MetaBAT" \
-p 1 \
-r test/unique_common.tsv \
-k "circular element" \
test/naughty_carson_2 \
test/goofy_hypatia_2 \
test/elated_franklin_0 \
-o output_dir/
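If you want to script this invocation (for example, to loop over several sets of binning files), the argument list can be assembled in Python. This is a minimal sketch: `build_amber_command` is a hypothetical helper written for this post, not part of AMBER, and it only uses the flags shown in the usage text above.

```python
def build_amber_command(gold_standard, bin_files, output_dir,
                        labels=None, filter_percent=None,
                        remove_genomes=None, keyword=None):
    """Assemble the argument list for amber.py (hypothetical helper)."""
    cmd = ["python3", "amber.py", "-g", gold_standard]
    if labels:
        # -l takes a single comma-separated string of binning names.
        cmd += ["-l", ", ".join(labels)]
    if filter_percent is not None:
        cmd += ["-p", str(filter_percent)]
    if remove_genomes:
        cmd += ["-r", remove_genomes]
    if keyword:
        cmd += ["-k", keyword]
    cmd += list(bin_files)
    cmd += ["-o", output_dir]
    return cmd


cmd = build_amber_command(
    "test/gsa_mapping.binning",
    ["test/naughty_carson_2", "test/goofy_hypatia_2", "test/elated_franklin_0"],
    "output_dir/",
    labels=["MaxBin 2.0", "CONCOCT", "MetaBAT"],
    filter_percent=1,
    remove_genomes="test/unique_common.tsv",
    keyword="circular element",
)
print(cmd)
# To run it from the AMBER/ directory:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with values that contain spaces, such as "circular element".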

Output (output_dir/)

$ ls -alth /data/output_dir/
total 744K
drwxr-xr-x 15 kazu kazu 510 Oct 13 03:24 ..
drwxrwxr-x 29 kazu kazu 986 Oct 13 03:23 .
-rw-rw-r-- 1 kazu kazu 82K Oct 13 03:23 summary.html
-rw-rw-r-- 1 kazu kazu 88K Oct 13 03:23 purity_completeness_per_bin.eps
-rw-rw-r-- 1 kazu kazu 26K Oct 13 03:23 purity_completeness_per_bin.pdf
-rw-rw-r-- 1 kazu kazu 45K Oct 13 03:23 purity_completeness_per_bin.png
-rw-rw-r-- 1 kazu kazu 310 Oct 13 03:23 rankings.txt
-rw-rw-r-- 1 kazu kazu 25K Oct 13 03:23 boxplot_completeness.eps
-rw-rw-r-- 1 kazu kazu 14K Oct 13 03:23 boxplot_completeness.pdf
-rw-rw-r-- 1 kazu kazu 19K Oct 13 03:23 boxplot_completeness.png
-rw-rw-r-- 1 kazu kazu 20K Oct 13 03:23 boxplot_completeness_wo_legend.eps
-rw-rw-r-- 1 kazu kazu 24K Oct 13 03:23 boxplot_purity.eps
-rw-rw-r-- 1 kazu kazu 13K Oct 13 03:23 boxplot_purity.pdf
-rw-rw-r-- 1 kazu kazu 16K Oct 13 03:23 boxplot_purity.png
-rw-rw-r-- 1 kazu kazu 19K Oct 13 03:23 boxplot_purity_wo_legend.eps
-rw-rw-r-- 1 kazu kazu 36K Oct 13 03:23 ari_vs_assigned_bps.eps
-rw-rw-r-- 1 kazu kazu 15K Oct 13 03:23 ari_vs_assigned_bps.pdf
-rw-rw-r-- 1 kazu kazu 48K Oct 13 03:23 ari_vs_assigned_bps.png
-rw-rw-r-- 1 kazu kazu 16K Oct 13 03:23 avg_purity_completeness.pdf
-rw-rw-r-- 1 kazu kazu 37K Oct 13 03:23 avg_purity_completeness_per_bp.eps
-rw-rw-r-- 1 kazu kazu 16K Oct 13 03:23 avg_purity_completeness_per_bp.pdf
-rw-rw-r-- 1 kazu kazu 52K Oct 13 03:23 avg_purity_completeness_per_bp.png
-rw-rw-r-- 1 kazu kazu 40K Oct 13 03:23 avg_purity_completeness.eps
-rw-rw-r-- 1 kazu kazu 51K Oct 13 03:23 avg_purity_completeness.png
drwxrwxr-x 7 kazu kazu 238 Oct 13 03:23 elated_franklin_0
drwxrwxr-x 7 kazu kazu 238 Oct 13 03:23 goofy_hypatia_2
-rw-rw-r-- 1 kazu kazu 9.8K Oct 13 03:23 legend.eps
drwxrwxr-x 7 kazu kazu 238 Oct 13 03:23 naughty_carson_2
-rw-rw-r-- 1 kazu kazu 687 Oct 13 03:23 summary.tsv

Metrics

avg_purity: purity averaged over genome bins
std_dev_purity: standard deviation of purity averaged over genome bins
sem_purity: standard error of the mean of purity averaged over genome bins
avg_completeness: completeness averaged over genome bins
std_dev_completeness: standard deviation of completeness averaged over genome bins
sem_completeness: standard error of the mean of completeness averaged over genome bins
avg_purity_per_bp: average purity per base pair
avg_completeness_per_bp: average completeness per base pair
rand_index_by_bp: Rand index weighted by base pairs
rand_index_by_seq: Rand index weighted by sequence counts
a_rand_index_by_bp: adjusted Rand index weighted by base pairs
a_rand_index_by_seq: adjusted Rand index weighted by sequence counts
percent_assigned_bps: percentage of base pairs that were assigned to bins
accuracy: accuracy
>0.5compl<0.1cont: number of bins with more than 50% completeness and less than 10% contamination
>0.7compl<0.1cont: number of bins with more than 70% completeness and less than 10% contamination
>0.9compl<0.1cont: number of bins with more than 90% completeness and less than 10% contamination
>0.5compl<0.05cont: number of bins with more than 50% completeness and less than 5% contamination
>0.7compl<0.05cont: number of bins with more than 70% completeness and less than 5% contamination
>0.9compl<0.05cont: number of bins with more than 90% completeness and less than 5% contamination
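To get a feel for what purity, completeness and the threshold counts mean, here is a toy calculation. It is a simplified sketch under my own assumptions, not AMBER's implementation: each predicted bin is mapped to the genome that contributes the most base pairs to it, purity is that genome's share of the bin, completeness is the share of the genome recovered in the bin, and contamination is taken as 1 - purity. The function names and the toy contigs are made up.

```python
from collections import defaultdict


def bin_metrics(gold, query, lengths):
    """gold: seq_id -> genome, query: seq_id -> bin, lengths: seq_id -> bp."""
    genome_size = defaultdict(int)
    for seq, genome in gold.items():
        genome_size[genome] += lengths[seq]

    overlap = defaultdict(lambda: defaultdict(int))  # bin -> genome -> bp
    bin_size = defaultdict(int)
    for seq, bin_id in query.items():
        overlap[bin_id][gold[seq]] += lengths[seq]
        bin_size[bin_id] += lengths[seq]

    metrics = {}
    for bin_id, per_genome in overlap.items():
        # Map the bin to the genome with the largest overlap in base pairs.
        genome, true_bp = max(per_genome.items(), key=lambda kv: kv[1])
        metrics[bin_id] = {
            "genome": genome,
            "purity": true_bp / bin_size[bin_id],
            "completeness": true_bp / genome_size[genome],
        }
    return metrics


def count_high_quality(metrics, min_completeness=0.5, max_contamination=0.1):
    """Count bins above the completeness and below the contamination threshold."""
    return sum(
        1 for m in metrics.values()
        if m["completeness"] > min_completeness
        and (1 - m["purity"]) < max_contamination
    )


# Toy data: genome A = c1 + c2 (400 bp), genome B = c3 + c4 (600 bp).
lengths = {"c1": 100, "c2": 300, "c3": 200, "c4": 400}
gold = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
query = {"c1": "bin1", "c2": "bin1", "c3": "bin1", "c4": "bin2"}

metrics = bin_metrics(gold, query, lengths)
print(metrics)
print(count_high_quality(metrics))
```

In this toy case bin1 recovers all of genome A but is contaminated with a contig from B, while bin2 is pure but incomplete, which is exactly the trade-off the purity and completeness plots visualize.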

summary.tsv

[Image: screenshot of summary.tsv]
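Since summary.tsv is a plain tab-separated table, it is easy to post-process. The sketch below ranks binners by one metric using only the standard library. Note the assumptions: the label column is assumed to be called `tool` (check the header of your own file), the metric names are taken from the list above, and the values in the toy table are invented placeholders, not results from the test run.

```python
import csv
import io


def rank_by_metric(tsv_text, metric, label_column="tool"):
    """Return (label, value) pairs sorted by the given metric, best first."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pairs = [(row[label_column], float(row[metric])) for row in rows]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy table with invented numbers; the column name "tool" is an assumption.
toy_summary = (
    "tool\tavg_purity\tavg_completeness\n"
    "binner_A\t0.90\t0.60\n"
    "binner_B\t0.80\t0.75\n"
)

print(rank_by_metric(toy_summary, "avg_completeness"))

# For a real run, read the file instead:
#   with open("output_dir/summary.tsv") as handle:
#       ranking = rank_by_metric(handle.read(), "avg_purity")
```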

Several figures are also written out.

[Figure: purity_completeness_per_bin.pdf]

[Figure: ari_vs_assigned_bps.pdf]

[Figure: avg_purity_completeness.pdf]

[Figure: boxplot_completeness.pdf]

[Figure: boxplot_purity.pdf]

summary.html

AMBER: Assessment of Metagenome BinnERs (example)

[Image: screenshot of the interactive summary.html page]

For the actual workflow, please read the README on GitHub. You will also need a gold standard file, obtained by mapping the contigs to the genomes.
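For reference, the gold standard and the binning files are tab-separated text files in the CAMI binning format. As far as I remember from the README, they start with header lines beginning with `@` and a column line beginning with `@@` (for example `@@SEQUENCEID`, `BINID` and optionally `_LENGTH`, the column mentioned in the `-f` help text); treat those details as an assumption and check the README for the exact specification. A minimal parser sketch, with a hypothetical function name and made-up contig names:

```python
def parse_binning(text):
    """Parse a CAMI-style binning table into a list of dicts (sketch)."""
    columns = None
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@@"):
            # Column definition line, e.g. @@SEQUENCEID<TAB>BINID<TAB>_LENGTH
            columns = line[2:].split("\t")
            continue
        if line.startswith("@") or line.startswith("#"):
            # Header such as @Version:... or a comment; ignored here.
            continue
        if columns is None:
            raise ValueError("data line found before the @@ column line")
        records.append(dict(zip(columns, line.split("\t"))))
    return records


# Toy file content; sequence and genome names are invented.
toy_gold_standard = (
    "@Version:0.9.1\n"
    "@SampleID:gsa\n"
    "\n"
    "@@SEQUENCEID\tBINID\t_LENGTH\n"
    "contig_1\tgenome_A\t1500\n"
    "contig_2\tgenome_B\t700\n"
)

records = parse_binning(toy_gold_standard)
print(records)
```

If the `_LENGTH` column is missing from the gold standard, AMBER needs the sequences themselves via `-f`, as stated in the help text.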

Citation

AMBER: Assessment of Metagenome BinnERs
Fernando Meyer, Peter Hofmann, Peter Belmann, Ruben Garrido-Oter, Adrian Fritz, Alexander Sczyrba, Alice C McHardy

Gigascience. 2018 Jun; 7(6): giy069

Commands are also given in the CAMI benchmarking paper.

Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit | Nature Protocols