magpurify: checking metagenome assemblies for contamination

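The genome sequences of many species in the human gut microbiome remain unknown, largely because microorganisms are difficult to cultivate under laboratory conditions. This study addresses the problem by reconstructing 60,664 draft prokaryotic genomes from 3,810 faecal metagenomes of geographically and phenotypically diverse humans. These genomes provide reference points for 2,058 newly identified species-level OTUs, a 50% increase over the previously known phylogenetic diversity of sequenced gut bacteria. On average, the newly identified OTUs account for 33% of richness and 28% of species abundance per individual, and they are enriched in humans from rural populations. A meta-analysis of clinical gut microbiome studies pinpointed numerous disease associations for the newly identified OTUs, which could improve predictive models. Finally, uncultured gut species were found to have undergone genome reduction that resulted in the loss of certain biosynthetic pathways, which may offer clues for improving future cultivation strategies.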

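From the GitHub repository

This package combines a variety of features and algorithms to identify contamination in metagenome-assembled genomes (MAGs). Contamination is defined as contigs that originate from a different species than the dominant organism present in the MAG.

Each module in the package is designed to be highly specific. This means that not all contamination (contigs originating from other species) will be removed, but very few contigs will be removed incorrectly.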

Installation

Dependencies

If you install MAGpurify using conda, all dependencies will be installed automatically. However, if you choose to install it through pip, you will need to install some required third-party software:

BLAST
Prodigal
HMMER
LAST
Mash
CoverM

GitHub

#bioconda (link)
mamba install -c bioconda magpurify

#pypi (link)
pip install magpurify
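
When installing through pip, it can help to confirm that the third-party tools listed above are actually on PATH before running any module. The following is a minimal sketch, not part of MAGpurify: the function name `missing_tools` is made up for illustration, and the executable names are assumptions about how the six dependencies are usually installed.

```shell
# Print every given executable that is not found on PATH, one per line.
# Returns 1 if at least one is missing, 0 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Assumed executable names for BLAST, Prodigal, HMMER, LAST, Mash and CoverM;
# adjust them if your installation uses different names.
MAGPURIFY_DEPS="blastp blastn prodigal hmmsearch lastal mash coverm"

# Usage:
#   missing_tools $MAGPURIFY_DEPS || echo "install the tools printed above"
```

With a conda installation this check should be unnecessary, since the dependencies are installed automatically.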

> magpurify -h

$ magpurify

usage: magpurify [-h] [--version]
                 {phylo-markers,clade-markers,conspecific,tetra-freq,gc-content,coverage,known-contam,clean-bin}
                 ...

Identify and remove incorrectly binned contigs from metagenome-assembled
genomes.

positional arguments:
  {phylo-markers,clade-markers,conspecific,tetra-freq,gc-content,coverage,known-contam,clean-bin}
    phylo-markers       find taxonomic discordant contigs using a database of
                        phylogenetic marker genes.
    clade-markers       find taxonomic discordant contigs using a database of
                        clade-specific marker genes.
    conspecific         find contigs that fail to align to closely related
                        genomes.
    tetra-freq          find contigs with outlier tetranucleotide frequency.
    gc-content          find contigs with outlier GC content.
    coverage            find contigs with outlier coverage profile.
    known-contam        find contigs that match a database of known
                        contaminants.
    clean-bin           remove putative contaminant contigs from bin.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

> magpurify phylo-markers

$ magpurify phylo-markers

usage: magpurify phylo-markers [-h] [--db DB] [--continue]
                               [--max_target_seqs MAX_TARGET_SEQS]
                               [--cutoff_type {strict,sensitive,none}]
                               [--seq_type {dna,protein,both,either}]
                               [--hit_type {all_hits,top_hit}]
                               [--exclude_clades EXCLUDE_CLADES [EXCLUDE_CLADES ...]]
                               [--bin_fract BIN_FRACT]
                               [--contig_fract CONTIG_FRACT] [--allow_noclass]
                               [--threads THREADS]
                               fna out

Find taxonomic discordant contigs using a database of phylogenetic marker
genes.

positional arguments:
  fna                   Path to input genome in FASTA format
  out                   Output directory to store results and intermediate
                        files

optional arguments:
  -h, --help            show this help message and exit
  --db DB               Path to reference database. By default, the
                        MAGPURIFYDB environmental variable is used (default:
                        None)
  --continue            Go straight to quality estimation and skip all
                        previous steps (default: False)
  --max_target_seqs MAX_TARGET_SEQS
                        Maximum number of targets reported in BLAST table
                        (default: 1)
  --cutoff_type {strict,sensitive,none}
                        Use strict or sensitive %ID cutoff for taxonomically
                        annotating genes (default: strict)
  --seq_type {dna,protein,both,either}
                        Choose to search genes versus DNA or protein database
                        (default: protein)
  --hit_type {all_hits,top_hit}
                        Transfer taxonomy of all hits or top hit per gene
                        (default: top_hit)
  --exclude_clades EXCLUDE_CLADES [EXCLUDE_CLADES ...]
                        List of clades to exclude (ex: s__1300164.4) (default:
                        None)
  --bin_fract BIN_FRACT
                        Min fraction of genes in bin that agree with consensus
                        taxonomy for bin annotation (default: 0.7)
  --contig_fract CONTIG_FRACT
                        Min fraction of genes in that disagree with bin
                        taxonomy for filtering (default: 1.0)
  --allow_noclass       Allow a bin to be unclassfied and flag any classified
                        contigs (default: False)
  --threads THREADS     Number of CPUs to use (default: 1)

> magpurify clade-markers

$ magpurify clade-markers

usage: magpurify clade-markers [-h] [--db DB]
                               [--exclude_clades EXCLUDE_CLADES [EXCLUDE_CLADES ...]]
                               [--min_bin_fract MIN_BIN_FRACT]
                               [--min_contig_fract MIN_CONTIG_FRACT]
                               [--min_gene_fract MIN_GENE_FRACT]
                               [--min_genes MIN_GENES]
                               [--lowest_rank {s,g,f,o,c,p,k}]
                               [--threads THREADS]
                               fna out

Find taxonomic discordant contigs using a database of clade-specific marker
genes.

positional arguments:
  fna                   Path to input genome in FASTA format
  out                   Output directory to store results and intermediate
                        files

optional arguments:
  -h, --help            show this help message and exit
  --db DB               Path to reference database. By default, the MAGPURIFY
                        environmental variable is used (default: None)
  --exclude_clades EXCLUDE_CLADES [EXCLUDE_CLADES ...]
                        List of clades to exclude (ex: s__Variovorax_sp_CF313)
                        (default: None)
  --min_bin_fract MIN_BIN_FRACT
                        Min fraction of bin length supported by contigs that
                        agree with consensus taxonomy (default: 0.6)
  --min_contig_fract MIN_CONTIG_FRACT
                        Min fraction of classified contig length that agree
                        with consensus taxonomy (default: 0.75)
  --min_gene_fract MIN_GENE_FRACT
                        Min fraction of classified genes that agree with
                        consensus taxonomy (default: 0.0)
  --min_genes MIN_GENES
                        Min number of genes that agree with consensus taxonomy
                        (default=rank-specific-cutoffs) (default: None)
  --lowest_rank {s,g,f,o,c,p,k}
                        Lowest rank for bin classification (default: None)
  --threads THREADS     Number of CPUs to use (default: 1)

> magpurify tetra-freq

$ magpurify tetra-freq

usage: magpurify tetra-freq [-h] [--cutoff CUTOFF] [--weighted-mean] fna out

Find contigs with outlier tetranucleotide frequency.

positional arguments:
  fna              Path to input genome in FASTA format
  out              Output directory to store results and intermediate files

optional arguments:
  -h, --help       show this help message and exit
  --cutoff CUTOFF  Cutoff (default: 0.06)
  --weighted-mean  Compute the mean weighted by the contig length (default:
                   False)

> magpurify gc-content

$ magpurify gc-content

usage: magpurify gc-content [-h] [--cutoff CUTOFF] [--weighted-mean] fna out

Find contigs with outlier GC content.

positional arguments:
  fna              Path to input genome in FASTA format
  out              Output directory to store results and intermediate files

optional arguments:
  -h, --help       show this help message and exit
  --cutoff CUTOFF  Cutoff (default: 15.75)
  --weighted-mean  Compute the mean weighted by the contig length (default:
                   False)

> magpurify known-contam

$ magpurify known-contam

usage: magpurify known-contam [-h] [--db DB] [--pid PID] [--evalue EVALUE]
                              [--qcov QCOV] [--threads THREADS]
                              fna out

Find contigs that match a database of known contaminants.

positional arguments:
  fna                Path to input genome in FASTA format
  out                Output directory to store results and intermediate files

optional arguments:
  -h, --help         show this help message and exit
  --db DB            Path to reference database. By default, the IMAGEN_DB
                     environmental variable is used (default: None)
  --pid PID          Minimum % identity to reference (default: 98)
  --evalue EVALUE    Maximum evalue (default: 1e-05)
  --qcov QCOV        Minimum percent query coverage (default: 25)
  --threads THREADS  Number of CPUs to use (default: 1)

Database

wget -O MAGpurify-db-v1.0.tar.bz2 "https://zenodo.org/record/3688811/files/MAGpurify-db-v1.0.tar.bz2?download=1"
tar -jxvf MAGpurify-db-v1.0.tar.bz2
export MAGPURIFYDB=/path/to/MAGpurify-db-v1.0

Even without exporting MAGPURIFYDB, the modules can be run by passing the database path with --db.

* tetra-freq and gc-content do not depend on external data. The phylo-markers, clade-markers, and known-contam modules run against this MAGpurify database.
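
Because the database can come either from --db or from the MAGPURIFYDB variable, a small check before launching a long run can catch a missing or mistyped path early. This is only a sketch; `resolve_magpurify_db` is a hypothetical helper name, not something MAGpurify provides.

```shell
# Resolve the database directory: an explicit path (first argument) wins,
# otherwise fall back to $MAGPURIFYDB. Prints the path on success;
# prints a message to stderr and returns 1 otherwise.
resolve_magpurify_db() {
    db="${1:-${MAGPURIFYDB:-}}"
    if [ -z "$db" ]; then
        echo "no database: export MAGPURIFYDB or pass a path" >&2
        return 1
    fi
    if [ ! -d "$db" ]; then
        echo "database directory not found: $db" >&2
        return 1
    fi
    printf '%s\n' "$db"
}

# Usage (requires magpurify to be installed):
#   db=$(resolve_magpurify_db) && magpurify phylo-markers --db "$db" example/test.fna outdir
```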

Test run

1. phylo-markers

Finds taxonomically discordant contigs using a database of phylogenetic marker genes.

git clone https://github.com/snayfach/MAGpurify.git
cd MAGpurify/
magpurify phylo-markers example/test.fna outdir

2. clade-markers

Finds taxonomically discordant contigs using a database of clade-specific marker genes.

magpurify clade-markers example/test.fna outdir

3. tetra-freq

Finds contigs with outlier tetranucleotide frequency.

magpurify tetra-freq example/test.fna outdir

4. gc-content

Finds contigs with outlier GC content.

magpurify gc-content example/test.fna outdir
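
To make the idea of a GC outlier concrete, here is a rough awk sketch that flags contigs whose GC percentage differs from the unweighted mean of all contigs by more than a cutoff. It is an illustration only, not MAGpurify's implementation: reading the cutoff (default 15.75) as a maximum absolute difference in percentage points is an assumption, and `gc_outliers` is a hypothetical name.

```shell
# gc_outliers FASTA [CUTOFF]
# Prints "contig<TAB>GC%<TAB>|GC% - mean|" for each contig whose GC content
# is further than CUTOFF percentage points from the unweighted mean.
gc_outliers() {
    awk -v cutoff="${2:-15.75}" '
        /^>/ { name = substr($1, 2); order[++n] = name; next }
        {
            seq = toupper($0)
            len[name] += length(seq)             # total bases in this contig
            gc[name]  += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C
        }
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                pct[id] = len[id] ? 100 * gc[id] / len[id] : 0
                sum += pct[id]
            }
            mean = n ? sum / n : 0
            for (i = 1; i <= n; i++) {
                id = order[i]
                delta = pct[id] - mean
                if (delta < 0) delta = -delta
                if (delta > cutoff) printf "%s\t%.2f\t%.2f\n", id, pct[id], delta
            }
        }
    ' "$1"
}

# Usage:
#   gc_outliers example/test.fna
#   gc_outliers example/test.fna 10
```

The unweighted mean mirrors the module's default (--weighted-mean is False); weighting by contig length would need each contig's percentage multiplied by its length before averaging.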

5. known-contam

Finds contigs that match a database of known contaminants.

magpurify known-contam example/test.fna outdir

6. conspecific

Finds contigs that fail to align to closely related genomes. A Mash sketch file is required.

magpurify conspecific example/test.fna outdir mash_sketch_file
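
The sketch file for conspecific can be built from a set of reference genomes with `mash sketch`, which appends `.msh` to the output prefix. The wrapper below is a hedged sketch: `sketch_refs` and the `MASH_CMD` dry-run hook are hypothetical, and which reference genomes to include is left to the reader.

```shell
# sketch_refs PREFIX GENOME...
# Builds PREFIX.msh from the given FASTA files using "mash sketch".
# MASH_CMD is a made-up dry-run hook: set it to "echo mash" to print the
# command instead of running it.
sketch_refs() {
    prefix=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "no reference genomes given" >&2
        return 1
    fi
    ${MASH_CMD:-mash} sketch -o "$prefix" "$@"
}

# Usage (requires mash; the genome paths are placeholders):
#   sketch_refs refs genomes/*.fna
#   magpurify conspecific example/test.fna outdir refs.msh
```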

Output so far

(Screenshot omitted.)

* conspecific was not run.
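
A quick way to review the results is to count how many contigs each module flagged. The sketch below assumes that every module writes a plain-text file named `flagged_contigs` into its own subdirectory of the output directory, one contig per line; verify that layout against your own output before relying on it. `flagged_summary` is a hypothetical helper name.

```shell
# flagged_summary OUTDIR
# Prints "module<TAB>count" for every OUTDIR/<module>/flagged_contigs file
# (assumed layout: one flagged contig ID per line).
flagged_summary() {
    for f in "$1"/*/flagged_contigs; do
        [ -e "$f" ] || continue                # glob matched nothing
        module=$(basename "$(dirname "$f")")
        count=$(grep -c . "$f" || true)        # non-empty lines; prints 0 if none
        printf '%s\t%s\n' "$module" "$count"
    done
}

# Usage:
#   flagged_summary outdir
```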

7. clean-bin

Removes putative contaminant contigs from the bin, based on the results of steps 1-6.

magpurify clean-bin example/test.fna outdir output_bin.fasta

output_bin.fasta is written.
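
The steps above can be chained for one bin with a short wrapper. This is a sketch under assumptions: `purify_bin` and the `MAGPURIFY_CMD` dry-run hook are hypothetical names, conspecific is left out because it needs a Mash sketch file, and the database-backed modules expect MAGPURIFYDB to be set.

```shell
# purify_bin FASTA OUTDIR CLEANED_FASTA
# Runs five modules on one bin, then clean-bin. Stops at the first failure.
# MAGPURIFY_CMD is a made-up dry-run hook: set it to "echo magpurify" to
# print the commands instead of running them.
purify_bin() {
    fna=$1
    outdir=$2
    cleaned=$3
    for module in phylo-markers clade-markers tetra-freq gc-content known-contam; do
        ${MAGPURIFY_CMD:-magpurify} "$module" "$fna" "$outdir" || return 1
    done
    ${MAGPURIFY_CMD:-magpurify} clean-bin "$fna" "$outdir" "$cleaned"
}

# Usage:
#   MAGPURIFY_CMD="echo magpurify" purify_bin example/test.fna outdir output_bin.fasta   # preview
#   purify_bin example/test.fna outdir output_bin.fasta                                  # real run
```

Comparing `grep -c '^>' example/test.fna` with `grep -c '^>' output_bin.fasta` afterwards shows how many contigs were removed.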

Reference

New insights from uncultivated genomes of the global human gut microbiome
Stephen Nayfach, Zhou Jason Shi, Rekha Seshadri, Katherine S. Pollard & Nikos C. Kyrpides
Nature volume 568, pages 505–510 (2019)