MAGinator: enabling strain-level quantification of MAGs

Updated 2023/10/10

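Metagenomic sequencing has brought major advantages to the characterization of microbiomes, but currently available analysis tools cannot combine strain-level taxonomic resolution and abundance estimation with functional profiling of assembled genomes. To clarify how the microbiome relates to human health, improved tools are needed that give a comprehensive picture of microbial composition and reveal the phylogenetic and functional relationships between microbes. Here the authors present MAGinator, a freely available tool tailored to profiling shotgun metagenomics datasets. MAGinator identifies subspecies-level microbes de novo and estimates accurate abundances of metagenome-assembled genomes (MAGs). It uses information from both gene-based and contig-based methods, giving insight into taxonomic profiles and the origin of genes, as well as into genetic content, which is used to infer the functional content of each sample per host organism. In addition, MAGinator facilitates the reconstruction of phylogenetic relationships between MAGs and provides a framework for identifying clade-level differences within subspecies MAGs. MAGinator is available as a Python module at https://github.com/Russel88/MAGinator.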

Installation

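I installed it on Ubuntu 18 following the authors' instructions. The installation method described in the manual creates a Python 3.6 environment, and that produced errors at run time, so I installed it into a Python 3.10 environment as shown below.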

GitHub

mamba create -n maginator python=3.10
conda activate maginator
pip install snakemake
pip install maginator

$ maginator -h

usage: maginator -v VAMB_CLUSTERS -r READS -c CONTIGS -o OUTPUT -g GTDB_DB
                 [--cluster {None,qsub,slurm,drmaa}]
                 [--cluster_info CLUSTER_INFO] [--max_jobs MAX_JOBS] [-V] [-h]
                 [--max_cores MAX_CORES] [--max_mem MAX_MEM]
                 [--log_lvl {DEBUG,INFO,WARNING,ERROR}] [--only_conda]
                 [--snake SNAKE] [--unlock] [--binsize BINSIZE]
                 [--mgs_collections]
                 [--annotation_prevalence ANNOTATION_PREVALENCE]
                 [--clustering_coverage CLUSTERING_COVERAGE]
                 [--clustering_min_seq_id CLUSTERING_MIN_SEQ_ID]
                 [--clustering_type {nucleotide,protein}]
                 [--min_gtdb_markers MIN_GTDB_MARKERS]
                 [--marker_gene_cluster_prevalence MARKER_GENE_CLUSTER_PREVALENCE]
                 [--min_af MIN_AF] [--min_depth MIN_DEPTH]
                 [--min_nonN MIN_NONN] [--min_marker_genes MIN_MARKER_GENES]
                 [--min_signature_genes MIN_SIGNATURE_GENES]
                 [--phylo {fasttree,iqtree}]
                 [--tax_scope_threshold TAX_SCOPE_THRESHOLD]
                 [--synteny_adj_cutoff SYNTENY_ADJ_CUTOFF]
                 [--synteny_mcl_inflation SYNTENY_MCL_INFLATION]

MAGinator version 0.1.18

required arguments:
  -v VAMB_CLUSTERS, --vamb_clusters VAMB_CLUSTERS
                        Path to VAMB clusters.tsv file
  -r READS, --reads READS
                        Comma-delimited file with format: SampleName,AbsolutePathToForwardReads,AbsolutePathToReverseReads. SampleNames should match the 1st column in clusters.tsv with the pattern SampleName_{clusternumber}
  -c CONTIGS, --contigs CONTIGS
                        Fasta file with contig sequences. Fasta headers should match the 2nd column in the clusters.tsv file
  -o OUTPUT, --output OUTPUT
                        Prefix for output directory
  -g GTDB_DB, --gtdb_db GTDB_DB
                        Path to GTDB-tk database

compute cluster arguments:
  --cluster {None,qsub,slurm,drmaa}
                        Cluster compute structure [None]
  --cluster_info CLUSTER_INFO
                        Cluster scheduler arguments when submitting cluster jobs.
                        Has to contain the following special strings:
                        {mem_gb}, {cores}, and {runtime}.
                        These special strings will be substituted by maginator to indicate resources for each job.
                        {mem_gb} is substituted for the mem_gb in GB.
                        {runtime} is substituted with the time in the following format: DD:HH:MM:SS.
                        Can also contain user names, groups, etc. required by the cluster scheduler
  --max_jobs MAX_JOBS   Maximum number of cluster jobs [500]

optional arguments:
  -V, --version         show program's version number and exit
  -h, --help            show this help message and exit
  --max_cores MAX_CORES
                        Maximum number of cores [40]
  --max_mem MAX_MEM     Maximum mem_gb in GB [180]
  --log_lvl {DEBUG,INFO,WARNING,ERROR}
                        Logging level [INFO].
  --only_conda          Only install conda environments, then exit
  --snake SNAKE         Only run specific snakemake command. For debug purposes
  --unlock              Unlock snakemake directory in case of unexpected exists, then exit

parameters:
  --binsize BINSIZE     Minimum bin size for inclusion [200000].
  --mgs_collections     If set, bin clusters will be aggregated to metagenomic species.
  --annotation_prevalence ANNOTATION_PREVALENCE
                        Minimum prevalence of taxonomic assignment in a cluster of bins to call consensus [0.75]
  --clustering_coverage CLUSTERING_COVERAGE
                        Alignment coverage for clustering of genes with MMseqs2 [0.8]
  --clustering_min_seq_id CLUSTERING_MIN_SEQ_ID
                        Sequence identity threshold for clustering of genes with MMseqs2 [0.95]
  --clustering_type {nucleotide,protein}
                        Sequence type for gene clustering with MMseqs2. Nucleotide- or protein-level [protein]
  --min_gtdb_markers MIN_GTDB_MARKERS
                        Minimum GTDBtk marker genes shared between MGS and outgroup for rooting trees [10]
  --marker_gene_cluster_prevalence MARKER_GENE_CLUSTER_PREVALENCE
                        Minimum prevalence of marker genes to be selected for rooting of MGS trees [0.5]
  --min_af MIN_AF       Minimim allele frequency for calling a base when creating phylogenies [0.8]
  --min_depth MIN_DEPTH
                        Minimim read depth for calling a base when creating phylogenies [2]
  --min_nonN MIN_NONN   Minimum fraction of non-N characters of a sample to be included in a phylogeny [0.5]
  --min_marker_genes MIN_MARKER_GENES
                        Minimum marker genes to be detected for inclusion of a sample in a phylogeny [2]
  --min_signature_genes MIN_SIGNATURE_GENES
                        Minimum signature genes to be detected for inclusion of a sample in a phylogeny [50]
  --phylo {fasttree,iqtree}
                        Software for phylogeny inference. Either fast (fasttree) or slow and more accurate (iqtree) [fasttree]
  --tax_scope_threshold TAX_SCOPE_THRESHOLD
                        Threshold for assigning the taxonomic scope of a gene cluster [0.9]
  --synteny_adj_cutoff SYNTENY_ADJ_CUTOFF
                        Minimum number of times gene clusters should be adjacent to be included in synteny graph [1]
  --synteny_mcl_inflation SYNTENY_MCL_INFLATION
                        Inflation parameter for mcl clustering of synteny graph. Usually between 1.2 and 5. Higher values produce smaller clusters [5]
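
The help text above says that --cluster_info must contain the placeholders {mem_gb}, {cores}, and {runtime}, which maginator fills in for each job. The sketch below only mimics that substitution so you can preview what the scheduler would receive. It is not MAGinator's own code: the function name preview_cluster_info is hypothetical, and the PBS/qsub-style resource string is an assumed example that you would adapt to your scheduler.

```shell
#!/usr/bin/env bash
# Illustration only: fill the three placeholders named in the help text.
# preview_cluster_info is a hypothetical helper, not part of MAGinator.
preview_cluster_info() {
    local template=$1 cores=$2 mem_gb=$3 runtime=$4
    printf '%s\n' "$template" |
        sed -e "s/{cores}/$cores/g" \
            -e "s/{mem_gb}/$mem_gb/g" \
            -e "s/{runtime}/$runtime/g"
}

# Assumed qsub-style template; runtime uses the DD:HH:MM:SS format from the help text.
preview_cluster_info "-l nodes=1:ppn={cores},mem={mem_gb}gb,walltime={runtime}" 8 32 00:12:00:00
# prints: -l nodes=1:ppn=8,mem=32gb,walltime=00:12:00:00
```

If the previewed string looks right, the same template would be the value passed to --cluster_info together with --cluster qsub (or slurm, drmaa).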

The GTDB-Tk database, release R207, is also required (only if you have not downloaded it already).

wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
tar xvzf gtdbtk_r207_v2_data.tar.gz

How to run

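A run needs three inputs: the clusters.tsv file produced by VAMB, a FASTA file containing all contigs (every sequence name must be unique), and reads.csv, a comma-separated file listing the FASTQ paths (SampleName,PathToForwardReads,PathToReverseReads). According to the help text, the paths should be absolute.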
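
For your own data, reads.csv can be generated instead of written by hand. The following is a minimal sketch, assuming paired FASTQ files named <sample>_1.fastq and <sample>_2.fastq (the naming fasterq-dump uses for paired runs). The function name make_reads_csv is hypothetical, and taking the sample name from the file name is an assumption: the names must still match the sample names used in clusters.tsv.

```shell
#!/usr/bin/env bash
# Sketch: print one "SampleName,ForwardPath,ReversePath" line per FASTQ pair.
# make_reads_csv is a hypothetical helper, not part of MAGinator.
make_reads_csv() {
    local fastq_dir fwd rev sample
    fastq_dir=$(cd "$1" && pwd) || return 1   # resolve to an absolute path
    for fwd in "$fastq_dir"/*_1.fastq; do
        [ -e "$fwd" ] || continue             # the glob matched nothing
        sample=$(basename "$fwd" _1.fastq)    # assumed: sample name = file prefix
        rev="$fastq_dir/${sample}_2.fastq"
        if [ ! -e "$rev" ]; then
            echo "warning: no reverse reads for $sample, skipping" >&2
            continue
        fi
        printf '%s,%s,%s\n' "$sample" "$fwd" "$rev"
    done
}

# Usage (SRA is the directory holding the FASTQ files):
#   make_reads_csv SRA > reads.csv
```

Samples without a matching reverse file are skipped with a warning rather than written as incomplete rows.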

# Test run
git clone https://github.com/Russel88/MAGinator.git
cd MAGinator/maginator/test_data/
gzip -dv contigs.fasta.gz

# Download the three SRA runs
fasterq-dump SRR14001027 SRR14001026 SRR14000819 -O SRA -t /tmp -e 8 -p

# Edit the FASTQ paths in reads.csv so they point to the downloaded FASTQ files
# Run
maginator -v vamb_clusters.tsv -r reads.csv -c contigs.fasta -o my_output -g "/path/to/GTDB-Tk/database/release207_v2/"
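
Before launching a run like the one above on your own data, it can save time to check the two input constraints mentioned earlier: contig names must be unique, and the FASTA headers must match the 2nd column of clusters.tsv. The sketch below is one way to do that with standard tools. The function names are hypothetical, and two things are assumptions: that a header is compared only up to its first whitespace, and that clusters.tsv is tab-separated with no header row (if yours has one, its column name will be reported as missing).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight checks; not part of MAGinator.

# Print FASTA header lines that occur more than once.
duplicate_headers() {
    grep '^>' "$1" | sort | uniq -d
}

# Print contig names from column 2 of clusters.tsv that have no FASTA header.
missing_contigs() {
    local clusters=$1 fasta=$2
    awk -F'\t' '
        FNR == 1 { file++ }                   # count which input file we are in
        file == 1 {                           # first file: the FASTA
            if (substr($0, 1, 1) == ">") {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)      # assumed: ID ends at first whitespace
                seen[name] = 1
            }
            next
        }
        !($2 in seen) { print $2 }            # second file: clusters.tsv
    ' "$fasta" "$clusters"
}

# Usage:
#   duplicate_headers contigs.fasta
#   missing_contigs vamb_clusters.tsv contigs.fasta
```

Both functions print nothing when the inputs are consistent, so any output is a reason to look at the files before starting a long run.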

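To make it easy to produce the input files MAGinator needs, the authors provide a snakemake workflow that preprocesses the metagenomic reads, assembles them, and bins the contigs (see the repository).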

Citation

MAGinator enables strain-level quantification of de novo MAGs

Trine Zachariasen, Jakob Russel, Charisse Petersen, Gisle A. Vestergaard, Shiraz Shah, Stuart E. Turvey, Søren J. Sørensen, Ole Lund, Jakob Stokholm, Asker Brejnrod, Jonathan Thorsen

bioRxiv, Posted August 28, 2023