The gtdbtk de_novo_wf command - macでインフォマティクス

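From the manual

The gtdbtk de novo workflow infers bacterial and archaeal trees containing the user-supplied genomes and the GTDB-Tk reference genomes. The classify_wf workflow is recommended for obtaining taxonomic classifications; this workflow is recommended only when a de novo, domain-specific tree is needed. The workflow consists of five steps: identify, align, infer, root, and decorate. The identify and align steps are the same as in the classify workflow. The infer step uses FastTree with the WAG+GAMMA model to compute independent de novo bacterial and archaeal trees. These trees are then rooted with a user-specified outgroup and decorated with the GTDB taxonomy.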

de_novo_wf

https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html

Installation

Github

# bioconda
mamba create -n gtdbtk -c conda-forge -c bioconda gtdbtk -y
conda activate gtdbtk

> gtdbtk

...::: GTDB-Tk v2.1.0 :::...

Workflows:

classify_wf -> Classify genomes by placement in GTDB reference tree

(identify -> align -> classify)

de_novo_wf -> Infer de novo tree and decorate with GTDB taxonomy

(identify -> align -> infer -> root -> decorate)

Methods:

identify -> Identify marker genes in genome

align -> Create multiple sequence alignment

classify -> Determine taxonomic classification of genomes

infer -> Infer tree from multiple sequence alignment

root -> Root tree using an outgroup

decorate -> Decorate tree with GTDB taxonomy

Tools:

infer_ranks -> Establish taxonomic ranks of internal nodes using RED

ani_rep -> Calculates ANI to GTDB representative genomes

trim_msa -> Trim an untrimmed MSA file based on a mask

export_msa -> Export the untrimmed archaeal or bacterial MSA file

remove_labels -> Remove labels (bootstrap values, node labels) from an Newick tree

convert_to_itol -> Convert a GTDB-Tk Newick tree to an iTOL tree

Testing:

test -> Validate the classify_wf pipeline with 3 archaeal genomes

check_install -> Verify third party programs and GTDB reference package

Use: gtdbtk <command> -h for command specific help

> gtdbtk de_novo_wf

usage: gtdbtk de_novo_wf (--genome_dir GENOME_DIR | --batchfile BATCHFILE) (--bacteria | --archaea) --outgroup_taxon OUTGROUP_TAXON --out_dir OUT_DIR [-x EXTENSION] [--skip_gtdb_refs] [--taxa_filter TAXA_FILTER] [--min_perc_aa MIN_PERC_AA] [--custom_msa_filters]

[--cols_per_gene COLS_PER_GENE] [--min_consensus MIN_CONSENSUS] [--max_consensus MAX_CONSENSUS] [--min_perc_taxa MIN_PERC_TAXA] [--rnd_seed RND_SEED] [--prot_model {JTT,WAG,LG}] [--no_support] [--gamma]

[--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE] [--custom_taxonomy_file CUSTOM_TAXONOMY_FILE] [--write_single_copy_genes] [--prefix PREFIX] [--genes] [--cpus CPUS] [--force] [--tmpdir TMPDIR] [--keep_intermediates] [--debug] [-h]

mutually exclusive required arguments:

--genome_dir GENOME_DIR

directory containing genome files in FASTA format

--batchfile BATCHFILE

path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])

mutually exclusive required arguments:

--bacteria process bacterial genomes (default: False)

--archaea process archaeal genomes (default: False)

required named arguments:

--outgroup_taxon OUTGROUP_TAXON

taxon to use as outgroup (e.g., p__Patescibacteria or p__Altarchaeota)

--out_dir OUT_DIR directory to output files

optional arguments:

-x, --extension EXTENSION

extension of files to process, gz = gzipped (default: fna)

--skip_gtdb_refs do not include GTDB reference genomes in multiple sequence alignment (default: False)

--taxa_filter TAXA_FILTER

filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)

--min_perc_aa MIN_PERC_AA

exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound) (default: 10)

--custom_msa_filters perform custom filtering of MSA with cols_per_gene, min_consensus max_consensus, and min_perc_taxa parameters instead of using canonical mask (default: False)

--cols_per_gene COLS_PER_GENE

maximum number of columns to retain per gene when generating the MSA (default: 42)

--min_consensus MIN_CONSENSUS

minimum percentage of the same amino acid required to retain column (inclusive bound) (default: 25)

--max_consensus MAX_CONSENSUS

maximum percentage of the same amino acid required to retain column (exclusive bound) (default: 95)

--min_perc_taxa MIN_PERC_TAXA

minimum percentage of taxa required to retain column (inclusive bound) (default: 50)

--rnd_seed RND_SEED random seed to use for selecting columns, e.g. 42

--prot_model {JTT,WAG,LG}

protein substitution model for tree inference (default: WAG)

--no_support do not compute local support values using the Shimodaira-Hasegawa test (default: False)

--gamma rescale branch lengths to optimize the Gamma20 likelihood (default: False)

--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE

file with GTDB-Tk classifications produced by the `classify` command

--custom_taxonomy_file CUSTOM_TAXONOMY_FILE

file indicating custom taxonomy strings for user genomes, that should contain any genomes belonging to the outgroup. Format: GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__

--write_single_copy_genes

output unaligned single-copy marker genes (default: False)

--prefix PREFIX prefix for all output files (default: gtdbtk)

--genes indicates input files contain called genes (skip gene calling) (default: False)

--cpus CPUS number of CPUs to use (default: 1)

--force continue processing if an error occurs on a single genome (default: False)

--tmpdir TMPDIR specify alternative directory for temporary files (default: /tmp)

--keep_intermediates keep intermediate files in the final directory (default: False)

--debug create intermediate files for debugging purposes (default: False)

-h, --help show help message

> gtdbtk convert_to_itol -h

usage: gtdbtk convert_to_itol --input_tree INPUT_TREE --output_tree OUTPUT_TREE [--debug] [-h]

required named arguments:

--input_tree INPUT_TREE

path to the unrooted tree in Newick format

--output_tree OUTPUT_TREE

path to output the tree

optional arguments:

--debug create intermediate files for debugging purposes (default: False)

-h, --help show help message

Usage

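1. Specify the directory of genomes in FASTA format, the extension of the FASTA files, the domain, the outgroup taxon (used to root the tree), and the output directory. Adding the optional --skip_gtdb_refs flag excludes the GTDB reference genomes. In that case, however, you must also pass --custom_taxonomy_file and supply taxonomy information in the format GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__ (de_novo_wf requires this, classify_wf does not). Alternatively, if you specify taxa with --taxa_filter, only genomes belonging to those taxa are kept in the inferred tree; GTDB reference genomes belonging to those taxa are included as well. The --prot_model option selects the protein substitution model used for tree inference (JTT, WAG, or LG; default: WAG).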
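The custom taxonomy file is plain text, so its layout can be checked before starting a long run. The following is only a minimal sketch under the format quoted above (one genome per line, genome ID, a tab, then seven semicolon-separated ranks with the prefixes d__ to s__). The function name check_taxonomy_file is hypothetical and is not part of GTDB-Tk, and GTDB-Tk itself may accept or reject things this check does not cover.

```shell
# Hypothetical helper (not part of GTDB-Tk): verify that every line of a
# custom taxonomy file looks like GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__
# Prints the malformed lines and returns non-zero if any are found.
check_taxonomy_file() (
    awk -F '\t' '
        BEGIN { split("d__ p__ c__ o__ f__ g__ s__", want, " "); bad = 0 }
        {
            # The second column must hold exactly seven ranks.
            n = split($2, ranks, ";")
            ok = (NF == 2 && n == 7)
            # Each rank must start with the expected prefix, in order.
            for (i = 1; ok && i <= 7; i++)
                if (substr(ranks[i], 1, 3) != want[i]) ok = 0
            if (!ok) { printf "line %d: malformed entry\n", NR; bad = 1 }
        }
        END { exit bad }
    ' "$1"
)

# Example (custom_taxonomy.tsv is a placeholder file name):
#   check_taxonomy_file custom_taxonomy.tsv && echo "format looks fine"
```

The check only looks at the layout; it does not confirm that the taxon names exist in GTDB or that the outgroup genomes are present in the file.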

gtdbtk de_novo_wf --genome_dir genomes/ --bacteria -x fna --outgroup_taxon p__Chloroflexota --taxa_filter p__Firmicutes --out_dir de_novo_output --cpus 20

--genome_dir directory containing genome files in FASTA format
--bacteria process bacterial genomes (default: False)
--archaea process archaeal genomes (default: False)
--outgroup_taxon taxon to use as outgroup (e.g., p__Patescibacteria or p__Altarchaeota)
--out_dir directory to output files
-x extension of files to process, gz = gzipped (default: fna)
--skip_gtdb_refs do not include GTDB reference genomes in multiple sequence alignment (default: False)
--taxa_filter filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)
--custom_taxonomy_file file indicating custom taxonomy strings for user genomes
--prot_model {JTT, WAG, LG} protein substitution model for tree inference (default: WAG)
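
According to the help text, --batchfile can be used instead of --genome_dir and expects a tab-separated file with the FASTA path and a genome ID (plus an optional translation table column). As a rough sketch, such a file could be generated from a directory as follows. The function name make_batchfile is hypothetical, and using the file name without its extension as the genome ID is an assumption made here for illustration.

```shell
# Hypothetical helper (not part of GTDB-Tk): print a two-column batchfile
# (FASTA path <TAB> genome ID) for every file with the given extension.
# Assumption: the genome ID is the file name without its extension.
make_batchfile() (
    genome_dir="$1"
    ext="$2"
    for f in "$genome_dir"/*."$ext"; do
        [ -e "$f" ] || continue   # nothing matched the glob
        printf '%s\t%s\n' "$f" "$(basename "$f" ".$ext")"
    done
)

# Example (batchfile.tsv is a placeholder file name):
#   make_batchfile genomes fna > batchfile.tsv
#   gtdbtk de_novo_wf --batchfile batchfile.tsv --bacteria \
#       --outgroup_taxon p__Chloroflexota --out_dir de_novo_output
```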

Example output

gtdbtk.bac120.decorated.tree is the tree file (when run with --bacteria).

2. The filter_tree.py script from QIIME 1 can be used to filter out only the GTDB reference leaves from gtdbtk.bac120.decorated.tree.

https://kazumaxneo.hatenablog.com/entry/2022/08/08/140937
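
If the goal is to keep only your own genomes, a list of their tip names is needed for the filtering step. The sketch below pulls leaf labels out of a Newick file with grep and sed and drops those that look like GTDB reference genomes. It rests on two assumptions that should be checked against your tree: reference tips start with RS_ or GB_ (as in RS_GCF_... and GB_GCA_...), and leaf labels are unquoted and contain none of the characters ( ) , : ;. The function name list_user_tips is hypothetical.

```shell
# Hypothetical helper: list leaf labels in a Newick tree that do not look
# like GTDB reference genomes (assumed prefixes: RS_ and GB_).
# A leaf label is taken to be the text right after "(" or "," up to the
# next Newick delimiter; internal node labels follow ")" and are skipped.
list_user_tips() (
    tree_file="$1"
    grep -oE '[(,][^(),:;]+' "$tree_file" \
        | sed 's/^[(,]//' \
        | grep -vE '^(RS_|GB_)'
)

# Example (user_tips.txt is a placeholder file name):
#   list_user_tips gtdbtk.bac120.decorated.tree > user_tips.txt
```

The resulting list, one tip per line, can then be used as the set of tips to retain when filtering the tree.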

3. After filtering, run the gtdbtk convert_to_itol command to prepare the tree for visualization in iTOL.

gtdbtk convert_to_itol --input_tree input.tree --output_tree output.tree

--input_tree path to the unrooted tree in Newick format
--output_tree path to output the tree

Load output.tree into iTOL.

Citation

GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database
Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, Donovan H Parks

bioRxiv, Posted July 22, 2022.