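From the manual:
The GTDB-Tk de novo workflow infers bacterial and archaeal trees that contain the user-supplied genomes together with the GTDB-Tk reference genomes. The classify_wf workflow is recommended for obtaining taxonomic classifications; this workflow is recommended only when a de novo, domain-specific tree is required. It consists of five steps: identify, align, infer, root, and decorate. The identify and align steps are the same as in the classify workflow. The infer step uses FastTree with the WAG+GAMMA model to compute independent de novo bacterial and archaeal trees. These trees are then rooted using a user-specified outgroup and decorated with the GTDB taxonomy.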
de_novo_wf
https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html
Installation
#bioconda(link)
mamba create -n gtdbtk -c conda-forge -c bioconda gtdbtk -y
conda activate gtdbtk
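Before running a workflow it can help to confirm that the environment and the reference data are in place. The sketch below uses the check_install command listed in the help output further down. The GTDBTK_DATA_PATH fallback directory ($HOME/gtdbtk_data) is a hypothetical location, and the download-db.sh helper mentioned in the comment is assumed to be shipped with the bioconda package; verify both against your own installation.

```shell
# Keep an existing GTDBTK_DATA_PATH (conda usually sets it on activation);
# otherwise fall back to a hypothetical directory chosen for this example.
export GTDBTK_DATA_PATH="${GTDBTK_DATA_PATH:-$HOME/gtdbtk_data}"
echo "GTDB-Tk reference data path: $GTDBTK_DATA_PATH"

# Assumption: the bioconda package provides download-db.sh to fetch the
# reference data. Run it once if the data has not been downloaded yet.
# download-db.sh

# Verify third-party programs and the reference package, if gtdbtk is available.
if command -v gtdbtk >/dev/null 2>&1; then
    gtdbtk check_install || echo "check_install reported a problem" >&2
else
    echo "gtdbtk not found on PATH; run 'conda activate gtdbtk' first" >&2
fi
```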
> gtdbtk
...::: GTDB-Tk v2.1.0 :::...
Workflows:
classify_wf -> Classify genomes by placement in GTDB reference tree
(identify -> align -> classify)
de_novo_wf -> Infer de novo tree and decorate with GTDB taxonomy
(identify -> align -> infer -> root -> decorate)
Methods:
identify -> Identify marker genes in genome
align -> Create multiple sequence alignment
classify -> Determine taxonomic classification of genomes
infer -> Infer tree from multiple sequence alignment
root -> Root tree using an outgroup
decorate -> Decorate tree with GTDB taxonomy
Tools:
infer_ranks -> Establish taxonomic ranks of internal nodes using RED
ani_rep -> Calculates ANI to GTDB representative genomes
trim_msa -> Trim an untrimmed MSA file based on a mask
export_msa -> Export the untrimmed archaeal or bacterial MSA file
remove_labels -> Remove labels (bootstrap values, node labels) from an Newick tree
convert_to_itol -> Convert a GTDB-Tk Newick tree to an iTOL tree
Testing:
test -> Validate the classify_wf pipeline with 3 archaeal genomes
check_install -> Verify third party programs and GTDB reference package
Use: gtdbtk <command> -h for command specific help
> gtdbtk de_novo_wf
usage: gtdbtk de_novo_wf (--genome_dir GENOME_DIR | --batchfile BATCHFILE) (--bacteria | --archaea) --outgroup_taxon OUTGROUP_TAXON --out_dir OUT_DIR [-x EXTENSION] [--skip_gtdb_refs] [--taxa_filter TAXA_FILTER] [--min_perc_aa MIN_PERC_AA] [--custom_msa_filters]
[--cols_per_gene COLS_PER_GENE] [--min_consensus MIN_CONSENSUS] [--max_consensus MAX_CONSENSUS] [--min_perc_taxa MIN_PERC_TAXA] [--rnd_seed RND_SEED] [--prot_model {JTT,WAG,LG}] [--no_support] [--gamma]
[--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE] [--custom_taxonomy_file CUSTOM_TAXONOMY_FILE] [--write_single_copy_genes] [--prefix PREFIX] [--genes] [--cpus CPUS] [--force] [--tmpdir TMPDIR] [--keep_intermediates] [--debug] [-h]
mutually exclusive required arguments:
--genome_dir GENOME_DIR
directory containing genome files in FASTA format
--batchfile BATCHFILE
path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
mutually exclusive required arguments:
--bacteria process bacterial genomes (default: False)
--archaea process archaeal genomes (default: False)
required named arguments:
--outgroup_taxon OUTGROUP_TAXON
taxon to use as outgroup (e.g., p__Patescibacteria or p__Altarchaeota)
--out_dir OUT_DIR directory to output files
optional arguments:
-x, --extension EXTENSION
extension of files to process, gz = gzipped (default: fna)
--skip_gtdb_refs do not include GTDB reference genomes in multiple sequence alignment (default: False)
--taxa_filter TAXA_FILTER
filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)
--min_perc_aa MIN_PERC_AA
exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound) (default: 10)
--custom_msa_filters perform custom filtering of MSA with cols_per_gene, min_consensus max_consensus, and min_perc_taxa parameters instead of using canonical mask (default: False)
--cols_per_gene COLS_PER_GENE
maximum number of columns to retain per gene when generating the MSA (default: 42)
--min_consensus MIN_CONSENSUS
minimum percentage of the same amino acid required to retain column (inclusive bound) (default: 25)
--max_consensus MAX_CONSENSUS
maximum percentage of the same amino acid required to retain column (exclusive bound) (default: 95)
--min_perc_taxa MIN_PERC_TAXA
minimum percentage of taxa required to retain column (inclusive bound) (default: 50)
--rnd_seed RND_SEED random seed to use for selecting columns, e.g. 42
--prot_model {JTT,WAG,LG}
protein substitution model for tree inference (default: WAG)
--no_support do not compute local support values using the Shimodaira-Hasegawa test (default: False)
--gamma rescale branch lengths to optimize the Gamma20 likelihood (default: False)
--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE
file with GTDB-Tk classifications produced by the `classify` command
--custom_taxonomy_file CUSTOM_TAXONOMY_FILE
file indicating custom taxonomy strings for user genomes, that should contain any genomes belonging to the outgroup. Format: GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__
--write_single_copy_genes
output unaligned single-copy marker genes (default: False)
--prefix PREFIX prefix for all output files (default: gtdbtk)
--genes indicates input files contain called genes (skip gene calling) (default: False)
--cpus CPUS number of CPUs to use (default: 1)
--force continue processing if an error occurs on a single genome (default: False)
--tmpdir TMPDIR specify alternative directory for temporary files (default: /tmp)
--keep_intermediates keep intermediate files in the final directory (default: False)
--debug create intermediate files for debugging purposes (default: False)
-h, --help show help message
> gtdbtk convert_to_itol -h
usage: gtdbtk convert_to_itol --input_tree INPUT_TREE --output_tree OUTPUT_TREE [--debug] [-h]
required named arguments:
--input_tree INPUT_TREE
path to the unrooted tree in Newick format
--output_tree OUTPUT_TREE
path to output the tree
optional arguments:
--debug create intermediate files for debugging purposes (default: False)
-h, --help show help message
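How to run
1. Specify the directory of genomes in FASTA format, the FASTA file extension, the domain, the outgroup taxon (used to root the tree), and the output directory. Adding the optional --skip_gtdb_refs flag excludes the GTDB reference genomes. In that case, however, you must also pass --custom_taxonomy_file to supply taxonomy information in the format GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__ (this is required by de_novo_wf but not by classify_wf). Alternatively, if you supply taxa with --taxa_filter, only genomes belonging to the specified taxa are kept in the inferred tree; GTDB reference genomes belonging to those taxa are included as well. --prot_model selects the protein substitution model used for tree inference (JTT, WAG, or LG; default: WAG).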
gtdbtk de_novo_wf --genome_dir genomes/ --bacteria -x fna --outgroup_taxon p__Chloroflexota --taxa_filter p__Firmicutes --out_dir de_novo_output --cpus 20
- --genome_dir directory containing genome files in FASTA format
- --bacteria process bacterial genomes (default: False)
- --archaea process archaeal genomes (default: False)
- --outgroup_taxon taxon to use as outgroup (e.g., p__Patescibacteria or p__Altarchaeota)
- --out_dir directory to output files
- -x extension of files to process, gz = gzipped (default: fna)
- --skip_gtdb_refs do not include GTDB reference genomes in multiple sequence alignment (default: False)
- --taxa_filter filter GTDB genomes to taxa (comma separated) within specific taxonomic groups (e.g.: d__Bacteria or p__Proteobacteria,p__Actinobacteria)
- --custom_taxonomy_file file indicating custom taxonomy strings for user genomes
- --prot_model {JTT, WAG, LG} protein substitution model for tree inference (default: WAG)
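For the --skip_gtdb_refs case, the custom taxonomy file has to be prepared by hand. The following is a minimal sketch of writing such a file and checking that every line has the GENOME_ID<TAB>d__;p__;c__;o__;f__;g__;s__ shape. The genome IDs (genome_A, genome_B), the taxonomy strings, the file name custom_taxonomy.tsv, the output directory name, and the check_taxonomy_file helper are all illustrative assumptions, not part of GTDB-Tk. Note that the file should contain at least one genome belonging to the outgroup.

```shell
# Hypothetical genome IDs and illustrative taxonomy strings.
# genome_B stands in for a user genome that belongs to the outgroup.
taxonomy_file="custom_taxonomy.tsv"
printf '%s\t%s\n' \
  "genome_A" "d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus delbrueckii" \
  "genome_B" "d__Bacteria;p__Chloroflexota;c__Dehalococcoidia;o__Dehalococcoidales;f__Dehalococcoidaceae;g__Dehalococcoides;s__Dehalococcoides mccartyi" \
  > "$taxonomy_file"

# Helper (not a GTDB-Tk command): succeed only if every line has exactly
# two tab-separated fields and the second field has seven ';'-separated ranks.
check_taxonomy_file() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    NF != 2 { bad = 1; next }
    { if (split($2, ranks, ";") != 7) bad = 1 }
    END { exit bad }
  ' "$1"
}

check_taxonomy_file "$taxonomy_file" && echo "taxonomy file looks well formed"

# With the file in place, the run without GTDB references would look like this
# (left commented out because it needs your genomes and the reference data):
# gtdbtk de_novo_wf --genome_dir genomes/ --bacteria -x fna \
#   --outgroup_taxon p__Chloroflexota --skip_gtdb_refs \
#   --custom_taxonomy_file "$taxonomy_file" \
#   --out_dir de_novo_output_norefs --cpus 20
```

The check only validates the layout of the file; it does not confirm that the rank names exist in GTDB.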
Example output
gtdbtk.bac120.decorated.tree is the tree file (when running with --bacteria).
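A quick sanity check after the run is to confirm that each of your genomes actually appears as a leaf in the tree. The sketch below is a hypothetical helper, not a GTDB-Tk command. It assumes leaf labels in the Newick file are unquoted and immediately followed by a branch length (for example `genome_A:0.1`), and that you have a text file with one genome ID per line.

```shell
# Print every genome ID from the ID list that is NOT found as a leaf in the tree.
# A leaf is matched as "(ID:" or ",ID:" so that genome_1 does not match genome_10.
report_missing_leaves() {
  tree="$1"
  ids="$2"
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    grep -qF -e "(${id}:" -e ",${id}:" "$tree" || echo "$id"
  done < "$ids"
}

# Example call (the ID list file name and the tree location inside the
# output directory are assumptions):
# report_missing_leaves de_novo_output/gtdbtk.bac120.decorated.tree my_genome_ids.txt
```

An empty result means all listed genomes were found. Genomes that were dropped, for instance by the --min_perc_aa filter, would be reported here.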
2. The filter_tree.py script from QIIME 1 can be used to filter just the GTDB reference leaves out of gtdbtk.bac120.decorated.tree.
https://kazumaxneo.hatenablog.com/entry/2022/08/08/140937
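One way to prepare the filtering step is to build a list of the leaves to keep from the input genome directory. This is a sketch under two assumptions: that GTDB-Tk uses the FASTA file name minus its extension as the genome ID when --genome_dir is used, and that filter_tree.py accepts -i (input tree), -t (file of tips to keep), and -o (output tree). Confirm the options with `filter_tree.py -h`. The list_genome_ids helper and the file names are hypothetical.

```shell
# Print one genome ID per line for every file with the given extension.
list_genome_ids() {
  dir="$1"
  ext="$2"
  for f in "$dir"/*."$ext"; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    basename "$f" ".$ext"
  done
}

# Write the IDs of the user genomes to a tips file.
list_genome_ids genomes fna > tips_to_keep.txt

# Keep only those tips (commented out because it needs QIIME 1 and the tree;
# the tree path inside the output directory is an assumption):
# filter_tree.py -i de_novo_output/gtdbtk.bac120.decorated.tree \
#   -t tips_to_keep.txt -o user_genomes_only.tree
```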
3. After filtering, run the gtdbtk convert_to_itol command to prepare the tree for visualization in iTOL.
gtdbtk convert_to_itol --input_tree input.tree --output_tree output.tree
- --input_tree path to the unrooted tree in Newick format
- --output_tree path to output the tree
Load output.tree into iTOL.
Citation
GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database
Pierre-Alain Chaumeil, Aaron J Mussig, Philip Hugenholtz, Donovan H Parks
bioRxiv, Posted July 22, 2022.
Related