2020-12-15

大規模な微生物ゲノムの系統推定を行う PhyloPhlAn 3.0（メタゲノムアセンブリにも対応）

2020 12/17 誤字修正

2021 1/25 help追記, linkミス修正

2021 11/10 説明追加

2023/08/23 チュートリアル追記

　単離物やメタゲノムアセンブリ、シングルセルのゲノム配列解読はますます加速しており、これらはすべて公的資源によって利用可能になってきている。これにより、人体や地球に影響を及ぼす微生物の多様性の全体的な特徴を把握するための貴重な洞察を得ることができる。微生物ゲノミクスでは、系統とそれに対応する分類学的特徴付けは、事前に表現型情報がなくてもゲノムの文脈を把握し、その遺伝的新規性や遺伝子型と表現型の関係を決定するために非常に重要である。最大規模では、完全な微生物の生命誌を再構築することは、どのような文脈においても進化の関係を理解する上で基本的なことであり、微生物群集研究においては、このようなリファレンスは、新規配列と健康や環境に関連する微生物との間の重要なリンクとなり得る。規模の大小にかかわらず、現在の多くの微生物ゲノムの課題には、新たに配列決定されたゲノムやメタゲノムを微生物分類学に位置づけ、最も近い近縁種との関連で系統的に特徴付ける必要性が含まれている。しかし、このような量の微生物ゲノムは、幅広い品質と完全性で生成されるため、ゲノムとメタゲノムを研究する研究者にとって、これらの課題に容易に取り組むことができるスケーラブルな系統分類法は存在しない。

　よりターゲットを絞った微生物ゲノム・メタゲノム系統解析のための多くの手法が存在する。これらには、より一般的なゲノムおよびゲノムベースの系統学のための多くの他のものの中で、PhyloPhlAn、PhyloSift、ezTree、GToTree、およびAMPHORAの最初の実装が含まれる。これらの方法のほとんどは、少なくとも一つの点で制限されており、新たに配列決定されたゲノムやメタゲノム配列を、すでに特徴づけられた微生物系統の膨大な空間にリンクさせることが容易ではない。例えば、異なるクローンで最適な分解能を達成するために、異なるゲノム領域を選択することはできない。これでは、一部のクレードではパフォーマンスが低下するだけでなく、系統レベルと系統レベルの配置に同じ方法を使用することができない。GToTreeは公開されている10万以上の微生物ゲノムと1万以上のメタゲノムから20万以上のメタゲノムアセンブリゲノム（MAG）の完全なセットを利用していない。分離株配列のゲノムアセンブリーや、メタゲノムデータの既知の特徴を定量的に解析するための計算手法は成熟し、標準化されているが、MAGや微生物分離株ゲノムの下流での系統分類学的評価や分類学的評価のための便利で自動化されたツールは不足しており、微生物ゲノム解析を制限しているのが現状である。

　これらのエンドツーエンドの系統別ソリューションはまた、ゲノム配置の個々のステップ（例えば、pplacerおよびSEPP）および分類学的評価のためのアルゴリズムおよび実装とは区別されるべきである。例としては、MUSCLE、MAFFT、T-Coffee、OPAL、PASTA、およびUPPのような多重配列アライメント（MSA）のためのアルゴリズム、およびFastTree、RAxML、ASTRAL、ASTRID、およびIQ-TREEのような系統再構成が挙げられる。それぞれのツールは、系統解析全体をステップバイステップで完全に制御することができるが、そのためには、計算系統解析に適したターゲット、パラメータ、ステップを特定するだけでなく、そのようなツールを他のツールとどのように連携させるべきかを理解するための相当な専門知識を必要とする。

　また、個々の研究で何千もの微生物ゲノムを生成する場合や、膨大な数のゲノムを組み合わせて検索して解析する場合には、これらのステップを別々に人間が管理して実行することは非現実的である。非常に効率的なアルゴリズムが提案されており、その中には、MLSTアプローチや種レベルのコア遺伝子のような少数の代表的なマーカー遺伝子のタイピングに基づくものも含まれている。例えば、計算上のMLSTは、各種について5～10の遺伝子座を用いて迅速に動作することができる。しかし、これは系統配置の精度を大幅に低下させる代償を伴う。Roaryのようなパンゲノームベースのプロファイリングは、種レベルでの系統樹のモデル化には非常に正確だが、より高いレベルのクレードには一般化できない。多様な種からの何千ものリファレンスゲノムを統合した系統樹は、微生物の集団構造と特徴をより正確に特徴づけることができ、分類学をより正確に導くことができる。全ゲノム大規模微生物系統樹は、特にpartialなアセンブリに強く、既存のゲノムとメタゲノムのアセンブリを統合することができるが、計算上の未解決の課題である。

　本研究では、新たにアセンブルされた微生物単離株とメタゲノムのコンテキスト化と特性化のための完全自動のエンドツーエンド系統解析フレームワークであるPhyloPhlAn 3.0を紹介する。PhyloPhlAn 3.0は、必要に応じて、公開リソースから何十万ものゲノムを検索して統合することができ、同時に何万ものメタゲノムから前処理された情報を取り込むことができる。UniRef90遺伝子ファミリーを使用して安定的に同定されたコアタンパク質の種特異的なセットを自動的に使用して、系統レベルの正確な系統を構築すると同時に、深い分岐や非常に大きなサイズの系統を推論するために数万ゲノムにスケーリングする。PhyloPhlAn 3.0は、系統と種のレベルで正確に系統を構築し、利用可能なゲノムの全セットにスケーリングする際には高速である。ゲノム分類学データベース(GTDB)のような代替と比較して、PhyloPhlAn 3.0は、NCBI 分類学に基づいて自動的にMAGの分類学的割り当てを実行し、ゲノムコンテキストタスクでは、無名で未確認の種を考慮することができる。

　PhyloPhlAn 3.0は、微生物（メタ）ゲノムの正確な系統分類学的コンテクスト化のための、使いやすく完全自動の手法を提供する（論文図1）。この手法では、単離配列からの微生物ゲノムとMAGを組み合わせた入力セットを考慮して、複数のレベルの分解能で系統樹を作成することができる。入力ゲノムとMAGの配置は、系統樹のde novo再構成によって行われる。関連株の高分解能系統樹については、PhyloPhlAn 3.0では、あらかじめ選択されたUniRef90遺伝子ファミリーの18,000セット以上から、種特異的なコア遺伝子を使用している。その代わりに、多様性の高いゲノムについては、400の最も普遍的なマーカー（ref. 1,26）を使用し、より積極的なアラインメントトリミングオプションを使用している（論文の方法を参照）。また、多分解能系統再構成は、入力ゲノムまたはMAGに、属から種レベルまでの分類学的ラベルを割り当てるアプローチの中核となっており、15万以上のMAGと、PhyloPhlAn 3.0データベースに統合された8万以上の参照ゲノムを利用している。このパイプラインでは、利用可能な全ゲノム微生物データを統合し、特定のタスクの特徴と規模に応じていくつかの方法論的進歩を採用することで、入力ゲノムを系統的に文脈化することができる（方法を参照）。PhyloPhlAn 3.0は、内部ステップの特定の方法論的選択に縛られることはない。ユーザーは、配列マッピングツール、MSAのツール、アラインメントの後処理のツールをそれぞれ複数のツールの中から選択でき、系統モデルについては、連結されたアラインメントに適用される最尤法から複数の異なるマーカーの情報を統合した遺伝子ツリーアプローチまで網羅している。系統樹に加えて、分類学的割り当てから種レベルのゲノムビン（SGB）、生の多重整列アラインメント、系統樹内のゲノムの突然変異率の統計に至るまで、豊富な出力が提供される。PhyloPhlAn 3.0は、最初のPhyloPhlAnバージョンを完全に書き換えたものであり、多くの機能と、数万の入力ゲノムとMAGへの拡張機能は、新しいバージョンに特有のものである（論文の補足データ1）。

https://huttenhower.sph.harvard.edu/phylophlan/

Segata Lab - Computational Metagenomics

User manual

https://github.com/biobakery/phylophlan/wiki

example data

https://zenodo.org/record/3820090#.XtMVsi_CPRY

PhyloPhlAn 3.0 user manual https://t.co/yAt8OAgqFs and tutorials https://t.co/AIwfqdQQCk
User support through the bioBakery help forum https://t.co/pawtRLxJCt and https://t.co/Ut1aEMOAzr pic.twitter.com/aebJuC9sD1
— Segata Lab (@cibiocm) 2020年4月14日

インストール

依存

Github

#bioconda(link) ここでは高速なmambaを使う
mamba create -n phylophlan python=3.8 -y
conda activate phylophlan
mamba install -c bioconda phylophlan -y
#任意でmashも導入
mamba install -c bioconda mash -y

#from source
git clone https://github.com/biobakery/phylophlan
cd phylophlan
python setup.py install

#graphlanも使う
mamba create -n graphlan python=2.7 -y
conda activate graphlan
mamba install -c bioconda graphlan -y

> phylophlan -h

$ phylophlan -h

usage: phylophlan [-h] [-i INPUT | -c CLEAN] [-o OUTPUT] [-d DATABASE]

[-t {n,a}] [-f CONFIG_FILE] --diversity {low,medium,high}

[--accurate | --fast] [--clean_all] [--database_list]

[-s SUBMAT] [--submat_list] [--submod_list] [--nproc NPROC]

[--min_num_proteins MIN_NUM_PROTEINS]

[--min_len_protein MIN_LEN_PROTEIN]

[--min_num_markers MIN_NUM_MARKERS]

[--trim {gap_trim,gap_perc,not_variant,greedy}]

[--gap_perc_threshold GAP_PERC_THRESHOLD]

[--not_variant_threshold NOT_VARIANT_THRESHOLD]

[--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent}]

[--unknown_fraction UNKNOWN_FRACTION]

[--scoring_function {trident,muscle,random}] [--sort]

[--remove_fragmentary_entries]

[--fragmentary_threshold FRAGMENTARY_THRESHOLD]

[--min_num_entries MIN_NUM_ENTRIES] [--maas MAAS]

[--remove_only_gaps_entries] [--mutation_rates]

[--force_nucleotides] [--input_folder INPUT_FOLDER]

[--data_folder DATA_FOLDER]

[--databases_folder DATABASES_FOLDER]

[--submat_folder SUBMAT_FOLDER]

[--submod_folder SUBMOD_FOLDER]

[--configs_folder CONFIGS_FOLDER]

[--output_folder OUTPUT_FOLDER]

[--genome_extension GENOME_EXTENSION]

[--proteome_extension PROTEOME_EXTENSION] [--update]

[--verbose] [-v]

PhyloPhlAn is an accurate, rapid, and easy-to-use method for large-scale

microbial genome characterization and phylogenetic analysis at multiple levels

of resolution. PhyloPhlAn can assign finished, draft, or metagenome-assembled

genomes (MAGs) to species-level genome bins (SGBs). For individual clades of

interest (e.g. newly sequenced genome sets), PhyloPhlAn reconstructs strain-

level phylogenies from among the closest species using clade-specific

maximally informative markers. At the other extreme of resolution, PhyloPhlAn

scales to very-large phylogenies comprising >17,000 microbial species

optional arguments:

-h, --help show this help message and exit

-i INPUT, --input INPUT

Folder containing your input genomes and/or proteomes

(default: None)

-c CLEAN, --clean CLEAN

Clean the output and partial data produced for the

specified project (default: None)

-o OUTPUT, --output OUTPUT

Output folder name, otherwise it will be the name of

the input folder concatenated with the name of the

database used (default: None)

-d DATABASE, --database DATABASE

The name of the database of markers to use (default:

None)

-t {n,a}, --db_type {n,a}

Specify the type of the database of markers, where "n"

stands for nucleotides and "a" for amino acids. If not

specified, PhyloPhlAn will automatically detect the

type of database (default: None)

-f CONFIG_FILE, --config_file CONFIG_FILE

The configuration file to load, four ready-to-use

configuration files can be generated using the

"write_default_configs.sh" script present in the

"configs" folder (default: None)

--diversity {low,medium,high}

Specify the expected diversity of the phylogeny,

automatically adjust some parameters: "low": for

genus-/species-/strain-level phylogenies; "medium":

for class-/order-level phylogenies; "high": for

phylum-/tree-of-life size phylogenies (default: None)

--accurate Use more phylogenetic signal which can result in more

accurate phylogeny; affected parameters depend on the

"--diversity" level (default: False)

--fast Perform more a faster phylogeny reconstruction by

reducing the phylogenetic positions to use; affected

parameters depend on the "--diversity" level (default:

False)

--clean_all Remove all installation and database files

automatically generated (default: False)

--database_list List of all the available databases that can be

specified with the -d/--database option (default:

False)

-s SUBMAT, --submat SUBMAT

Specify the substitution matrix to use, available

substitution matrices can be listed with "--

submat_list" (default: None)

--submat_list List of all the available substitution matrices that

can be specified with the -s/--submat option (default:

False)

--submod_list List of all the available substitution models that can

be specified with the --maas option (default: False)

--nproc NPROC The number of cores to use (default: 1)

--min_num_proteins MIN_NUM_PROTEINS

Proteomes with less than this number of proteins will

be discarded (default: 1)

--min_len_protein MIN_LEN_PROTEIN

Proteins in proteomes shorter than this value will be

discarded (default: 50)

--min_num_markers MIN_NUM_MARKERS

Input genomes or proteomes that map to less than the

specified number of markers will be discarded

(default: 1)

--trim {gap_trim,gap_perc,not_variant,greedy}

Specify which type of trimming to perform: "gap_trim":

execute what specified in the "trim" section of the

configuration file; "gap_perc": remove columns with a

percentage of gaps above a certain threshold (see "--

gap_perc_threshold" parameter)"not_variant": remove

columns with at least one nucleotide/amino acid

appearing above a certain threshold (see "--

not_variant_threshold" parameter); "greedy": performs

all the above trimming steps; If not specified, no

trimming will be performed (default: None)

--gap_perc_threshold GAP_PERC_THRESHOLD

Specify the value used to consider a column not

variant when "--trim not_variant" is specified

(default: 0.67)

--not_variant_threshold NOT_VARIANT_THRESHOLD

Specify the value used to consider a column not

variant when "--trim not_variant" is specified

(default: 0.99)

--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent}

The number of positions to retain from each single

marker, available option are: "phylophlan": specific

number of positions for each PhyloPhlAn marker (only

when "--database phylophlan"); "onethousand": return

the top 1000 positions; "sevenhundred": return the top

700; "fivehundred": return the top 500; "threehundred"

return the top 300; "onehundred": return the top 100

positions; "fifty": return the top 50 positions;

"twentyfive": return the top 25 positions;

"fiftypercent": return the top 50 percent positions;

"twentyfivepercent": return the top 25% positions;

"tenpercent": return the top 10% positions; If not

specified, the complete alignment will be used

(default: None)

--unknown_fraction UNKNOWN_FRACTION

Define the amount of unknowns ("X" and "-") allowed in

each column of the MSA of the markers (default: 0.3)

--scoring_function {trident,muscle,random}

Specify which scoring function to use to evaluate

columns in the MSA results (default: None)

--sort If specified, the markers will be ordered, when using

the PhyloPhlAn database, it will be automatically set

to "True" (default: False)

--remove_fragmentary_entries

If specified the MSAs will be checked and cleaned from

fragmentary entries. See --fragmentary_threshold for

the threshold values above which an entry will be

considered fragmentary (default: False)

--fragmentary_threshold FRAGMENTARY_THRESHOLD

The fraction of gaps in the MSA to be considered

fragmentary and hence discarded (default: 0.85)

--min_num_entries MIN_NUM_ENTRIES

The minimum number of entries to be present for each

of the markers in the database (default: 4)

--maas MAAS Select a mapping file that specifies the substitution

model of amino acid to use for each of the markers for

the gene tree reconstruction. File must be tab-

separated (default: None)

--remove_only_gaps_entries

If specified, entries in the MSAs composed only of

gaps ("-") will be removed. This is equivalent to

specify "--remove_fragmentary_entries

--fragmentary_threshold 1" (default: False)

--mutation_rates If specified will produced a mutation rates table for

each of the aligned markers and a summary table for

the concatenated MSA. This operation can take a long

time to finish (default: False)

--force_nucleotides If specified force PhyloPhlAn to use nucleotide

sequences for the phylogenetic analysis, even in the

case of a database of amino acids (default: False)

--update Update the databases file (default: False)

--verbose Makes PhyloPhlAn verbose (default: False)

-v, --version Prints the current PhyloPhlAn version and exit

Folder paths:

Parameters for setting the folder locations

--input_folder INPUT_FOLDER

Path to the folder containing the input data (default:

input/)

--data_folder DATA_FOLDER

Path to the folder where to store the intermediate

files, default is "tmp" inside the project's output

folder (default: None)

--databases_folder DATABASES_FOLDER

Path to the folder containing the database files

(default: phylophlan_databases/)

--submat_folder SUBMAT_FOLDER

Path to the folder containing the substitution

matrices to use to compute the column score for the

subsampling step (default:

phylophlan_substitution_matrices/)

--submod_folder SUBMOD_FOLDER

Path to the folder containing the mapping file with

substitution models for each marker for the gene tree

building (default: phylophlan_substitution_models/)

--configs_folder CONFIGS_FOLDER

Path to the folder containing the configuration files

(default: phylophlan_configs/)

--output_folder OUTPUT_FOLDER

Path to the output folder where to save the results

(default: )

Filename extensions:

Parameters for setting the extensions of the input files

--genome_extension GENOME_EXTENSION

Extension for input genomes (default: .fna)

--proteome_extension PROTEOME_EXTENSION

Extension for input proteomes (default: .faa)

> phylophlan_metagenomic -h

$ phylophlan_metagenomic -h

sh: 0: getcwd() failed: No such file or directory

usage: phylophlan_metagenomic [-h] [-i INPUT] [-o OUTPUT_PREFIX] [-d DATABASE]

[--database_list] [--database_update]

[-e INPUT_EXTENSION] [-n HOW_MANY]

[--nproc NPROC]

[--database_folder DATABASE_FOLDER]

[--only_input] [--add_ggb] [--add_fgb]

[--overwrite] [--verbose] [-v]

The phylophlan_metagenomic.py script assign SGB and taxonomy to a given set of

input genomes. Outputs can be of three types: (1) for each input genomes

returns the list of the closest -n/--how_many SGBs sorted by average Mash

distance; (2) for each input genomes returns the closest SGB, GGB, FGB, and

reference genomes; (3) returns a all vs. all matrix with all the pairwise mash

distances

optional arguments:

-h, --help show this help message and exit

-i INPUT, --input INPUT

Input folder containing the metagenomic bins to be

indexed (default: None)

-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX

Prefix used for the output folders: indexed bins,

distance estimations. If not specified, the input

folder will be used (default: None)

-d DATABASE, --database DATABASE

Database name, available options can be listed using

the --database_list parameter (default: None)

--database_list List of all the available databases that can be

specified with the -d/--database option (default:

False)

--database_update Update the databases file (default: False)

-e INPUT_EXTENSION, --input_extension INPUT_EXTENSION

Specify the extension of the input file(s) specified

via -i/--input. If not specified will try to infer it

from the input files (default: None)

-n HOW_MANY, --how_many HOW_MANY

Specify the number of SGBs to report in the output;

"all" is a special value to report all the SGBs; this

param is not used when "--only_input" is specified

(default: 10)

--nproc NPROC The number of CPUs to use (default: 1)

--database_folder DATABASE_FOLDER

Path to the folder that contains the database file

(default: phylophlan_databases/)

--only_input If specified provides a distance matrix between only

the input genomes provided (default: False)

--add_ggb If specified adds GGB assignments. If specified with

--add_fgb, then -n/--how_many will be set to 1 and

will be adding a column that reports the closest

reference genome (default: False)

--add_fgb If specified adds FGB assignments. If specified with

--add_ggb, then -n/--how_many will be set to 1 and

will be adding a column that reports the closest

reference genome (default: False)

--overwrite If specified overwrite the output file if exists

(default: False)

--verbose Prints more stuff (default: False)

-v, --version Prints the current phylophlan_metagenomic.py version

and exit

> phylophlan_get_reference -h

$ phylophlan_get_reference -h

usage: phylophlan_get_reference [-h] [-g GET | -l] [--database_update]

[-e OUTPUT_FILE_EXTENSION] [-o OUTPUT]

[-n HOW_MANY] [-m GENBANK_MAPPING] [--verbose]

[-v]

The phylophlan_get_reference.py script allows to download a specified number

(-n/--how_many) of reference genomes from the Genbank repository. Special case

"all" allows to download a specified number of reference genomes for all

available taxonomic species. With the -l/--list_clades params the

phylophlan_get_reference.py scripts returns the list of all species in the

database

optional arguments:

-h, --help show this help message and exit

-g GET, --get GET Specify the taxonomic label for which download the set

of reference genomes. The label must represent a valid

taxonomic level or the special case "all" (default:

None)

-l, --list_clades Print for all taxa the total number of species and

reference genomes available (default: False)

--database_update Update the databases file (default: False)

-e OUTPUT_FILE_EXTENSION, --output_file_extension OUTPUT_FILE_EXTENSION

Specify path to the extension of the output files

(default: .fna.gz)

-o OUTPUT, --output OUTPUT

Specify path to the output folder where to save the

files, required when -g/--get is specified (default:

None)

-n HOW_MANY, --how_many HOW_MANY

Specify how many reference genomes to download, where

-1 stands for "all available" (default: 4)

-m GENBANK_MAPPING, --genbank_mapping GENBANK_MAPPING

The local GenBank mapping file, if not found it will

be automatically downloaded (default:

assembly_summary_genbank.txt)

--verbose Prints more stuff (default: False)

-v, --version Prints the current phylophlan_get_reference.py version

and exit

> phylophlan_write_config_file -h

$ phylophlan_write_config_file -h

usage: phylophlan_write_config_file [-h] -o OUTPUT -d {n,a} (--db_dna {makeblastdb} | --db_aa {usearch,diamond}) [--map_dna {blastn,tblastn,diamond}] [--map_aa {usearch,diamond}] --msa {muscle,mafft,opal,upp}

[--trim {trimal}] [--gene_tree1 {fasttree,raxml,iqtree}] [--gene_tree2 {raxml}] --tree1 {fasttree,raxml,iqtree,astral,astrid} [--tree2 {raxml}] [-a] [--force_nucleotides] [--overwrite]

[--citation] [--verbose] [-v]

The phylophlan_write_config_file.py script generates a configuration file to be used with the phylophlan.py script. It implements some standard parameters for the software integrated, but if needed, the parameters of the

selected software can be added/modified/removed by editing the generated configuration file using a text editor

optional arguments:

-h, --help show this help message and exit

-o OUTPUT, --output OUTPUT

Specify the output file where to write the configurations (default: None)

-d {n,a}, --db_type {n,a}

Specify the type of the database, where "n" stands for nucleotides and "a" for amino acids (default: None)

--db_dna {makeblastdb}

Add the "db_dna" section of the selected software that will be used for building the indexed database (default: None)

--db_aa {usearch,diamond}

Add the "db_aa" section of the selected software that will be used for building the indexed database (default: None)

--map_dna {blastn,tblastn,diamond}

Add the "map_dna" section of the selected software that will be used for mapping the database against the input genomes (default: None)

--map_aa {usearch,diamond}

Add the "map_aa" section of the selected software that will be used for mapping the database against the input proteomes (default: None)

--msa {muscle,mafft,opal,upp}

Add the "msa" section of the selected software that will be used for producing the MSAs (default: None)

--trim {trimal} Add the "trim" section of the selected software that will be used for the gappy regions removal of the MSAs (default: None)

--gene_tree1 {fasttree,raxml,iqtree}

Add the "gene_tree1" section of the selected software that will be used for building the phylogenies for the markers in the database (default: None)

--gene_tree2 {raxml} Add the "gene_tree2" section of the selected software that will be used for refining the phylogenies previously built with what specified in the "gene_tree1" section (default: None)

--tree1 {fasttree,raxml,iqtree,astral,astrid}

Add the "tree1" section of the selected software that will be used for building the first phylogeny (default: None)

--tree2 {raxml} Add the "tree2" section of the selected software that will be used for refining the phylogeny previously built with what specified in the "tree1" section (default: None)

-a, --absolute_path Write the absolute path to the executable instead of the executable name as found in the system path environment (default: False)

--force_nucleotides If specified sets parameters for phylogenetic analysis software so that they use nucleotide sequences, even in the case of a database of amino acids (default: None)

--overwrite Overwrite output file if it exists (default: False)

--citation Show citation

--verbose Prints more stuff (default: False)

-v, --version Prints the current phylophlan_write_config_file.py version and exit

> phylophlan_setup_database -h

$ phylophlan_setup_database -h

usage: phylophlan_setup_database [-h] [-i INPUT | -g GET_CORE_PROTEINS] [--database_update] [-o OUTPUT] [-d DB_NAME] [-e INPUT_EXTENSION] [-t {n,a}] [-x OUTPUT_EXTENSION] [--overwrite] [--citation] [--verbose] [-v]

The phylophlan_setup_database.py script can be used to either format an input folder or multi-fasta file to be used as database in phylophlan.py, or automatically download a pre-identified set of core UniRef90 proteins for

the taxonomic label of a given species

optional arguments:

-h, --help show this help message and exit

-i INPUT, --input INPUT

Specify the path to either the folder containing the marker files or the file of markers, in (multi-)fasta format (default: None)

-g GET_CORE_PROTEINS, --get_core_proteins GET_CORE_PROTEINS

Specify the taxonomic label for which download the set of core proteins. The label must represent a species: "--get_core_proteins s__Escherichia_coli" (default: None)

--database_update Update the databases file (default: False)

-o OUTPUT, --output OUTPUT

Specify path to the output folder where to save the database (default: None)

-d DB_NAME, --db_name DB_NAME

Specify the name of the output database (default: None)

-e INPUT_EXTENSION, --input_extension INPUT_EXTENSION

Specify the extension of the input file(s) specified via -i/--input (default: None)

-t {n,a}, --db_type {n,a}

Specify the type of the database, where "n" stands for nucleotides and "a" for amino acids (default: None)

-x OUTPUT_EXTENSION, --output_extension OUTPUT_EXTENSION

Set the database output extension (default: None)

--overwrite If specified and the output file exists it will be overwritten (default: False)

--citation Show citation

--verbose Prints more stuff (default: False)

-v, --version Prints the current phylophlan_setup_database.py version and exit

graphlanのヘルプ

> graphlan.py -h

# graphlan.py -h

usage: graphlan.py [-h] [--format ['output_image_format']]

[--warnings WARNINGS] [--positions POSITIONS]

[--dpi image_dpi] [--size image_size] [--pad pad_in]

[--external_legends] [--avoid_reordering] [-v]

input_tree output_image

GraPhlAn 1.1.3 (5 June 2018) AUTHORS: Nicola Segata (nsegata@hsph.harvard.edu)

positional arguments:

input_tree the input tree in PhyloXML format

output_image the output image, the format is guessed from the

extension unless --format is given. Available file

formats are: png, pdf, ps, eps, svg

optional arguments:

-h, --help show this help message and exit

--format ['output_image_format']

set the format of the output image (default none

meaning that the format is guessed from the output

file extension)

--warnings WARNINGS set whether warning messages should be reported or not

(default 1)

--positions POSITIONS

set whether the absolute position of the points should

be reported on the standard output. The two

cohordinates are r and theta

--dpi image_dpi the dpi of the output image for non vectorial formats

--size image_size the size of the output image (in inches, default 7.0)

--pad pad_in the distance between the most external graphical

element and the border of the image

--external_legends specify whether the two external legends should be put

in separate file or keep them along with the image

(default behavior)

--avoid_reordering specify whether the tree will be reorder or not

(default the tree will be reordered)

-v, --version Prints the current GraPhlAn version and exit

> graphlan_annotate.py -h

# graphlan_annotate.py -h

usage: graphlan_annotate.py [-h] [--annot annotation_file] [-v]

input_tree [output_tree]

GraPhlAn annotate module 1.1.3 (5 June 2018) AUTHORS: Nicola Segata

(nsegata@hsph.harvard.edu)

positional arguments:

input_tree the input tree in Newick, Nexus, PhyloXML or plain

text format

output_tree the output tree in PhyloXML format containing the

newly added annotations. If not specified, the input

tree file will be overwritten

optional arguments:

-h, --help show this help message and exit

--annot annotation_file

specify the annotation file

-v, --version Prints the current GraPhlAn version and exit

全部で５つのチュートリアルが公開されている。

チュートリアル１- 単離ゲノム

Staphylococcus aureusの系統解析

１、解析対象のデータとデータベースの準備

#レポジトリをダウンロードしてチュートリアルディレクトリに移動。
git clone https://github.com/biobakery/phylophlan.git
cd phylophlan/phylophlan/examples/01_saureus/

#解析対象のゲノムのダウンロード
mkdir -p input_isolates
for i in $(cut -f4 135_saureus_isolates.tsv | sed 1d); do
 o=`basename $i | cut -f1 -d'.'`
 wget $i -O input_isolates/${o}.fna.gz
done

input_isolates/に135個のgenome fasta ファイルが保存される。

f:id:kazumaxneo:20210125172637p:plain

２、

#species:Staphylococcus_aureusについて、 UniRef90 core proteinsデータベースを作成。core protein.faaファイルがダウンロードされデータベースが作成される（sでspeciesランク）
phylophlan_setup_database -g s__Staphylococcus_aureus -o ./ \
 --verbose 2>&1 | tee logs/phylophlan_setup_database.log

３、phylophlanのランにはゲノムのfasta（またはアミノ酸 fasta）とconfigファイルが必要になる。phylophlan_write_config_fileを使ってconfigファイルを作成する。

#ランに必要なconfigファイルを作成する。使用するツールやパラメータ設定を指定する。
#protein
phylophlan_write_config_file \
 -o isolates_config.cfg \
 -d a \
 --force_nucleotides \
 --db_aa diamond \
 --map_aa diamond \
 --map_dna diamond \
 --msa mafft \
 --trim trimal \
 --tree1 fasttree \
 --tree2 raxml

#tutorialではproteinだがDNAの時は
phylophlan_write_config_file \
 -o isolates_config.cfg \
 -d n \
 --force_nucleotides \
 --db_aa diamond \
 --map_aa diamond \
 --map_dna blastn \
 --msa mafft \
 --trim trimal \
 --tree1 fasttree \
 --tree2 raxml

-g Specify the taxonomic label for which download the set of reference genomes. The label must represent a valid taxonomic level or the special case "all" (default: None)

４、phylophlanのラン（Staphylococcus aureusの系統再構成）

phylophlan \
 -i input_isolates \
 -o output_isolates \
 -d s__Staphylococcus_aureus \
 --trim greedy \
 --not_variant_threshold 0.99 \
 --remove_fragmentary_entries \
 --fragmentary_threshold 0.67 \
 --min_num_entries 135 \
 -t a \
 -f isolates_config.cfg \
 --diversity low \
 --nproc 20 \
 --proteome_extension faa \
 --verbose 2>&1 | tee logs/phylophlan__output_isolates.log

-i Folder containing your input genomes and/or proteomes (default: None)
-c Clean the output and partial data produced for the specified project (default: None)
-o Output folder name, otherwise it will be the name of the input folder concatenated with the name of the database used (default: None)
-d The name of the database of markers to use (default:None)
-t {n, a} Specify the type of the database of markers, where "n" stands for nucleotides and "a" for amino acids. If not specified, PhyloPhlAn will automatically detect the type of database (default: None)
-f The configuration file to load, four ready-to-use configuration files can be generated using the "write_default_configs.sh" script present in the "configs" folder (default: None)
--trim {gap_trim, gap_perc, not_variant, greedy} Specify which type of trimming to perform: "gap_trim": execute what specified in the "trim" section of the configuration file; "gap_perc": remove columns with a percentage of gaps above a certain threshold (see "-- gap_perc_threshold" parameter)"not_variant": remove columns with at least one nucleotide/amino acid appearing above a certain threshold (see "--not_variant_threshold" parameter); "greedy": performs all the above trimming steps; If not specified, no trimming will be performed (default: None)
--diversity {low, medium, high} Specify the expected diversity of the phylogeny, automatically adjust some parameters: "low": for genus-/species-/strain-level phylogenies; "medium":for class-/order-level phylogenies; "high": for phylum/tree-of-life size phylogenies (default: None)
--force_nucleotides If specified force PhyloPhlAn to use nucleotide sequences for the phylogenetic analysis, even in the case of a database of amino acids (default: False)
--nproc The number of cores to use (default: 1)
--verbose Makes PhyloPhlAn verbose (default: False)
--genome_extension Extension for input genomes (default: .fna)
--proteome_extension Extension for input proteomes (default: .faa)

phylophlanの出力　　　

f:id:kazumaxneo:20201215214000p:plain

公開されているリファレンスゲノムを加えたい場合はphylophlan_get_referenceを使う。

phylophlan_get_reference \
 -g s__Staphylococcus_aureus \
 -o input_references/ \
 -n 1000 \
 --verbose 2>&1 | tee logs/phylophlan_get_reference.log
#input_referencesディレクトリにGCA-genome.fnaがダウンロードされる（2020/12テスト時は135 genome）。

#step1で作ったinput_isolatesディレクトリのゲノムを今ダウンロードしたinput_referencesディレクトリにコピー
cp -a input_isolates/* input_references/

３、graphlanのgraphlan_annotate.pyコマンドで注釈をつけ、graphlan.pyコマンドでメタデータ付きの系統樹を出力　

#graphlanで注釈をつける
graphlan_annotate.py \
 --annot graphlan/isolates_annotation.txt \
 output_isolates/RAxML_bestTree.input_isolates_refined.tre \
 graphlan/isolates_annotated.xml

#視覚化
graphlan.py \
 --dpi 300 \
 graphlan/isolates_annotated.xml \
 graphlan/saureus_isolates.png

graphlanの出力

f:id:kazumaxneo:20201216114342p:plain

チュートリアル２- バクテリアの系統再構成

１、phylophlan_get_referenceを使ってゲノムを準備する。-gでtaxonを指定する。

phylophlan_get_reference -g all -o input_genomes/ -n 1 \
 --verbose 2>&1 | tee logs/phylophlan_get_reference.log

#-gで特定の分類を指定。-n -1で利用可能な全ゲノム。Escherichia属を指定。
phylophlan_get_reference -g g__escherichia -o input_genomes/ -n -1

-g Specify the taxonomic label for which download the set of reference genomes. The label must represent a valid taxonomic level or the special case "all" (default: None)
-l Print for all taxa the total number of species and reference genomes available (default: False)
--database_update Update the databases file (default: False)
-e Specify path to the extension of the output files (default: .fna.gz)
-o Specify path to the output folder where to save the files, required when -g/--get is specified (default: None)
-n Specify how many reference genomes to download, where -1 stands for "all available" (default: 4)

input_genomes/に9000超のゲノム（.fna.gz）がダウンロードされる。手持ちのゲノムも含めて解析したい場合、このディレクトリに追加してから２以降をランすればよい。

２、phylophlanの実行。ランにはゲノムのfasta（またはアミノ酸 fasta）とconfigファイルが必要。テストデータではphylophlan/examples/02/tol_config.cfgがこれに相当する。

tol_config.cfg

f:id:kazumaxneo:20200531190849p:plain

３、ゲノムのディレクトリとデータベースディレクトリ、そしてconfigファイルを指定してphylophlanを実行する。

phylophlan -i input_genomes \
 -d phylophlan \
 -f tol_config.cfg \
 --diversity high \
 --fast \
 -o output \
 --nproc 24 \
 --verbose 2>&1 | tee logs/phylophlan.log

-d The name of the database of markers to use (default: None)
-f The configuration file to load, four ready-to-use configuration files can be generated using the "write_default_configs.sh" script present in the "configs" folder (default: None)
--diversity {low,medium,high} Specify the expected diversity of the phylogeny,
automatically adjust some parameters: "low": for genus-/species-/strain-level phylogenies; "medium": for class-/order-level phylogenies; "high": for phylum-/tree-of-life size phylogenies (default: None)
--nproc The number of cores to use (default: 1)

拡張子がfastaなら --genome_extension fastaと指定する。結果はoutput/ に出力される（*１）。

３、prokaryotesの系統再構成。

phylophlan \
 -i input_genomes \
 -d phylophlan \
 -f 02_tol.cfg \
 --diversity high \
 --fast \
 -o output_tol \
 --nproc 16 \
 --verbose 2>&1 | tee logs/phylophlan.log

チュートリアル3 - メタゲノム

１、データの準備 (PRJNA504891が使われている)

wget https://www.dropbox.com/s/fuafzwj67tguj31/ethiopian_mags.tar.bz2?dl=1 -O ethiopian_mags.tar.bz2
mkdir -p input_metagenomic
tar -xjf ethiopian_mags.tar.bz2 -C input_metagenomic/

２、Mashを使ってmash djistanceが95%を超えるsingle taxnomyをアサインする。phylophlan_metagenomicスクリプトを使う。

phylophlan_metagenomic -i input_metagenomic \
 -o output_metagenomic \
 --nproc 20 \
 -n 1 \
 -d SGB.Jan19 \
 --verbose 2>&1 | tee logs/phylophlan_metagenomic.log

-d Database name, available options can be listed using the --database_list parameter (default: None)

３、メタゲノムで見つかったトップ21のSGBを可視化。

phylophlan_draw_metagenomic -i output_metagenomic.tsv \
 -o output_heatmap \
 --map bin2meta.tsv \
 --top 20 \
 --verbose 2>&1 | tee logs/phylophlan_draw_metagenomic.log

各ゲノムビンと、それがアセンブルされたメタゲノムとのマッピングファイルを用意して指定している。マッピングファイルはタブ区切りのテキストファイルで、1列目にゲノムビン、2列目に対応するメタゲノムを書く。チュートリアルでは提供されているマッピングファイルbin2meta.tsv（link）を指定している。

出力

f:id:kazumaxneo:20200531113142p:plain

f:id:kazumaxneo:20200531113138p:plain

f:id:kazumaxneo:20200531113136p:plain

実際のラン例

手持ちのゲノムを分析する。ここではVibrio属のゲノムを使用する。まずprokkaでアノテーションをつけてprotein.faaを出力した。それからPhyloPhlAnで系統推定する。

１、コアタンパク質のデータベースを準備する。

phylophlan_setup_database -g s__Vibrio_cholerae -o ./ \
 --verbose 2>&1 | tee logs/phylophlan_setup_database.log

マーカー遺伝子のディレクトリs__Vibrio_cholerae/と関連ファイルが出力される。

”-g”で指定する分類群は種レベルのみ可能。種名は大文字、小文字も含めて判別されるのでタイプする時は注意する。

２、次にconfigを作成する。

#configファイル作成
#protein
phylophlan_write_config_file \
 -o isolates_config.cfg \
 -d a \
 --db_aa diamond \
 --map_aa diamond \
 --map_dna diamond \
 --msa mafft \
 --trim trimal \
 --tree1 fasttree \
 --tree2 raxml

３、phylophlanのラン。105個のprotein.faaがあって、その全てに存在するマーカー遺伝子だけを使用する。s__Vibrio_choleraeは1で作成したs__Vibrio_choleraeのディレクトリパス。

phylophlan \
 -i protein_dir \
 -o output_isolates \
 -d s__Vibrio_cholerae \
 --trim greedy \
 --not_variant_threshold 0.99 \
 --remove_fragmentary_entries \
 --fragmentary_threshold 0.67 \
 --min_num_entries 105 \
 -t a \
 -f isolates_config.cfg \
 --diversity low \
 --nproc 40 \
 --verbose 2>&1 | tee logs/phylophlan__output_isolates.log

--min_num_entries The minimum number of entries to be present for each
of the markers in the database (default: 4)
--force_nucleotides If specified force PhyloPhlAn to use nucleotide
sequences for the phylogenetic analysis, even in the case of a database of amino acids (default: False)

20211110

データベースのダウンロードに失敗する場合

https://forum.biobakery.org/t/database-download-error/2770

引用

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0

Francesco Asnicar, Andrew Maltez Thomas, Francesco Beghini, Claudia Mengoni, Serena Manara, Paolo Manghi, Qiyun Zhu, Mattia Bolzan, Fabio Cumbo, Uyen May, Jon G. Sanders, Moreno Zolfo, Evguenia Kopylova, Edoardo Pasolli, Rob Knight, Siavash Mirarab, Curtis Huttenhower & Nicola Segata
Nature Communications volume 11, Article number: 2500 (2020)

*１

3990xの計算機で64スレッド指定実行すると3日ほどかかった。