GET_PHYLOMARKERS - macでインフォマティクス

　ゲノム配列が公開データベースに大量に蓄積されたことにより、生物学研究の多くの分野でゲノムレベルの系統解析が盛んに行われるようになった。しかし、様々な進化や遺伝的過程により、多くの遺伝子座が系統樹の再構築には好ましくない特性を持っている。このような特性が検出されない場合、特に連結されたデータセットからツリーを推定する際に、誤った推定値や偏った推定値が得られる可能性がある。これらの問題に対処するために、GET_PHYLOMARKERSを開発した。これは、orthologous clustersや、GET_HOMOLOGUESで計算されたpan-genome matrix (PGM)から、ロバストなゲノム系統を推定するための高品質なマーカーを同定するパイプラインである。最初に、一連のフィルターを適用して、recombinant alignmentsや、異常なツリーを生成するもの、解決が不十分なツリーを除外する。マルチプルシークエンスアラインメントと最尤（ML）系統樹は、マルチコアコンピュータ上で並列に計算される。ML系統樹は、FastTreeまたはIQ-TREE(IQT)を使用して、DNAまたはタンパク質レベルでトップランクのアラインメントを連結したセットから推定される。さらに、PGMから parsimony phylogenies およびML phylogenies を推定することもできる。本研究では、RefSeqに登録されている170のStenotrophomonasゲノム配列と、ここで報告されたメキシコのS. maltophilia complex (Smc)分離株の10の新規完全ゲノムを解析することで、本ソフトウェアの実用性を示した。また、コアゲノム解析とPGM解析を組み合わせて、本属の分子系統学を再検討した。クラスタリングの善し悪しの統計を用いた教師なし学習法により、コア・ゲノム平均ヌクレオチド同一性（cgANIb）が95.9％のSmc内の20グループが同定され、コア・ゲノムおよびパン・ゲノムツリーで強く支持されているクレードと完全に一致した。さらに、16個の誤って分類されたRefSeqゲノム配列を同定し、そのうち14個をS. maltophiliaとラベル付けした。このことは、系統学や遺伝子分類学の研究に本ソフトウェアが幅広く利用できることを示している。コード，詳細なマニュアル，チュートリアルは，GNU GPLv3ライセンスの下，Linux/UNIXサーバで自由に利用することができる（https://github.com/vinuesa/get_phylomarkers）．また，GET_PHYLOMARKERS と GET_HOMOLOGUES をバンドルした docker イメージが https://hub.docker.com/r/csicunam/get_homologues/ で公開されており，あらゆるプラットフォームで簡単に実行することができる。

manual

https://vinuesa.github.io/get_phylomarkers/#get_phylomarkers-manual

tutorial

GET_PHYLOMARKERS tutorial

インストール

Github

Docker Hub

docker pull csicunam/get_homologues
docker run --rm -it csicunam/get_homologues:latest /bin/bash

> run_get_phylomarkers_pipeline.sh -h

run_get_phylomarkers_pipeline.sh v.2.2.9.1_30Jan2021 OPTIONS:

REQUIRED:

-R <integer> RUNMODE

1 select optimal markers for phylogenetics/phylogenomics (genomes from different species)

2 select optimal markers for population genetics (genomes from the same species)

-t <string> type of input sequences: DNA|PROT

OPTIONAL:

-h flag, print this short help notes

-H flag, print extensive Help and details about search modes and models

-A <string> Algorithm for tree searching: <F|I> [FastTree|IQ-TREE] [default:I]

-c <integer> NCBI codontable number (1-23) for pal2nal.pl to generate codon alignment [default:11]

-C flag to print codontables

-D flag to print debugging messages; please use if you encounter problems executing the code [default: 0]

-e <integer> select gene trees with at least (min. = 4) external branches [default: 4]

-f <string> GET_HOMOLOGUES cluster format <STD|EST> [default: STD]

-k <real> kde stringency (0.7-1.6 are reasonable values; less is more stringent) [default: 1.5]

-K flag to run molecular clock test on codon alignments [default: 0]

-l <integer> max. spr length (7-12 are reasonable values) [default: 10]

-m <real> min. average support value (0.7-0.8 are reasonable values) for trees to be selected [default: 0.7]

-M <string> base Model for clock test (use one of: GTR|TrN|HKY|K2P|F81); uses +G in all cases [default: GTR]

-n <integer> number of cores/threads to use [default: all cores]

-N <integer> number of IQ-TREE searches to run [only active with -T high] [default: 5]

-q <real> quantile (0.95|0.99) of Chi-square distribution for computing molec. clock p-value [default: 0.99]

-r <string> root method (midpoint|outgroup) [default: midpoint]

-s <integer> number of spr rounds (4-20 are reasonable values) for FastTree tree searching [default: 4]

-S <string> quoted 'comma-separated list' of base models to be evaluated by IQ-TREE

when estimating the species tree from the concatenated supermatrix (see -H for details).

If no -S is passed, then sinlge default models are used, as shown below

<'JC,F81,K2P,HKY,TrN,TNe,K3P,K81u,TPM2,TPM2u,TPM3,TPM3u,

TIM,TIMe,TIM2,TIM2e,TIM3,TIM3e,TVM,TVMe,SYM,GTR'> for DNA alignments [default: GTR]

<'BLOSUM62,cpREV,Dayhoff,DCMut,FLU,HIVb,HIVw,JTT,JTTDCMut,LG,

mtART,mtMAM,mtREV,mtZOA,Poisson,PMB,rtREV,VT,WAG'> for PROT alignments [default: LG]

-T <string> tree search Thoroughness: high|medium|low|lowest (see -H for details) [default: medium]

-v flag, print version and exit

-V flag, print software versions

INVOCATION EXAMPLES:

1. default on DNA sequences (uses IQ-TREE evaluating a subset of models specified in the detailed help)

run_get_phylomarkers_pipeline.sh -R 1 -t DNA

2. thorough FastTree searching and molecular clock analysis on DNA sequences using 10 cores:

run_get_phylomarkers_pipeline.sh -R 1 -t DNA -A F -k 1.2 -m 0.7 -s 20 -l 12 -T high -K -M HKY -q 0.95 -n 10

3. FastTree searching on a huge protein dataset for fast inspection

run_get_phylomarkers_pipeline.sh -R 1 -A F -t PROT -m 0.6 -k 1.0 -T lowest

4. To run the pipeline on a remote server, we recommend using the nohup command upfront, as shown below:

nohup run_get_phylomarkers_pipeline.sh -R 1 -t DNA -S 'TNe,TVM,TVMe,GTR' -k 1.0 -m 0.75 -T high -N 5 &> /dev/null &

5. Run in population-genetics mode (generates a table with descritive statistics for DNA-polymorphisms

and the results of diverse neutrality test)

run_get_phylomarkers_pipeline.sh -R 2 -t DNA

NOTES

1: run from within the directory holding core gene clusters generated by get_homologues.pl -e or

compare_clusters.pl with -t no_genome_gbk files (core genome: all clusters with a single gene copy per genome)

2: If you encounter any problems, please run the script with the -D -V flags added at the end of the command line,

redirect STOUT to a file and send us the output, so that we can better diagnose the problem.

e.g. run_get_phylomarkers_pipeline.sh -R 1 -t DNA -k 1.0 -m 0.7 -s 8 -l 10 -T high -K -D &> ERROR.log

実行方法

get_homologues.pl -eで生成されたコアゲノム遺伝子クラスター、またはGET_HOMOLOGUESパッケージのcompare_clusters.pl -t number_of_genomesで生成されたコンセンサス(intersection OMCL,COGS,BDBH)クラスタが格納されているディレクトリ内で実行する。

run_get_phylomarkers_pipeline.sh -R 1 -t DNA

git clone https://github.com/vinuesa/get_phylomarkers.git

-R RUNMODE
1 select optimal markers for phylogenetics/phylogenomics (genomes from different species)
2 select optimal markers for population genetics (genomes from the same species)
-t type of input sequences: DNA|PROT

runmodesには、-R 1（系統学用）と-R 2（集団遺伝学用）の2種類がある。
パイプラインはDNAまたはタンパク質配列(-t DNA|PROT)の2つの分子タイプで実行できる。タンパク質は、（属レベル以上のような）より分岐したゲノム配列の解析を目的としている。
バージョン 2.0 以降、FastTree (FT) または IQ-TREE (IQT) の高速 ML ツリー探索アルゴリズムを使用している。-A [F | I] オプション [デフォルト: I] で制御する。
DNA fastaファイルからコドンアライメントを生成するには、.faaファイルと.fnaファイルの両方が必要。、GET_HOMOLOGUES パッケージの compare_clusters.pl を 2 回実行する必要があり、そのうち 1 回は -n フラグを使用する。
クラスターを評価するためには、最低でも4つの異なるゲノムを入力として使用する必要がある。

テストラン

１、テストデータを用意。

docker run -itv $PWD:/data --rm -it csicunam/get_homologues:latest /bin/bash
cd
mkdir get_homPhy
git clone https://github.com/vinuesa/get_phylomarkers.git
cp -r get_phylomarkers/test_sequences/ get_homPhy/
cd get_homPhy/test_sequences/

pIncAC/

f:id:kazumaxneo:20210319170011p:plain

これらのgenbankファイルを使う。

２、GET_HOMOLOGUESスイートに実装されている 3 つのクラスタリングアルゴリズムを実行する。

docker run -itv $PWD:/data --rm -it csicunam/get_homologues:latest /bin/bash
cd
mkdir get_homPhy
git clone https://github.com/vinuesa/get_phylomarkers.git
cp -r get_phylomarkers/test_sequences/ get_homPhy/
cd get_homPhy/test_sequences/

#Computing BDBH clusters (genbankファイルが12個置いてあるディレクトリを指定している)
get_homologues.pl -d pIncAC -t 12 -e -n 1
# => pIncAC_homologuesができる。

#COGtriangles algorithm
get_homologues.pl -d pIncAC -G -t 0

#OMCL algorithm
get_homologues.pl -d pIncAC/ -M -t 0

３、それから、compare_clusters.plを使って、コンセンサスコアゲノムのクラスタを計算する。
このスクリプトは2回実行する。1回目は -n フラグをつけて実行 (-nをつけるとコンセンサスクラスターをDNAレベルで計算し、*.fnaファイルを生成する)。2回目はフラグを付けずに実行し、タンパク質のクラスターを取得する。

cd pIncAC_homologues/
find . -type d

#1st run
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12 -n

#2nd run
compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBH_e1_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o core_BCM -t 12

core_BCM/ #1st run

f:id:kazumaxneo:20210319165223p:plain

core_BCM/ #2nd run

f:id:kazumaxneo:20210319165547p:plain

venn_t12.pdf

f:id:kazumaxneo:20210319165750p:plain

12個の入力配列に対する31個の遺伝子のコンセンサスコアゲノムを描いたベン図。

４、COGtrianglesおよびOMCLクラスターのGET_HOMOLOGUESを使ったコンセンサスパンゲノムの計算。コンセンサスパンゲノムの計算には、参照配列を必要としないCOGtrianglesおよびOMCLアルゴリズムのみを使用する。GET_HOMOLOGUESに-t 0オプションを付けて実行すると、シングルトンからマルチジーン・ファミリーまで、すべての占有サイズのクラスターを考慮することができる。このスクリプトは2回実行する。１回目はDNAレベルのコンセスクラスター、*.fnaファイル、およびパンゲノムマトリックスを計算する。２回目はタンパク質のクラスタ（*.faaファイル）を取得するために-n -m フラグを指定しないで実行する。

compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM -n -m

compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOG_e0_,\ ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCL_e0_ -o pan_CM

（BDBHクラスターを格納しているディレクトリは除外している。これは、 BDBHクラスターが常に参照配列を含んでおり、適切なパンゲノム・マトリックスの計算には適していないから）。

pan_CM #2nd run

f:id:kazumaxneo:20210319170848p:plain

venn_t0.pdf

f:id:kazumaxneo:20210319170828p:plain

12個の入力配列に対するコンセンサスパンゲノム・クラスタを描いたベン図には373のクラスタが含まれている。対応するパンゲノム・マトリックスはtest_sequences/pan_genome/ディレクトリに保存されている。

５、GET_PHYLOMARKERSをランし、最適なコアゲノム系統マップを探索する（ここではpIncA/Cプラスミド）。

cd core_BCM
run_get_phylomarkers_pipeline.sh -R 1 -t DNA

まず、入力された.*fnaファイルと.faaファイルには、同じ数の配列と1つのeackゲノムインスタンスが含まれていることが報告される。次に、31のコンセンサス遺伝子座について、複数の配列のアラインメントを並行して実行し、コドンアラインメントを生成する。phi(w)テストは、パイプラインの最初のフィルターで、組換え配列を識別する。いくつかの遺伝子座では、テストを実行するのに十分な多型がないため、警告メッセージが表示される。テストセットには組換え遺伝子座の証拠はない。

IQ-TREE遺伝子系統樹の検索が並行して行われる。kdetreesテストでは、「外れ値の系統」を特定し、5枝以下の「ツリー」を生成する5つの遺伝子座とともに1つの外れ値を特定した。
残りの25個の遺伝子座について、遺伝子ツリーの平均サポート値を計算し、結果が良くないものを破棄する。このデータセットでは10個。前述のフィルターを通過した15個のアラインメントからスーパーマトリックスが生成される。IQ-TREE は、超行列に最適なモデルとツリーを探し、見つかった最適なモデルと lnL スコアを報告する。スクリプトは最後に、異なる作業ディレクトリでいくつかのクリーンアップを行う。

引用

GET_PHYLOMARKERS, a Software Package to Select Optimal Orthologous Clusters for Phylogenomics and Inferring Pan-Genome Phylogenies, Used for a Critical Geno-Taxonomic Revision of the Genus Stenotrophomonas

Pablo Vinuesa, Luz E. Ochoa-Sánchez, Bruno Contreras-Moreira

Front. Microbiol., 01 May 2018