近縁な何百~何千のバクテリアの系統解析を行うGubbins

2022 1/26 インストール手順変更

2024/04/08 追記

ハイスループット第二世代のDNAシーケンス技術が導入されて以来、細菌集団の系統力学を推定するために使用されるデータセットのサイズが非常に大きくなってきている。多くの系統学的手法は数百の細菌ゲノムに拡張可能であるが、配列の水平転移のメカニズムが系統再構成に与える影響を緩和するために使用されてきた手法は、これらの新しいデータセットには対応できない。Gubbins (Genealogies Unbiased By recomBinations In Nucleotide Sequences)は、高い塩基置換密度を持つ遺伝子座を反復的に同定するアルゴリズムであり、同時にこれらの領域の外側にあると考えられる点突然変異に基づいて系統マップを構築する。シミュレーションにより、このアルゴリズムが細菌の短期進化の現実的なモデルの下で非常に正確な再構成を生成し、数百の細菌ゲノム配列のアラインメントをわずか数時間で実行できることが実証された。

追記

このアルゴリズムでは、まず入力アライメントから多型部位を検出し、RAxMLまたはFastTree 2を用いて最尤系統樹を生成する。いずれの場合も、置換の一般的な時間可逆(GTR)モデルを用い、部位間の割合の不均一性を考慮するためにガンマ分布を用いる。次に、多型部位のアラインメント、FastMLを使用した樹形トポロジー、枝の長さに基づいて、塩基置換の祖先配列の共同再構築を行い、この場合もGTR置換モデルを使用する。この処理により、系統の各ブランチ上で発生した塩基置換と、ブランチ上で同定された塩基の数（すなわち、ブランチ上で未確定であった塩基を除く）でそれらの間の距離の両方が定義される。
帰無仮説として塩基置換がゲノム全体にわたってランダムに点突然変異によって起こると仮定し、多様性の高い領域；突然変異の「ホットスポット」を検出する。それは長い枝の期間での、relaxed selectionまたは正の選択下にある領域を示している可能性がある（系統解析にノイズを導入する可能性が高い）。
そのような配列を欠損データで置き換えることによって、推定される組換え事象をマスキングした後、2番目の最尤系統樹と配列再構築を、それぞれ指定された系統樹ソフトウェアとFastMLを用いて作成する。したがって、この2番目のツリーは、検出された組換えで発生した塩基置換の影響を受けず、より正確な祖先配列の再構築につながるはずである。アルゴリズムは次に、収束が最初に起こらない限り、あらかじめ指定された最大閾値（デフォルトでは5回に設定）に達するまで反復的に繰り返される。

インストール

macos Montereyにて、condaで環境を作って導入した。

Github

#bioconda
mamba create -n gubbins -y
conda activate gubbins 
mamba install -c bioconda -y gubbins

#homebrew
brew install gubbins

> gubbins

$ gubbins

Error: File '' does not exist

This program is not supposed to be directly run. Use run_gubbins.py instead

Usage: gubbins [options] alignment_file

Version: 3.1.6

-r detect recombinations mode

-t Newick tree file

-v VCF file

-f Original Multifasta file

-m Min SNPs for identifying a recombination block

-a Min window size

-b Max window size

-h Display this usage information.

> run_gubbins.py -h

$ run_gubbins.py -h

usage: run_gubbins.py [-h] [--prefix PREFIX] [--starting-tree STARTING_TREE] [--date DATE] [--use-time-stamp] [--version] [--threads THREADS] [--verbose] [--no-cleanup] [--pairwise]

[--filter-percentage FILTER_PERCENTAGE] [--remove-identical-sequences] [--tree-builder {raxml,raxmlng,iqtree,iqtree-fast,fasttree,hybrid,rapidnj}] [--tree-args TREE_ARGS]

[--first-tree-builder {raxml,raxmlng,iqtree,iqtree-fast,fasttree,rapidnj,star}] [--first-tree-args FIRST_TREE_ARGS] [--outgroup OUTGROUP] [--bootstrap BOOTSTRAP]

[--transfer-bootstrap] [--sh-test] [--seed SEED] [--model {JC,K2P,HKY,GTR,GTRGAMMA,GTRCAT}] [--first-model {JC,K2P,HKY,GTR,GTRGAMMA,GTRCAT}] [--best-model]

[--custom-model CUSTOM_MODEL] [--custom-first-model CUSTOM_FIRST_MODEL] [--model-fitter {raxml,raxmlng,iqtree,fasttree,None}] [--recon-model {JC,K2P,HKY,GTR,GTRGAMMA,GTRCAT}]

[--custom-recon-model CUSTOM_RECON_MODEL] [--recon-with-dates] [--model-fitter-args MODEL_FITTER_ARGS] [--mar] [--seq-recon {raxml,raxmlng,iqtree,None}]

[--seq-recon-args SEQ_RECON_ARGS] [--min-snps MIN_SNPS] [--min-window-size MIN_WINDOW_SIZE] [--max-window-size MAX_WINDOW_SIZE] [--p-value P_VALUE]

[--trimming-ratio TRIMMING_RATIO] [--extensive-search] [--iterations ITERATIONS] [--converge-method {weighted_robinson_foulds,robinson_foulds,recombination}] [--resume RESUME]

alignment_filename

Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome

sequences using Gubbins". Nucleic Acids Res. 2015 Feb 18;43(3):e15. doi: 10.1093/nar/gku1196.

optional arguments:

-h, --help show this help message and exit

Input and output options:

alignment_filename Multifasta alignment file

--prefix PREFIX, -p PREFIX

Add a prefix to the final output filenames (default: None)

--starting-tree STARTING_TREE, -s STARTING_TREE

Starting tree (default: None)

--date DATE, -D DATE Two-column text file in which the second column is the date of isolation in YYYY,YYYY-MM or YYYY-MM-DD format (default: None)

--use-time-stamp, -u Use a time stamp in file names (default: False)

--version show program's version number and exit

--threads THREADS, -c THREADS

Number of threads to use for parallelisation (default: 1)

--verbose, -v Turn on debugging (default: False)

--no-cleanup, -n Do not cleanup intermediate files (default: False)

Data processing options:

--pairwise Compare two sequences (without using a tree) (default: False)

--filter-percentage FILTER_PERCENTAGE, -f FILTER_PERCENTAGE

Filter out taxa with more than this percentage of gaps (default: 25.0)

--remove-identical-sequences, -d

Remove identical sequences (default: False)

Tree building options:

--tree-builder {raxml,raxmlng,iqtree,iqtree-fast,fasttree,hybrid,rapidnj}, -t {raxml,raxmlng,iqtree,iqtree-fast,fasttree,hybrid,rapidnj}

Application to use for tree building (default: raxml)

--tree-args TREE_ARGS

Quoted string of further arguments passed to tree building algorithm (start string with a space if there is a risk of being interpreted as a flag) (default: None)

--first-tree-builder {raxml,raxmlng,iqtree,iqtree-fast,fasttree,rapidnj,star}

Application to use for building the first tree (default: None)

--first-tree-args FIRST_TREE_ARGS

Further arguments passed to first tree building algorithm (default: None)

--outgroup OUTGROUP, -o OUTGROUP

Outgroup name for rerooting. A list of comma separated names can be used if they form a clade (default: None)

--bootstrap BOOTSTRAP, -# BOOTSTRAP

Number of bootstrap replicates to perform with final alignment (default: 0)

--transfer-bootstrap Calculate bootstrap supporting transfer bootstrap expectation (default: False)

--sh-test Perform an SH test of node likelihoods (default: False)

--seed SEED Set seed for reproducibility of analysis (default: None)

Nucleotide substitution model options:

--model {JC,K2P,HKY,GTR,GTRGAMMA,GTRCAT}, -M {JC,K2P,HKY,GTR,GTRGAMMA,GTRCAT}

Nucleotide substitution model (not all available for all tree building algorithms) (default: None)

--first-model {JC,K2P,HKY,GTR,GTRGAMMA,GTRCAT}

Nucleotide substitution model used for first tree (default: None)

--best-model Automatically select best substitution model using iqtree in later iterations (default: False)

--custom-model CUSTOM_MODEL

String corresponding to a substitution model for the selected tree building algorithm (default: None)

--custom-first-model CUSTOM_FIRST_MODEL

String corresponding to a substitution model for the selected tree building algorithm for the first iteration (default: None)

Ancestral sequence reconstruction options:

--model-fitter {raxml,raxmlng,iqtree,fasttree,None}, -F {raxml,raxmlng,iqtree,fasttree,None}

Application to use for model fitting for joint ancestral state reconstruction [if unspecified: same as tree builder if possible, else iqtree] (default: None)

--recon-model {JC,K2P,HKY,GTR,GTRGAMMA,GTRCAT}, -R {JC,K2P,HKY,GTR,GTRGAMMA,GTRCAT}

Nucleotide substitution model used for ancestral state reconstruction (not all available for all tree building algorithms) (default: GTRGAMMA)

--custom-recon-model CUSTOM_RECON_MODEL

String corresponding to a substitution model for the selected model fitting algorithm (default: None)

--recon-with-dates Use isolate date information in ancestral joint sequence reconstruction (default: False)

--model-fitter-args MODEL_FITTER_ARGS

Further arguments passed to model fitting algorithm (default: None)

--mar Use marginal, rather than joint, ancestral reconstruction (default: False)

--seq-recon {raxml,raxmlng,iqtree,None}

Algorithm to use for marginal reconstruction [if unspecified: same as tree builder if possible, else iqtree; requires --mar flag] (default: None)

--seq-recon-args SEQ_RECON_ARGS

Further arguments passed to sequence reconstruction algorithm (start string with a space if there is a risk of being interpreted as a flag) (default: None)

Recombination detection options:

--min-snps MIN_SNPS, -m MIN_SNPS

Min SNPs to identify a recombination block (default: 3)

--min-window-size MIN_WINDOW_SIZE, -a MIN_WINDOW_SIZE

Minimum window size (default: 100)

--max-window-size MAX_WINDOW_SIZE, -b MAX_WINDOW_SIZE

Maximum window size (default: 10000)

--p-value P_VALUE Uncorrected p value used to identify recombinations (default: 0.05)

--trimming-ratio TRIMMING_RATIO

Ratio of log probabilities used to trim recombinations (default: 1.0)

--extensive-search Undertake slower, more thorough, search for recombination (default: False)

Algorithm start/stop options:

--iterations ITERATIONS, -i ITERATIONS

Maximum No. of iterations (default: 5)

--converge-method {weighted_robinson_foulds,robinson_foulds,recombination}, -z {weighted_robinson_foulds,robinson_foulds,recombination}

Criteria to use to know when to halt iterations (default: weighted_robinson_foulds)

--resume RESUME Intermediate tree from previous run (must include "iteration_X" in file name) (default: None)

ラン

ランするにはマルチプルアライメント結果のファイルが必要。塩基置換の空間分布を解析するために、全ゲノムレベルのアラインメントが要求される点に注意。マニュアルではゲノムのマルチプルアライメントのツール例としてSnippyが挙げられている（Snippyの使い方）。

論文中のMSAファイルがダウンロードできる。これを使う。ダウンロードしたalnファイルを指定して実行。初回の樹形推定のオプションと2回目以降の樹形推定のオプションを指定する。

run_gubbins.py --prefix PMEN1 --first-tree-builder rapidnj --first-model JC --tree-builder raxmlng --model GTR PMEN1.aln

--first-tree-builder {raxml ,raxmlng ,iqtree ,iqtree-fast ,fasttree ,rapidnj, star}
Application to use for building the first tree (default: None)
--first-model {JC, K2P, HKY, GTR, GTRGAMMA, GTRCAT}
Nucleotide substitution model used for first tree (default: None)
--tree-builder {raxml, raxmlng, iqtree, iqtree-fast, fasttree, hybrid, rapidnj}
Application to use for tree building (default: raxml)
--model {JC, K2P, HKY, GTR, GTRGAMMA, GTRCAT} Nucleotide substitution model (not all available for all tree building algorithms) (default: None)