2020-02-21

パンゲノム解析を行う roary

docker pan-genome 系統解析結果の視覚化 (visualization) bacteria 2015 Bioinformatics PAV (presence/absence variations)

2020 3/19 4/6 スクリプト修正、3/19 4/10 サンプル数が多い時のオプション追記、4/13 追記、5/11 リンク修正、5/25 わかりにくい文章を修正、roaryのランコマンド修正、インストール手順追記、5/27 コメント追加、6/7 ML法のコマンド追加,、proka並列ラン例追記、6/9 --format pdfオプション追記、9/9 インストール修正、わかりにくい説明を修正、11/4 step3の"--fomat pdf"を削除

2021 9/10 dockerのコマンド修正

2022/07/06 アノテーション並列化例のコマンド修正, 8/19 インストールコマンド修正, 10/3 prodigal追記, 10/7 iqtree追記

20231207 スクリプト追記, 2023/12/15 prokka 並列処理コマンド修正（locus tag追加）

2024/03/24 prokka |prodigalの--meta削除

　現在、典型的な原核生物のポピュレーションシーケンシング研究は、数百または数千の分離株で構成されている。これらのデータセットを調べることで、原核生物ゲノムの遺伝構造に関する詳細な洞察を得ることができる。ここではRoaryを紹介する。Roaryは、大規模なパンゲノムを迅速に構築し、コア遺伝子とアクセサリー遺伝子を特定するツールである。 Roaryは、結果の精度を損なうことなく、標準デスクトップ上で数千の原核生物サンプルのパンゲノムの構築を可能にする。単一のCPU Roaryを使用すると、13 GBのRAMを使用して4.5時間で1000の分離株からなるパンゲノムを生成でき、複数のプロセッサを使用してさらに高速化できる。RoaryはPerlで実装されており、http;//sanger-pathogens.github.io/RoaryのオープンソースGPLv3ライセンスの下で自由に利用できる。

manual

https://sanger-pathogens.github.io/Roary/

インストール

macos10.14のanaconda3.7環境でテストした。

=> condasでpython3.6の仮想環境を作ってテストするように修正した。

本体　Github

mamba create -n roary python=3.6 -y
conda activate roary

#bioconda (link)。fasttreeなども導入される（yamlファイル）。
mamba install -c bioconda -y roary 
#視覚化には他のライブラリも必要
mamba install seaborn matplotlib biopython -y

> roary -h

$ roary -h

Use of uninitialized value in require at /Users/kazu/anaconda3/envs/roary/lib/site_perl/5.26.2/darwin-thread-multi-2level/Encode.pm line 61.

Please cite Roary if you use any of the results it produces:

Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill,

"Roary: Rapid large-scale prokaryote pan genome analysis", Bioinformatics, 2015 Nov 15;31(22):3691-3693

doi: http://doi.org/10.1093/bioinformatics/btv421

Pubmed: 26198102

Usage: roary [options] *.gff

Options: -p INT number of threads [1]

-o STR clusters output filename [clustered_proteins]

-f STR output directory [.]

-e create a multiFASTA alignment of core genes using PRANK

-n fast core gene alignment with MAFFT, use with -e

-i minimum percentage identity for blastp [95]

-cd FLOAT percentage of isolates a gene must be in to be core [99]

-qc generate QC report with Kraken

-k STR path to Kraken database for QC, use with -qc

-a check dependancies and print versions

-b STR blastp executable [blastp]

-c STR mcl executable [mcl]

-d STR mcxdeblast executable [mcxdeblast]

-g INT maximum number of clusters [50000]

-m STR makeblastdb executable [makeblastdb]

-r create R plots, requires R and ggplot2

-s dont split paralogs

-t INT translation table [11]

-ap allow paralogs in core alignment

-z dont delete intermediate files

-v verbose output to STDOUT

-w print version and exit

-y add gene inference information to spreadsheet, doesnt work with -e

-iv STR Change the MCL inflation value [1.5]

-h this help message

Example: Quickly generate a core gene alignment using 8 threads

roary -e --mafft -p 8 *.gff

For further info see: http://sanger-pathogens.github.io/Roary/

> query_pan_genome -h

# query_pan_genome -h

Usage: query_pan_genome [options] *.gff

Perform set operations on the pan genome to see the gene differences between groups of isolates.

Options: -g STR groups filename [clustered_proteins]

-a STR action (union/intersection/complement/gene_multifasta/difference) [union]

-c FLOAT percentage of isolates a gene must be in to be core [99]

-o STR output filename [pan_genome_results]

-n STR comma separated list of gene names for use with gene_multifasta action

-i STR comma separated list of filenames, comparison set one

-t STR comma separated list of filenames, comparison set two

-v verbose output to STDOUT

-h this help message

Examples:

Union of genes found in isolates

query_pan_genome -a union *.gff

Intersection of genes found in isolates (core genes)

query_pan_genome -a intersection *.gff

Complement of genes found in isolates (accessory genes)

query_pan_genome -a complement *.gff

Extract the sequence of each gene listed and create multi-FASTA files

query_pan_genome -a gene_multifasta -n gryA,mecA,abc *.gff

Gene differences between sets of isolates

query_pan_genome -a difference --input_set_one 1.gff,2.gff --input_set_two 3.gff,4.gff,5.gff

For further info see: http://sanger-pathogens.github.io/Roary/

> iterative_cdhit -h

# iterative_cdhit -h

Usage: iterative_cdhit [options]

Iteratively cluster a FASTA file of proteins with CD-hit, lower the threshold each time and extracting core genes (1 per isolate) to another file, and remove them from the input proteins file.

Required arguments:

-m STR input FASTA file of protein sequences [_combined_files]

Options: -p INT number of threads [1]

-n INT number of isolates [1]

-c STR cd-hit output filename [_clustered]

-f STR output filename for filtered sequences [_clustered_filtered.fa]

-l FLOAT lower bound percentage identity [98.0]

-u FLOAT upper bound percentage identity [99.0]

-s FLOAT step size for percentage identity [0.5]

-v verbose output to STDOUT

-h this help message

For further info see: http://sanger-pathogens.github.io/Roary/

docker imageも準備されている。

docker pull sangerpathogens/roary

ここではcondaでは導入されない結果を視覚化するスクリプトも使う。Marco Galardiniさんのコードでroaryレポジトリのcontributor/に含まれている。

依存

python (2 or 3)
Biopython
numpy
pandas
matplotlib
seaborn

#足りないものだけ導入。ここでは２つ
conda install -c bioconda -y biopython seaborn

#roaryレポジトリをとってくる
git clone https://github.com/sanger-pathogens/Roary.git

> python Roary/contrib/roary_plots/roary_plots.py

$ python Roary/contrib/roary_plots/roary_plots.py

usage: roary_plots.py [-h] [--labels] [--format {png,tiff,pdf,svg}] [-N SKIPPED_COLUMNS] [--version] tree spreadsheet

roary_plots.py: error: the following arguments are required: tree, spreadsheet

もう１つ、こちらはTorsten Seemann さんが書かれたコード。同じくcontributor/に含まれている。

> perl Roary/contrib/roary2svg/roary2svg.pl

$ perl Roary/contrib/roary2svg/roary2svg.pl

Usage: Roary/contrib/roary2svg/roary2svg.pl [options] gene_presence_absence.csv > pan_genome.svg

--help This help.

--verbose! Verbose output (default '0').

--width=i Canvas width (default '1024').

--height=i Row height (default '20').

--taxacolumn=i Column in gpa.csv where taxa begin (default '14').

--colour=s Colour of core cells (default 'DimGray').

--sepcolour=s Colour of horizontal separators/borders (default 'LightGray').

--acconly! Only draw accessory (non-core) genes (default '0').

--border! Add outline border (default '0').

実行方法

準備 - アノテーションのGFFを用意する

ランするのに必要なのは、prokkaで自動アノテーションして得たGFF３ファイル。NCBIからダウンロードしたGFF3は、遺伝子アノテーションの後にゲノム配列が付いていないため、使用することはできない。GenBank（full）をダウンロードし、bp_genbank2gff3.pl スクリプト（link）でGFF3に変換して使用する。その他の注意点として、結果の解釈しやすさの関係で、それぞれのサンプルの gene IDs（locus tag）はユニークがあることが望ましい。prokkaを走らせる時に"--locustag xxx"をつけて実行する。

１、prokka_annotation.shを実行

prokkaを使い、カレントのfastaファイルの自動アノテーション付けを行う。1ずつループ処理する例と10並列処理する例。

#1 prokka導入（インストール済みならスキップ）
mamba create -n prokka -c bioconda prokka -y
conda activate prokka

#2 prokka実行
#ゲノムファイルを１つずつアノテーション

for file in `\find *fasta -maxdepth 1 -type f`; do
 genome=${file}
 folder=${file%.fasta}
 
 prokka $genome --outdir ${folder} --addgenes --cpus 12 --quiet
 #rename
 cp ${folder}/PROKKA*.gff GFF/${folder}.gff
done

#metagenome protein
mkdir proteins
for file in `\find *fasta -maxdepth 1 -type f`; do
 genome=${file}
 folder=${file%.fasta}
 
 prokka $genome --outdir ${folder} --addgenes --cpus 12 --metagenome --quiet
 #rename
 cp ${folder}/PROKKA*.faa proteins/${folder}.faa
done


#CPUとメモリがたくさん利用できるならGNU parallelで並列化してアノテーション
#10並列。それぞれ４スレッド。最初はdry-runをつけてコマンドチェック。問題なければ消してラン。出力のgffとfaaを別のディレクトリに保存するところまで自動化。
mkdir GFF #.gff保存dir
mkdir Protein #.faa保存dir
#20 parallel
ls *.fna | parallel -j 20 --dry-run 'prokka {} --outdir {.} --addgenes --cpus 1 --quiet --locustag {.} --prefix {.} && cp {.}/*gff GFF/ && cp {.}/*faa Protein/'

#必要ならlocustagもつける。locustagが長くならないように、使用するfastaファイル名はIDだけに短くしておく
mkdir GFF #.gff保存dir
mkdir protein #.faa保存dir
ls *.fna | parallel -j 20 --dry-run 'prokka {} --outdir {.} --addgenes --cpus 1 --locustag {.} --quiet --locustag {.} --prefix {.} && cp {.}/*faa protein/ && cp {.}/*gff GFF/'

#あるいはprodigalを使う (-metaはbinningしていない生のメタゲノムアセンブリに使う)
ls *.fna | parallel -j 20 --dry-run 'prodigal -i {} -f gff -o {.}.gff -a {.}.faa'

#locus tag名をゲノムID名にしておくと、後でタンパク質を全ファイルを連結したfaaから取り出す時などに捗る。

ラン

１、roaryの実行

conda activate roary
roary -e --mafft -p 8 -r -qc -cd 99 -f output_dir *.gff

#sanple数が数百あるならgroup_limitを増やす。さらにコア遺伝子を95%のゲノムにヒットに変更。（遠いペア間での全タンパク質比較結果を見て）identityは80%にする。
roary -e --mafft -p 40 -r -qc -cd 95 --group_limit 500000 -i 80 -f output_dir *.gff

#dockerだと
docker run --rm -itv $PWD:/data/ -w /data sangerpathogens/roary
> roary -e --mafft -p 8 -r -qc -cd 99 -f output_dir *.gff

-f output directory [.]
-p number of threads [1]
-cd percentage of isolates a gene must be in to be core [99]
-r create R plots, requires R and ggplot2
-qc generate QC report with Kraken
-i minimum percentage identity for blastp [95]

-rを指定しておくと追加レポートも出力される。

f:id:kazumaxneo:20200526111539p:plain

２、高速なFastTree (manual)の実行（下の順番で指定しないとエラーになった）。

#roaryの出力ディレクトリに移動する
cd output_dir/

#Fasttree
fasttree -gtr -nt core_gene_alignment.aln > tree.newick

FastTreeは近似なので最尤法にも関わらず非常に高速だが（approximately-maximum-likelihood phylogenetic trees）、通常は他の方法も試す。ただし巨大なタンパク質群を使っているためゲノムの数が増えると相当な時間がかかる。

IQtree

#iqtree. 100 bootstrap
iqtree -s core_gene_alignment.aln -B 100 --seqtype AA -m MEP -nt 30

#iqtree2 1000 bootstrap,30 threads
iqtree2 -s core_gene_alignment.aln -m MFP -bb 1000 -alrt 1000 -nt 30

-b Specify number of bootstrap replicates (recommended >=100). This will perform both bootstrap and analysis on original alignment and provide a consensus tree.
-bb Specify number of bootstrap replicates (>=1000).
-alrt Specify number of replicates (>=1000) to perform SH-like approximate likelihood ratio test (SH-aLRT) (Guindon et al., 2010).
-T NUM|AUTO No. cores/threads or AUTO-detect
-m (モデル選択)について
#-m TESTは配列に応じて（DNA、タンパク質）の標準のモデル選択を行う、またモデル決定のあとの系統推定まで自動で行う。ここでは、v1.6以降に導入された、TESTオプションの拡張モデルであるMFPを使用。
-alrtについて；ブートストラップ解析回数を指定する-bのかわりにより高速な-alrtを使用。-alrtなどのsingle branch testsは、すべてのブートストラップ解析よりも高速で、非常に大きなデータセット（>10,000 分類群など）の場合に推奨される。

iqtreeでは"-o"でアウトグループを指定できる。

３、視覚化1 - PDF出力

#roaryの出力ディレクトリに移動する
cd output_dir/

#上で取ってきたレポジトリのスクリプトを使う。まだ取ってきてないなら
git clone https://github.com/sanger-pathogens/Roary.git

#実行
python Roary/contrib/roary_plots/roary_plots.py \
tree.newick gene_presence_absence.csv --labels --format pdf

#roary_plots.pyスクリプトを少し修正する
#tips1 ラベルされる文字数の上限が10になっている。増やすには以下の行の数値を増やす
#label_func=lambda x: str(x)[:10]
#tips2 ラベルされる文字サイズは以下のコードの通り動的に設定されるが、tipが多いと被る。
#fsize = 12 - 0.1*roary.shape[1]
#if fsize < 7:
#    fsize = 7
#この3行を以下に置き換える。サイズ3にするなら
#fsize = 3

例；192のサルモネラゲノムの視覚化結果

f:id:kazumaxneo:20200217171822p:plain

f:id:kazumaxneo:20200218194901p:plain

"A piechart of the pan genome, breaking down the core, soft core, shell and cloud." githubより

f:id:kazumaxneo:20200218194857p:plain

このデータではコア遺伝子がかなり少ないが、コア遺伝子の割合が少ないということは、それだけ進化的に距離がある集団を比較していることを意味する（アセンブル精度が低いサンプルが多く混じっている可能性もある）。

summary_statistics.txtも確認する。

４、視覚化2

#roaryの出力ディレクトリに移動する
cd output_dir/

#上で取ってきたレポジトリのスクリプトを使う。まだ取ってきてないなら
git clone https://github.com/sanger-pathogens/Roary.git

#実行
perl Roary/contrib/roary2svg/roary2svg.pl gene_presence_absence.csv > pan_genome.svg

12ゲノムの結果

f:id:kazumaxneo:20200218194335p:plain

ⅡI, query_pan_genome

コア遺伝子（その種で共通して存在）とアクセサリ遺伝子（株間で共通して存在）、サンプルに特異的な遺伝子を、共通、交差、ユニークなどに分けてリスト出力する。使用するには前もって１のroaryを実行し、clustered_proteinsを用意しておく必要がある。また、gffファイルも必要なので、gffファイルをワイルドカード指定する。*gffかclustered_proteinsのいずれかについて、カレントへのシンボリックリンクを張っておく必要がある。

#roaryの出力ディレクトリに移動する
cd output_dir/

#全ゲノムで見つかる遺伝子（core gene）
query_pan_genome -a intersection -c 100 <path>/<to>/<your>/<GFF>/*.gff -o intersect_outout

#90%以上のゲノムで見つかる遺伝子
query_pan_genome -a intersection -c 90 <path>/<to>/<your>/<GFF>/*.gff -o intersect_outout

#２ゲノム以上(≥2)で共通の遺伝子
query_pan_genome -a union <path>/<to>/<your>/<GFF>/*.gff -o union_outout

#全ゲノムには共通していない遺伝子　(accessory genes）（１回しか見つからないという意味ではない）
query_pan_genome -a complement <path>/<to>/<your>/<GFF>/*.gff -o complement_outout


#おまけ　出力から遺伝子名だけ取り出す
cut -d ":" -f 1 output > gene
#さらに改行を除く
perl -pe 's/\n/,/' gene > list

-g groups filename [clustered_proteins]
-a action (union/intersection/complement/gene_multifasta/difference) [union]
-c percentage of isolates a gene must be in to be core [99]
-o output filename [pan_genome_results]
-n comma separated list of gene names for use with gene_multifasta action
-i comma separated list of filenames, comparison set one
-t comma separated list of filenames, comparison set two
-v verbose output to STDOUT

roaryと違って該当ファイルのみ出力される。

comp_outputの例

f:id:kazumaxneo:20200214142254p:plain

query_pan_genome -a complement を走らせて５つのゲノムアノテーションを比較したところ、３ゲノムにしか見つからない遺伝子lldRがあったとする。

f:id:kazumaxneo:20200219104429p:plain

この遺伝子配列を取り出す。

使用した全てのgffを対象に”query_pan_genome -a gene_multifasta ”

（-a gene_multifastaというランモード）を走らせれば簡単に取り出せる。

該当する遺伝子のタンパク質配列を抽出

query_pan_genome -a gene_multifasta -o output -n lldR *.gff

#複数同時も可能
query_pan_genome -a gene_multifasta-o output -n gryA,mecA,abc *.gff

出力されるtest_lldR.faを開く。

> cat test_lldR.fa

>genome1_00013

MKSATSAQRPYQEVGAMIRDLIIKTPYNPGERLPPEREIAEMLDVTRTVVREALIMLEIK

GLVEVRRGAGIYVLDNSGSQNTDSPDANVCNDAGPFELLQARQLLESNIAEFAALQATRE

DIVKMRQALQLEERELASSAPGSSESGDMQFHLAIAEATHNSMLVELFRQSWQWRENNPM

WIQLHSHLDDSLYRKEWLGDHKQILAALIKKDARAAKLAMWQHLENVKQRLLEFSNVDDI

YFDGYLFDSWPLDKVDA*

>genome2_00013

MKSATSAQRPYQEVGAMIRDLIIKTPYNPGERLPPEREIAEMLDVTRTVVREALIMLEIK

GLVEVRRGAGIYVLDNSGSQNTDSPDANVCNDAGPFELLQARQLLESNIAEFAALQATRE

DIVKMRQALQLEERELASSAPGSSESGDMQFHLAIAEATHNSMLVELFRQSWQWRENNPM

WIQLHSHLDDSLYRKEWLGDHKQILAALIKKDARAAKLAMWQHLENVKQRLLEFSNVDDI

YFDGYLFDSWPLDKVDA*

>genome3_00013

MKSATSAQRPYQEVGAMIRDLIIKTPYNPGERLPPEREIAEMLDVTRTVVREALIMLEIK

GLVEVRRGAGIYVLDNSGSQNTDSPDANVCNDAGPFELLQARQLLESNIAEFAALQATRE

DIVKMRQALQLEERELASSAPGSSESGDMQFHLAIAEATHNSMLVELFRQSWQWRENNPM

WIQLHSHLDDSLYRKEWLGDHKQILAALIKKDARAAKLAMWQHLENVKQRLLEFSNVDDI

YFDGYLFDSWPLDKVDA*

複数ゲノム間で異なる遺伝子を探す。

query_pan_genome -a difference --input_set_one 1.gff,2.gff --input_set_two 3.gff,4.gff,5.gff

こちらで紹介されているツールを使うと、sqliteのデータベースを作り、パンゲノム中の指定した遺伝子配列を抽出できる。例えばコア遺伝子rpoAに着目して、全ゲノムからのrpoA遺伝子配列を抽出する。

roaryの出力はpanX、Gubbins、scoary、FriPan、PanVizGeneratorなど様々なツールの入力に指定できる。詳細はroaryのHP参照（link）。

入力が500ゲノムくらいあっても計算は可能。30スレッドで5日ほどかかった（3990X使用）。

roaryは何百のゲノムでもスケールし、下流の様々なツールとも連携する素晴らしいツールですが、デフォルト設定ではではblastの95％で線を引くため、大腸菌やサルモネラゲノム間のように非常によく似たゲノム間は良いとしても、種間レベルまで距離が出ると感度が大きく下がります。その場合は感度を上げるか、他のパラメータも設定できる別のツールも走らせて結果を比較するなどして注意して使って下さい。もし感度が低いとなったらGET_HOMOLOGUESを試すことをお勧めします。

引用

Roary: rapid large-scale prokaryote pan genome analysis
Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T.G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill
Bioinformatics, Volume 31, Issue 22, 15 November 2015, Pages 3691–3693

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

パンゲノム解析を行う roary