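Following up on the PPanGGOLiN graph introduced in the previous post: PPanGGOLiN builds and manages its pangenome in the HDF5 file format. The .h5 file in the output directory is that file. According to the repository, it serves as a central repository that records the results of the pangenome analysis together with the associated parameters and the complete pangenome content. By pointing PPanGGOLiN at this file, you can add new genomes or metadata and then write out the MSA or graph files again*1. Here I go through the steps of adding metadata to a PPanGGOLiN pangenome analysis result and writing the graph file out again.
Installation
I set up an environment on Ubuntu 22.04 LTS and installed the latest release, v2.1.2.
Main repository: GitHub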
#bioconda (link)
#v2.1.2
mamba create -n ppanggolin-env212 python=3.12 -y
conda activate ppanggolin-env212
mamba install -c bioconda ppanggolin=2.1.2 -y
> ppanggolin metadata
usage: ppanggolin metadata [-h] [-p PANGENOME] [-m [METADATA]] [-s [SOURCE]] [-a [{families,genomes,contigs,genes,RGPs,spots,modules}]] [--omit] [--verbose {0,1,2}] [--log LOG] [-d] [-f] [--config CONFIG]
Required arguments:
All of the following arguments are required :
-p PANGENOME, --pangenome PANGENOME
The pangenome .h5 file
-m [METADATA], --metadata [METADATA]
Metadata in TSV file. See our github for more detail about format
-s [SOURCE], --source [SOURCE]
Name of the metadata source
-a [{families,genomes,contigs,genes,RGPs,spots,modules}], --assign [{families,genomes,contigs,genes,RGPs,spots,modules}]
Select to which pangenome element metadata will be assigned
Optional arguments:
--omit Allow to pass if a key in metadata is not find in pangenome
Common arguments:
-h, --help show this help message and exit
--verbose {0,1,2} Indicate verbose level (0 for warning and errors only, 1 for info, 2 for debug)
--log LOG log output file
-d, --disable_prog_bar
disables the progress bars
-f, --force Force writing in output directory and in pangenome output file.
--config CONFIG Specify command arguments through a YAML configuration file.
PPanGGOLiN (2.1.2) is an opensource bioinformatic tools, developed by the LABEGeM team, under CeCILL FREE SOFTWARE LICENSE AGREEMENT
For pangenome analyses, please cite:
Gautreau G et al. (2020) PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph.
PLOS Computational Biology 16(3): e1007732. https://doi.org/10.1371/journal.pcbi.1007732
For genomic islands and spots of insertion detection, please cite:
Bazin et al., panRGP: a pangenome-based method to predict genomic islands and explore their diversity,
Bioinformatics, Volume 36, Issue Supplement_2, December 2020, Pages i651–i658, https://doi.org/10.1093/bioinformatics/btaa792
For module prediction, please cite:
Bazin et al., panModule: detecting conserved modules in the variable regions of a pangenome graph.
biorxiv. https://doi.org/10.1101/2021.12.06.471380
> ppanggolin write_metadata
usage: ppanggolin write_metadata [-h] [-p PANGENOME] -o OUTPUT [--compress] [-e {modules,genomes,spots,genes,contigs,RGPs,families} [{modules,genomes,spots,genes,contigs,RGPs,families} ...]] [-s METADATA_SOURCES [METADATA_SOURCES ...]] [--verbose {0,1,2}] [--log LOG] [-d] [-f] [--config CONFIG]
Required arguments:
One of the following arguments is required :
-p PANGENOME, --pangenome PANGENOME
The pangenome .h5 file
-o OUTPUT, --output OUTPUT
Output directory where the file(s) will be written
Optional arguments:
--compress Compress the files in .gz
-e {modules,genomes,spots,genes,contigs,RGPs,families} [{modules,genomes,spots,genes,contigs,RGPs,families} ...], --pangenome_elements {modules,genomes,spots,genes,contigs,RGPs,families} [{modules,genomes,spots,genes,contigs,RGPs,families} ...]
Specify pangenome elements for which to write metadata. default is all element with metadata.
-s METADATA_SOURCES [METADATA_SOURCES ...], --metadata_sources METADATA_SOURCES [METADATA_SOURCES ...]
Which source of metadata should be written. By default all metadata sources are included.
Common arguments:
-h, --help show this help message and exit
--verbose {0,1,2} Indicate verbose level (0 for warning and errors only, 1 for info, 2 for debug)
--log LOG log output file
-d, --disable_prog_bar
disables the progress bars
-f, --force Force writing in output directory and in pangenome output file.
--config CONFIG Specify command arguments through a YAML configuration file.
> ppanggolin metrics
usage: ppanggolin metrics [-h] [-p PANGENOME] [--genome_fluidity] [--no_print_info] [--recompute_metrics] [--verbose {0,1,2}] [--log LOG] [-d] [-f] [--config CONFIG]
Required arguments:
Specify the required argument:
-p PANGENOME, --pangenome PANGENOME
Path to the pangenome .h5 file
Input file:
Choose one of the following arguments:
--genome_fluidity Compute the pangenome genomic fluidity.
Optional arguments:
Specify optional arguments with default values:
--no_print_info Suppress printing the metrics result. Metrics are saved in the pangenome and viewable using 'ppanggolin info'.
--recompute_metrics Force re-computation of metrics if already computed.
Common arguments:
-h, --help show this help message and exit
--verbose {0,1,2} Indicate verbose level (0 for warning and errors only, 1 for info, 2 for debug)
--log LOG log output file
-d, --disable_prog_bar
disables the progress bars
-f, --force Force writing in output directory and in pangenome output file.
--config CONFIG Specify command arguments through a YAML configuration file.
> ppanggolin info -h
usage: ppanggolin info [-h] -p PANGENOME [-a] [-c] [-s] [-m]
options:
-h, --help show this help message and exit
Required arguments:
Specify the following required argument:
-p PANGENOME, --pangenome PANGENOME
Path to the pangenome .h5 file
Information Display Options (default: all):
-a, --parameters Display the parameters used or computed for each step of pangenome generation
-c, --content Display detailed information about the pangenome's content
-s, --status Display information about the statuses of different elements in the pangenome, indicating what has been computed or not
-m, --metadata Display a summary of the metadata saved in the pangenome
> ppanggolin write_pangenome
usage: ppanggolin write_pangenome [-h] [-p PANGENOME] -o OUTPUT [--soft_core SOFT_CORE] [--dup_margin DUP_MARGIN] [--gexf] [--light_gexf] [--json] [--csv] [--Rtab] [--stats] [--partitions] [--families_tsv] [--regions] [--spots] [--borders] [--modules] [--spot_modules] [--compress] [-c CPU]
[--verbose {0,1,2}] [--log LOG] [-d] [-f] [--config CONFIG]
Required arguments:
One of the following arguments is required :
-p PANGENOME, --pangenome PANGENOME
The pangenome .h5 file
-o OUTPUT, --output OUTPUT
Output directory where the file(s) will be written
Optional arguments:
--soft_core SOFT_CORE
Soft core threshold to use
--dup_margin DUP_MARGIN
minimum ratio of genomes in which the family must have multiple genes for it to be considered 'duplicated'
--gexf write a gexf file with all the annotations and all the genes of each gene family
--light_gexf write a gexf file with the gene families and basic information about them
--json Writes the graph in a json file format
--csv csv file format as used by Roary, among others. The alternative gene ID will be the partition, if there is one
--Rtab tabular file for the gene binary presence absence matrix
--stats tsv files with some statistics for each each gene family
--partitions list of families belonging to each partition, with one file per partitions and one family per line
--families_tsv Write a tsv file providing the association between genes and gene families
--regions Writes the predicted RGP and descriptive metrics in 'plastic_regions.tsv'
--spots Write spot summary and a list of all RGP in each spot
--borders List all borders of each spot
--modules Write a tsv file listing functional modules and the families that belong to them
--spot_modules writes 2 files comparing the presence of modules within spots
--compress Compress the files in .gz
-c CPU, --cpu CPU Number of available cpus
Common arguments:
-h, --help show this help message and exit
--verbose {0,1,2} Indicate verbose level (0 for warning and errors only, 1 for info, 2 for debug)
--log LOG log output file
-d, --disable_prog_bar
disables the progress bars
-f, --force Force writing in output directory and in pangenome output file.
--config CONFIG Specify command arguments through a YAML configuration file.
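1. Preparing the genomes
I use the same three genomes as yesterday: mock genome 1, a short 10 kb genome; mock genome 2, which is mock genome 1 with one ORF deleted; and a third genome, which is mock genome 1 with a genomic region encoding two ORFs inserted near the 3' end. Visualizing the ORF synteny of the three genomes with clinker (introduced previously) gives the following.
#run clinker (using the gbk files from prodigal)
clinker *gbk -p output.html

The reference, mock genome 1, is placed in the middle. The same color indicates the same ORF. Genes present in all three genomes are core genes; the rest are accessory genes.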
2. Pangenome analysis
Run the pangenome analysis with the ppanggolin all command. Pass it a TSV file that lists each genome's name and the path to its FASTA file.
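The list file described above could be generated with a small loop like the one below. This is only a sketch: the genomes/ directory, the .fasta extension and the genome1 to genome3 names are placeholders I made up for illustration, and the two-column layout (name, tab, path) simply follows the description above.

```shell
# Hypothetical layout: one FASTA per genome under ./genomes (names are placeholders).
genome_dir="genomes"
mkdir -p "$genome_dir"

# Create tiny placeholder FASTA files only so this sketch runs on its own;
# with real data, skip this loop and point genome_dir at your assemblies.
for name in genome1 genome2 genome3; do
  [ -e "$genome_dir/$name.fasta" ] || printf '>%s_contig1\nATGC\n' "$name" > "$genome_dir/$name.fasta"
done

# Write one line per genome: <name><TAB><absolute path to its FASTA file>
: > list.tsv
for fasta in "$genome_dir"/*.fasta; do
  name=$(basename "$fasta" .fasta)
  printf '%s\t%s\n' "$name" "$PWD/$fasta" >> list.tsv
done

cat list.tsv
```

The genome name in the first column is what later has to match the identifiers used in a genomes-level metadata file, so it is worth keeping these names short and stable.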

#use 20 threads
ppanggolin all --fasta list.tsv -c 20 -o outdir
cp outdir/pangenome.h5 .
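Output

(The files are small in the screenshot, but the .h5 and JSON files become very large as the number of genomes grows. Watch your storage when working with thousands of genomes.)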
===============================================================================
Supplement
Running the ppanggolin info command with --metadata shows the metadata stored in the file.
ppanggolin info -p pangenome.h5 --metadata
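The result is null, which confirms that the HDF5 file for these three mock genomes contains no metadata. (According to the wiki, even a freshly created .h5 may contain metadata imported from the genome annotation files.)
The ppanggolin metrics subcommand prints the pangenome metrics.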
ppanggolin metrics -p pangenome.h5 --genome_fluidity
- --genome_fluidity Compute the pangenome genomic fluidity.

===============================================================================
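3. Adding and exporting metadata
https://ppanggolin.readthedocs.io/en/dev/user/metadata.html
Metadata is prepared as a TSV file. The format is flexible, but metadata can only be attached to one of the families, genomes, contigs, genes, RGPs, spots, or modules elements of the pangenome .h5, so use --assign to say which element the data belongs to. As far as I understand the documentation, the first line of the TSV file must also contain a column named after the element given to --assign (here "genomes"); the other columns in that first line (here "isolation") become the metadata keys.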
metadata.tsv

(Here the metadata is attached to the genomes element, but to use metadata on the graph it has to be attached to an element that forms the nodes, such as genes or families.)
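A file of this kind could be written by hand or with a few printf calls, as in the sketch below. The genome names and the soil/water values are hypothetical placeholders, not the data used in this post, and the header layout (one column named after the --assign target, then one column per metadata key) is my reading of the PPanGGOLiN documentation, not something verified here.

```shell
# Genome names and values are hypothetical; the names must match those used
# when the pangenome was built (the first column of the genome list).
# Header: a column named after the --assign target ("genomes"),
# followed by one column per metadata key ("isolation").
printf 'genomes\tisolation\n'  > metadata.tsv
printf 'genome1\tsoil\n'      >> metadata.tsv
printf 'genome2\twater\n'     >> metadata.tsv
printf 'genome3\tsoil\n'      >> metadata.tsv

# Sanity check: every row should have as many tab-separated fields as the header.
awk -F'\t' '
  NR == 1 { n = NF; next }
  NF != n { printf "line %d has %d fields, expected %d\n", NR, NF, n; bad = 1 }
  END     { exit bad }
' metadata.tsv && echo "metadata.tsv looks consistent"
```

Stray spaces or a missing tab are easy to miss in an editor, so a field-count check like this is a cheap thing to run before loading the file.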
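Load the prepared metadata into the .h5. This attaches the isolation information from the TSV file to the genomes element of the .h5. The -s option sets the source name under which the metadata is stored (a user-chosen name that acts as the storage key).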
ppanggolin metadata -p pangenome.h5 --metadata metadata.tsv -s isolation --assign genomes
- -p, --pangenome The pangenome .h5 file
- -m, --metadata Metadata in TSV file. See our github for more detail about format
- -s, --source Name of the metadata source
- -a {families,genomes,contigs,genes,RGPs,spots,modules}, --assign {families,genomes,contigs,genes,RGPs,spots,modules} Select to which pangenome element metadata will be assigned
The ppanggolin write_metadata command writes the embedded metadata out to files.
ppanggolin write_metadata -p pangenome.h5 --output outdir
Each set of metadata is written to a separate TSV file.
===============================================================================
Example
1. I loaded the metadata shown above into the "genomes" element of the .h5 under the source name "abc".
$ ppanggolin metadata -p pangenome.h5 --metadata metadata.tsv -s abc --assign genomes
2. Then I exported it to TSV.
> cat export1/genomes_metadata_from_abc.tsv
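For a quick overview of an exported TSV, the values in one column can be counted with awk. The sketch below is generic and assumes only a tab-separated file with a header line; count_values is a helper name I made up, and example_export.tsv is a hand-written stand-in, not real ppanggolin output, so check the actual column order of your export before choosing a column number.

```shell
# Count how often each value occurs in one column of a TSV with a header line.
# Usage: count_values FILE COLUMN_NUMBER   (hypothetical helper, not part of ppanggolin)
count_values() {
  awk -F'\t' -v col="$2" '
    NR > 1 { seen[$col]++ }
    END    { for (v in seen) printf "%s\t%d\n", v, seen[v] }
  ' "$1" | sort
}

# Hand-written stand-in for an exported file (hypothetical contents).
printf 'genomes\tisolation\ngenome1\tsoil\ngenome2\twater\ngenome3\tsoil\n' > example_export.tsv

# Count the values in the second column and keep the result in a file.
count_values example_export.tsv 2 > value_counts.tsv
cat value_counts.tsv
```

With a real export, the call would look like count_values export1/genomes_metadata_from_abc.tsv 2, adjusting the column number to wherever the key of interest sits.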

===============================================================================
4. Writing out the graph file
https://ppanggolin.readthedocs.io/en/dev/user/PangenomeAnalyses/pangenomeGraphOut.html
The graph can be written out again from the .h5 that now contains the metadata.
ppanggolin write_pangenome -p pangenome.h5 --light_gexf -o export_dir
- --gexf write a gexf file with all the annotations and all the genes of each gene family
- --light_gexf write a gexf file with the gene families and basic information about them
In the specified output directory, --gexf saves the regular pangenomeGraph.gexf and --light_gexf saves the lighter pangenomeGraph_light.gexf.
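One way to check which attributes ended up in a GEXF file is to list the attribute titles it declares. The sketch below relies only on the general GEXF convention of declaring attributes as <attribute ... title="..."> elements; list_gexf_attributes is a helper name I made up, and demo.gexf is a tiny hand-written stand-in, not real PPanGGOLiN output.

```shell
# Print the unique attribute titles declared in a GEXF file.
# (hypothetical helper; assumes attributes are declared as <attribute ... title="...">)
list_gexf_attributes() {
  grep -o '<attribute [^>]*title="[^"]*"' "$1" \
    | sed 's/.*title="\([^"]*\)".*/\1/' \
    | sort -u
}

# Tiny hand-written stand-in so the sketch runs without ppanggolin (not real output).
cat > demo.gexf <<'EOF'
<gexf><graph><attributes class="node" mode="static">
<attribute id="0" title="nb_genes" type="long" />
<attribute id="1" title="partition" type="string" />
</attributes></graph></gexf>
EOF

list_gexf_attributes demo.gexf > demo_attributes.txt
cat demo_attributes.txt
```

On a real run the call would be list_gexf_attributes export_dir/pangenomeGraph_light.gexf; if metadata was attached to genes or families, I would expect the corresponding keys to show up in this list, but I have not verified the exact naming.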
References
panModule: detecting conserved modules in the variable regions of a pangenome graph
Adelme Bazin, Claudine Medigue, David Vallenet, Alexandra Calteau
bioRxiv, Posted December 07, 2021.
Clinker & clustermap.js: Automatic generation of gene cluster comparison figures
Cameron L M Gilchrist, Yit-Heng Chooi
Bioinformatics, Published: 18 January 2021
*1 It appears that ViTables (link) can also be used to browse and edit the file in a GUI.