2021-09-08

大規模な微生物の比較ゲノミクスのためのモジュラー式のツール PanACoTA

2021 9/8 修正

2021 9/9 誤字修正

2021 9/10 prokkaのバージョンによるエラー修正 (依存するライブラリの関係でpython3.7の環境に導入するように修正した), --prodigalのオプション消去

2021 10/15 docker imageのコマンド追記

2021 10/18 annotate追記

2021 10/28 pangenomeコマンド追記

　低コストのシークエンシングと数十万のゲノムが利用可能なことから、比較ゲノムは多くの微生物学者、遺伝学者、進化生物学者の基本的なツールキットとなっている。現在、多くの細菌種においてGenBank RefSeqリファレンスデータベースで100以上のゲノムが公開されており、中には1万以上のゲノムが公開されているものもある。この傾向は、シーケンシングにかかるコストの低下、ロングロード技術の利用可能性、診断や疫学のためのクリニックでの全ゲノムシーケンシングの利用などが進むにつれ、さらに増加すると考えられる。その結果、利用可能なデータを利用しようとする研究者は、解析すべき非常に大量のデータに直面している。比較ゲノム解析は、過去20年の間に細菌ゲノムの組織と進化の理解に重要な貢献をしてきた。比較ゲノムは疫学研究の標準的なツールとなっており、一連の菌株に共通する遺伝子（コアゲノムまたはpersistentゲノム(*１　下に補足説明)）を解析することで、関心のあるクローンの拡大を追跡する際に比類のない精度が得られる。診療での日常的なシークエンシングの使用には、単一の種から数千、近い将来には数百万のゲノムを検索するための迅速で信頼性の高い解析ツールがさらに必要になるだろう。集団遺伝学もまた、このような豊富なデータの恩恵を受けている。なぜなら、遺伝的に獲得した突然変異の起源と運命を詳細に追跡して、それが適応過程や突然変異過程の何を明らかにしているかを理解できるからである。最後に、ゲノムワイドな関連性研究は、一塩基多型や遺伝子レパートリーの変異を説明するために、最近、細菌遺伝学にも適用されている。これらの研究は、生物学者が関心のある表現型の遺伝的基盤を特定するのに役立つことが期待されている。細菌ゲノムにおける高い遺伝的連結性を考えると、これらの研究では、小さな影響を検出するために非常に大規模なデータセットを必要とする場合がある。より具体的には、reverse vaccinologyもまた、これらのパンゲノミクス手法の注目すべき応用であり、特定のクレードのコア表面露出タンパク質の中から新規の潜在的な抗原を同定するためのものである。

　大規模なゲノムデータセットの利用可能性は、研究者、特にバイオインフォマティクスの広範なトレーニングを受けていない研究者に大きな負担をかけている。最近では、類似度の高い配列に対して優れた想起率で配列の類似性を迅速に検索するためのツールが数多く開発されている。

　また、他のツールでは、配列類似度の高いファミリーに大量の配列を迅速にクラスタリングしたり、ゲノムのセットに共通するファミリーを取得したり、それらをアラインメントしたり、系統樹を作成したりする方法も提供されており、これは比較ゲノムの礎となっている。最近では、細菌のパンゲノムを計算するためのこれらのツールのいくつかを含む多くのプログラムが発表されている。これらのプログラムの多くは、非常に高速なプログラムを使用して、ファミリーのアラインメントやクラスタを計算する。いくつかは、DIAMOND、USEARCHおよびCD-HITのように、非常に高速であるために精度を犠牲にすることが知られているツールを使用している。後者は、現在パンゲノムを計算するための最もポピュラーなツールであるRoaryや、原核生物ゲノムの誤った自動アノテーションの影響を軽減することを目的とした非常に最近のツールであるPanarooなどによって使用されている。BPGAもまた、USEARCHやCD-HITを用いてタンパク質をクラスタリングし、いくつかの下流の解析を提供している。また、優れたグラフィカルインターフェースを持つPanXは、DIAMONDを用いて遺伝子間の類似性を検索する。

　さらに最近では、SonicParanoidはパンゲノムを構築するために高効率で正確なプログラムmmseqs2の使用を導入し、PPanGGOLiNは同じツールを使用してパンゲノムファミリーを頻度の観点から統計的に分類する方法を提供した。最近のプログラムの中には、PPanGGOLiNやPanarooのように、パンゲノムをさらに洗練させるためにグラフベースのアプローチを使用しているものもある。これに関しては、両ツールによる319個のKlebsiella pneumoniaeゲノムのデータセットの解析で、非常に類似した結果が得られている[ref.16]。PIRATEのような、遠いゲノム間のオルソログをクラスタリングするためのツールも開発された。しかし、これらのプログラムはすべて、ダウンロード、品質管理、アラインメント、系統推論など、比較ゲノムに不可欠な初期ステップと最終ステップの一部またはすべてを欠いている。これがPanACoTA（PANgenome with Annotations, COre identification, Tree and corresponding Alignments）の開発に拍車をかけた。公開されている膨大な量のゲノム情報を活用するためには、6つの主要な操作ブロックが必要である。(1) 特定のクレードのゲノムを自動的に収集する。コンティグの数が多すぎるドラフトゲノムを避けるために、ある程度の品質管理が必要である。また、計算コストや擬似複製によるバイアスを最小限に抑えるために、ゲノムが冗長になりすぎていないかどうかをチェックしておくと便利なことが多い。一方で、細菌種（または関連性のある分類学的組織）の観点から誤分類されたゲノムを排除するために、ゲノムがあまりにも無関係ではないことをチェックすることが重要である。(2) 先験的に統一された命名法とアノテーションを定義しておく。(3) ゲノムアセンブリ中の各遺伝子ファミリーの有無のパターンの行列であるパンゲノムを、正確、簡便、高速な方法で作成する。(4) パンゲノムを用いて、コア遺伝子やpersistentな遺伝子の集合を同定する。(5) コアまたはpersistentなゲノムの遺伝子ファミリーの複数のアラインメントを作成する。(6) 最後に、コア/persistent遺伝子のセットの系統を、合理的に正確に迅速に作成する。これらの4つのデータ、パンゲノム、コアゲノム、アラインメント、系統樹は、ほとんどの微生物比較ゲノム研究の基礎となっている。このプロセスの最後には、研究者は、関心のある問題に特化した、より詳細な解析を行うことができ、多くの場合、分類群の追加/除外、配列類似度の限界の変更、アラインメントの精度の向上、または異なる方法を用いた系統樹の再構築などの変更につながる。このような再定義は、パイプラインがモジュール化されており、プロセスのいくつかの重要なポイントで解析を再開始できるようになっていれば、より効率的に行うことができる。

　微生物比較ゲノミクスのためのパイプラインが現在利用可能であることを考慮して、モジュール化されており、セットアップが容易で、最先端のツールを使用し、中間結果の再利用が簡単にできるパイプラインを構築した。目標は、ある分類群からすべてのゲノムをダウンロードし、基本的な比較ゲノムを自動的に動作させることができるパイプラインを提供することだった。パイプラインは完全に単一言語であるPython v3で構築されており、将来のメンテナンスを容易にし、不要な動作を制限するために最新の方法を使用している。PanACoTAはオープンソースのGNU AGPLライセンスで自由に利用できる。ここでは、Klebsiella pneumoniaeの225個の完全長ゲノムと3980個の完全長ゲノムまたはドラフトゲノムの2つのデータセットの解析を用いて、この方法を説明する。この種は、多くのゲノムが利用可能であり、非常にオープンなパンゲノムであるため、著者らの目的には興味深い種である。最初のデータセットは、通常は配列の品質が高い場合を示しており、2番目のデータセットは、配列の品質が低かったり、完全なアセンブリがないために遺伝子が断片化されていたりするような非常に大規模なデータセットに対して、この方法がどのようにスケールアップしていくかを示している。手法の詳細は「方法」のセクションに、その使用方法と、2つのデータセットの主要なオプションとの関係でどのように変化するかの説明は「結果」のセクションに詳しく書かれている。

　PanACoTaは、以下の6つのセクションで説明する6つのシーケンシャルモジュールで実装され、以前のモジュールを必要とせずにモジュールを使用できるように設計されている。これにより、任意のステップで解析を開始または停止し、他のパラメータで解析を再実行することができる（論文図1参照）。

Documentation

https://aperrin.pages.pasteur.fr/pipeline_annotation/html-doc/

インストール

condaでpython3.9python3.7の仮想環境を作ったあと、pipで導入した（ubuntu18.04）。他のツールはcondaで導入した（注意；アドホックなやり方に過ぎず、インストールできるかどうか再現性がないかもしれません）。

Github

#conda(link)　ここではcondaの代わりにmambaを使用
mamba create -n panacota python=3.7 -y
conda activate panacota
#PanACoTAのインストール
#mamba install -c bioconda panacota -y
#prokka,mmseqs2,mashも必要（prokkaのバージョンによるエラーが報告されているので注意#9）
mamba install -c bioconda prokka==1.14.5 -y
mamba install -c bioconda mmseqs2 -y
mamba install -c bioconda mash -y
#iqtreeはversion1系をインストールする
mamba install -c bioconda iqtree=1.6 -y

#fasttree
mamba install -c bioconda/label/cf201901 fasttree -y

#PanACoTAはpipでもインストール可能（condaとおなじで依存するツールは別に導入する）
pip install panacota

#dockerとsingularityにも対応している
#docker(link)
docker pull gempasteur/panacota

> PanACoTA -h

$ PanACoTA -h

usage: PanACoTA [-h] [-V]

{all,prepare,annotate,pangenome,corepers,align,tree} ...

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

positional arguments:

{all,prepare,annotate,pangenome,corepers,align,tree}

all Run all PanACoTA modules

prepare Prepare draft dataset

annotate Quality control and annotation of genomes

pangenome Generate a pan-genome of your dataset

corepers Compute a Core or Persistent genome of your dataset

align Align Core/Persistent families

tree Infer phylogenetic tree based on core/persistent genome

optional arguments:

-h, --help show this help message and exit

-V, --version Print the version number and exit

For more details, see PanACoTA documentation.

> PanACoTA prepare -h

# PanACoTA prepare

usage: PanACoTA [-h] [-V] {all,prepare,annotate,pangenome,corepers,align,tree} ...

PanACoTA: error: As you did not put the '--norefseq' nor the '-M' option, it means that you want to download refseq (or genbank) genomes. But you did not provide any information, so PanACoTA cannot guess which species you want to download. Specify NCBI_taxid (-t), and/or NCBI species taxid (-T) and/or NCBI_species (-g) to download, or add one of the 2 options (--norefseq or -M) if you want to skip the 'download step'.

(base) root@81a7aab948d6:/home# PanACoTA prepare -h

usage: PanACoTA prepare [-T NCBI_SPECIES_TAXID] [-t NCBI_TAXID] [-g NCBI_SPECIES_NAME] [-s {refseq,genbank}] [-l LEVELS] [-o OUTDIR] [--tmp TMP_DIR] [--cutn CUTN] [--l90 L90] [--nbcont NBCONT] [--min_dist MIN_DIST]

[--max_dist MAX_DIST] [-p PARALLEL] [--norefseq] [-d DB_DIR] [-M] [--info INFO_FILE] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Prepare draft dataset

General arguments:

-T NCBI_SPECIES_TAXID

Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. This will download all sequences of this species and all its sub-species.A comma-separated list of species

taxids can also be provided. (Ex: -T 573 for Klebsiella pneumoniae)

-t NCBI_TAXID Taxid to download. This can be the taxid of a sub-species, or of a specific strain. A taxid of a subspecies will download all strains in this subspecies EXCEPT the ones which have a specific

taxid.A comma-separated list of taxids can also be provided.Ex: '-t 72407' will download all 'general' K. pneumoniae subsp. pneumoniae strains, and '-t 1123862' will download the strain K.

pneumoniae subsp. pneumoniae Kp13 (not included in -t 72407, as it is a strain of the subspecies with a specific taxid).

-g NCBI_SPECIES_NAME Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")

-s {refseq,genbank} NCBI section to download: all genbank, or only refseq (default)

-l LEVELS, --assembly_level LEVELS

Assembly levels of genomes to download (default: all). Possible levels are: 'all', 'complete', 'chromosome', 'scaffold', 'contig'.You can also provide a comma-separated list of assembly levels.

For ex: 'complete,chromosome'

-o OUTDIR Give the path to the directory where you want to save the downloaded database. In the given directory, it will create a folder 'Database_init' containing all fasta files that will be sent to the

control procedure, as well as a folder 'refseq' with all original compressed files downloaded from refseq. By default, this output dir name is the ncbi_species name if given, or ncbi_species_taxid

or ncbi_taxid otherwise.

--tmp TMP_DIR Specify where the temporary files (sequence split by rows of 'N', sequence with new contig names etc.) must be saved. By default, it will be saved in your out_dir/tmp_files.

--cutn CUTN By default, each genome will be cut into new contigs when at least 5 'N' in a row are found in its sequence. If you don't want to cut genomes into new contigs when there are rows of 'N', put 0 to

this option. If you want to cut from a different number of 'N' in a row, put this value to this option.

--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.

--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.

--min_dist MIN_DIST By default, genomes whose distance to the reference is not between 1e-4 and 0.06 are discarded. You can specify your own lower limit (instead of 1e-4) with this option.

--max_dist MAX_DIST By default, genomes whose distance to the reference is not between 1e-4 and 0.06 are discarded. You can specify your own lower limit (instead of 0.06) with this option.

-p PARALLEL, --threads PARALLEL

Run 'N' downloads in parallel (default=1). Put 0 if you want to use all cores of your computer.

Alternatives:

--norefseq If you already downloaded refseq genomes and do not want to check them, add this option to directly go to the next steps:quality control (L90, number of contigs...) and mash filter. Don't forget

to specify the db_dir (-d option) where you already have your genome sequences.

-d DB_DIR If your already downloaded sequences are not in the default directory (outdir/Database_init), you can specify here the path to those fasta files.

-M, --only-mash Add this option if you already downloaded complete and refseq genomes, and ran quality control (you have, a file containing all genome names, as well as their genome size, number of contigs and

L90 values). It will then get information on genomes quality from this file, and run mash steps.

--info INFO_FILE If you already ran the quality control, specify from which file PanACoTA can read this information, in order to proceed to the mash step. This file must contain at least 4 columns, tab separated,

with the following headers: 'to_annotate', 'gsize', 'nb_conts', 'L90'. Any other column will be ignored.

Others:

-v, --verbose Increase verbosity in stdout/stderr and log files.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA annotate -h

# PanACoTA annotate -h

usage: PanACoTA annotate [-d DB_PATH] -r RES_PATH [-l LIST_FILE] [-n NAME] [-Q] [--info FROM_INFO] [--prodigal] [--small] [--l90 L90] [--nbcont NBCONT] [--cutn CUTN] [--date DATE] [--tmp TMPDIR] [--annot_dir ANNOTDIR]

[-F] [--threads THREADS] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Quality control and annotation of genomes

Required arguments:

-d DB_PATH Path to folder containing all multifasta genome files

-r RES_PATH Path to folder where output annotated genomes must be saved

Optional arguments:

-l LIST_FILE File containing the list of genome filenames to annotate (1 genome per line). Each genome must be in multi-fasta format. You can specify the species name (4 characters) you want to give to some

genome, as well as the download (or any other reason) date of your choice. Format 'genome_name :: name.date'. name and date are optional. See doc for more information on this file format. If you

want to run this module from 'prepare_module' results, use '--info' instead.

-n NAME Choose a name for your annotated genomes. This name should contain 4 alphanumeric characters. Generally, they correspond to the 2 first letters of genus, and 2 first letters of species, e.g. ESCO

for Escherichia Coli.

-Q Add this option if you want only to do quality control on your genomes (cut at 5N if asked, calculate L90 and number of contigs and plot their distributions). This allows you to check which

genomes would be annotated with the given parameters, and to modify those parameters if you want, before you launch the annotation and formatting steps.

--info FROM_INFO If you already ran the 'prepare' data module, or already calculated yourself the L90 and number of contigs for each genome, you can give this information, to go directly to annotation and

formatting steps. This file contains at least 4 columns, tab separated, with the following headers: 'to_annotate', 'gsize', 'nb_conts', 'L90'. Any other column will be ignored.

--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.

--small If you use Prodigal to annotate genomes, if you sequences are too small (less than 20000 characters), it cannot annotate them with the default options. Add this option to use 'meta' procedure.

--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.

--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.

this option. If you want to cut from a different number of 'N' occurrences, put this value to this option.

--date DATE Specify the date (MMYY) to give to your annotated genomes. By default, will give today's date. The only requirement on the given date is that it is 4 characters long. You can use letters if you

want. But the common way is to use 4 digits, corresponding to MMYY.

--tmp TMPDIR Specify where the temporary files (sequence split by rows of 'N', sequence with new contig names etc.) must be saved. By default, it will be saved in your result_directory/tmp_files.

--annot_dir ANNOTDIR Specify in which directory the prokka/prodigal output files (1 folder per genome, called <genome_name>-[prokka, Prodigal]Res) must be saved. By default, they are saved in the same directory as

your temporary files (see --tmp option to change it).

-F, --force Force run: Add this option if you want to (re)run annotation and formatting steps for all genomes even if their result folder (for annotation step) or files (for format step) already exist:

override existing results. Without this option, if there already are results in the given result folder, the program stops. If there are no results, but prokka/prodigal folder already exists,

prokka/prodigal won't run again, and the formating step will use the already existing folder if correct, or skip the genome if there are problems in prokka/prodigal folder.

--threads THREADS Specify how many threads can be used (default=1)

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA pangenome -h

# PanACoTA pangenome -h

usage: PanACoTA pangenome -l LSTINFO_FILE -n DATASET_NAME -d DBPATH -o OUTDIR [-i MIN_ID] [-f OUTFILE] [-c {0,1,2}] [-s SPEDIR] [--threads THREADS] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Generate a pan-genome of your dataset

Required arguments:

-l LSTINFO_FILE File containing the list of all genomes to include in the pan-genome, 1 genome per line: it can be the LSTINFO-<list_file>.lst file of 'PanACoTA annotate' module.Here, only the first column (genome

name without extension) will be used. All proteins of all these genomes will be concatenated in a file called <dataset_name>.All.prt. The column header must be 'gembase_name'.

-n DATASET_NAME Name of the dataset which will be clustered (for example, SAEN1234 for 1234 Salmonella enterica genomes). This name will be used to name the protein databank, a well as the pangenome files.

-d DBPATH Path to the folder containing all protein files corresponding to the genomes of the dataset (output directory 'Proteins' of 'PanACoTA annotate' module).

-o OUTDIR Output directory, where all results must be saved (including tmp folder)

Optional arguments:

-i MIN_ID Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.

-f OUTFILE Use this option if you want to give the name of the pangenome output file (without path). Otherwise, by default, it is called PanGenome-

mmseq_<given_dataset_name>.All.prt_<information_on_parameters>.lst

-c {0,1,2} Choose the clustering mode: 0 for 'set cover', 1 for 'single-linkage', 2 for 'CD-Hit'. Default is 'single-linkage' (1)

-s SPEDIR use this option if you want to save the concatenated protein databank in another directory than the one containing all individual protein files ('Proteins' folder).

--threads THREADS add this option if you want to parallelize on several threads. Indicate on how many threads you want to parallelize. By default, it uses 1 thread. Put 0 if you want to use all threads of your

computer.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA corepers -h

# PanACoTA corepers -h

usage: PanACoTA corepers -p PANGENOME -o OUTPUTDIR [-t TOL] [-M] [-X] [-F] [-l LSTINFO_FILE] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Compute a Core or Persistent genome of your dataset

Required arguments:

-p PANGENOME PanGenome file (1 line per family, first column is fam number)

-o OUTPUTDIR Specify the output directory for your core/persistent genome.

Optional arguments:

-t TOL, --tol TOL min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the minimum number of genomes

will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.

-M Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to allow exactly 1 member in

'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not compatible.

-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is not compatible with -M

(which is allowing multigenic families: having several members in any number of genomes).

-F When you specify the '-tol' option, with a number lower than 1, you can add this option to use floor('tol'*N) as a minimum number of genomes instead of ceil('tol'*N) which is the default behavior.

-l LSTINFO_FILE By default, the core/persistent genome will include all genomes found in the given pangenome file. If you want to do a core/persistent genome on a subset of those genomes, give a file with this list

of genomes. This file must have 1 line per genome, only the first column (genome name without extension) will be used.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA align -h

# PanACoTA align -h

usage: PanACoTA align -c COREPERS -l LIST_GENOMES -n DATASET_NAME -d DBPATH -o OUTDIR [--threads THREADS] [-F] [-P] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Align Core/Persistent families

Required arguments:

-c COREPERS Core or persistent genome whose families must be aligned.

-l LIST_GENOMES File containing the list of all the genomes you want to align from their core/persistent families. 1 genome per line: it can be the LSTINFO-<list_file>.lst file of 'PanACoTA annotate' module. Here,

only the first column (genome name without extension) will be used. The final alignment file will contain 1 alignment per genome in this file.

-n DATASET_NAME Name of the dataset which will be aligned (for example, SAEN1234 for 1234 Salmonella enterica genomes). This name will be used to name the alignment file.

-d DBPATH Path to the folder containing the directories 'Proteins' and 'Genes', created by 'PanACoTA annotate'.

-o OUTDIR Output directory, where all results must be saved

Optional arguments:

computer.

-F, --force Force run: Add this option if you want to redo all alignments for all families, even if their result file already exists. Without this option, if an alignment file already exists, it will be used for

the next step. If you want to redo only a given alignment, just delete its file, without using this option.

-P Add this option if you also need the aa alignment of the concatenation of all persistent proteins. By default, PanACoTA only gives the nucleic alignment.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA tree -h

# PanACoTA tree -h

usage: PanACoTA tree -a ALIGNMENT -o OUTDIR [-s {fasttree,fastme,quicktree,iqtree,iqtree2}] [-b BOOT] [--threads THREADS] [-m MODEL] [-B] [--mem MEMORY] [-fast] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Infer phylogenetic tree based on core/persistent genome

Required arguments:

-a ALIGNMENT Alignment file in multi-fasta: each header will be a leaf of the inferred tree.

-o OUTDIR Directory where tree results will be saved.

Choose soft to use (default is IQtree2):

-s {fasttree,fastme,quicktree,iqtree,iqtree2}, --soft {fasttree,fastme,quicktree,iqtree,iqtree2}

Choose with which software you want to infer the phylogenetic tree. Default is IQtree2 (versions 2.x of IQtree). If you want version 1.x of IQtree, use '-s iqtree'

Optional arguments:

-b BOOT, --boot BOOT Indicate how many bootstraps you want to compute. By default, no bootstrap is calculated. For IQtree, it will use ultrafast bootstrap (>=1000).

computer. Not available with quicktree.

-m MODEL, --model MODEL

Choose your DNA substitution model. Default for FastTree and IQtree: GTR. Default for FastME: TN93. For FastTree, the choices are 'GTR' and 'JC'. For FastME, choices are: 'p-distance' (or 'p'),

'RY symmetric' (or 'Y'), 'RY' (or 'R'), 'JC69' (or 'J'), 'K2P' (or 'K'), 'F81' (or '1'), 'F84' (or '4'), 'TN93' (or 'T'), 'LogDet' (or 'L'). For IQtree, choices are HKY, JC, F81, K2P, K3P, K81uf,

TNef, TIM, TIMef, TVM, TVMef, SYM, GTR, TEST. TEST to run standard model selection.

-B Add this option if you want to write all bootstrap pseudo-trees. Only available with FastME and IQtree.

--mem MEMORY Maximal RAM usage in GB | MB. Only available with iqtree.

-fast Use -fast option with iqtree.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA all -h

# PanACoTA all -h

usage: PanACoTA all [-c CONFIGFILE] -o OUTDIR [--threads THREADS] [-T NCBI_SPECIES_TAXID] [-s NCBI_SPECIES] [-l LEVELS] [--cutn CUTN] [--l90 L90] [--nbcont NBCONT] [--prodigal] -n NAME [-i MIN_ID] [--tol TOL] [-Mu] [-X]

[--soft {fasttree,fastme,quicktree,iqtree,iqtree2}] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Run all PanACoTA modules

General arguments:

-c CONFIGFILE Path to your configuration file, defining values of parameters.

-o OUTDIR Path to your output folder, where all results from all 6 modules will be saved.

--threads THREADS Specify how many threads can be used (default=1)

'prepare' module arguments:

-T NCBI_SPECIES_TAXID

Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. A comma-separated list of taxid can also be provided.

-s NCBI_SPECIES Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")

-l LEVELS, --assembly_level LEVELS

Assembly levels of genomes to download (default: all). Possible levels are: 'all', 'complete', 'chromosome', 'scaffold', 'contig'.You can also provide a comma-separated list of assembly levels.

For ex: 'complete,chromosome'

Common arguments to 'prepare' and 'annotate' modules:

this option. If you want to cut from a different number of 'N' in a row, put this value to this option.

--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.

--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.

'annotate' module arguments:

--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.

for Escherichia Coli.

'pangenome' module arguments:

-i MIN_ID Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.

'corepers' module arguments:

--tol TOL min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the minimum number of genomes

will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.

-Mu Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to allow exactly 1 member in

'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not compatible.

-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is not compatible with -M

(which is allowing multigenic families: having several members in any number of genomes).

'tree' module arguments:

--soft {fasttree,fastme,quicktree,iqtree,iqtree2}

Choose with which software you want to infer the phylogenetic tree. Default is IQtree.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

実行方法

１、２のPrepareとannotate コマンドを使うと、アセンブリ品質と分類の観点から要件に適合しないゲノムを排除しながらゲノムを集めることができる。

１、prepare - ゲノムのダウンロードとクオリティチェック

PanACoTAの最初のモジュールであるprepareを使うと、与えられたNCBI taxonomy IDから全ゲノムを取得することができる。このコマンドは、refseqやgenbankからゲノムをダウンロードし、Mash遺伝的距離に基づいて、冗長なゲノムや分類ミスしたゲノムをフィルタリングして保存することができる。

PanACoTA prepare -T 104099 -g "Acetobacter orleanensis" -p 1

-T Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. This will download all sequences of this species and all its sub-species.A comma-separated list of species taxids can also be provided. (Ex: -T 573 for Klebsiella pneumoniae)

-t Taxid to download. This can be the taxid of a sub-species, or of a specific strain. A taxid of a subspecies will download all strains in this subspecies EXCEPT the ones which have a specific taxid.A comma-separated list of taxids can also be provided.Ex: '-t 72407' will download all 'general' K. pneumoniae subsp. pneumoniae strains, and '-t 1123862' will download the strain K. pneumoniae subsp. pneumoniae Kp13 (not included in -t 72407, as it is a strain of the subspecies with a specific taxid).
-g Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")
-s {refseq, genbank} NCBI section to download: all genbank, or only refseq (default)
-l Assembly levels of genomes to download (default: all).
--cutn By default, each genome will be cut into new contigs when at least 5 'N' in a row are found in its sequence. If you don't want to cut genomes into new contigs when there are rows of 'N', put 0 to this option. If you want to cut from a different number of 'N' in a row, put this value to this option.
--l90 Maximum value of L90 allowed to keep a genome. Default is 100.
--min_dist By default, genomes whose distance to the reference is not between 1e-4 and 0.06 are discarded. You can specify your own lower limit (instead of 1e-4) with this option.
--max_dist By default, genomes whose distance to the reference is not between 1e-4 and 0.06 are discarded. You can specify your own lower limit (instead of 0.06) with this option.
-p Run 'N' downloads in parallel (default=1). Put 0 if you want to use all cores of your computer.

出力

f:id:kazumaxneo:20210908005228p:plain

refseq/bacteria/にはアセンブリごとに1つのフォルダが作られ、fasta.gz形式のアセンブリ配列とMD5SUMSが入っている。 Database_init/にはダウンロードしたすべてのアセンブリがfasta形式で含まれ、tmp_files/にはNが5以上続くストレッチで分割されたアセンブリがfasta形式で含まれている。assembly_summary-Vibrio_diabolicus.txtにはダウンロードした全ゲノムのリストと種名やIDなどが含まれている。PanACoTAは、廃棄されたゲノムとその理由をリストアップしたファイルも提供する。discarded-by-L90_nbcont-Acetobacter_orleanensis.lstファイルは品質管理ステップで廃棄されたゲノムのリストが含まれ、discarded-by-minhash-Acetobacter_orleanensis-0.0001_0.06.txtには、フィルタリングステップで廃棄されたゲノムのリストが含まれている。LSTINFO-Acetobacter_orleanensis-filtered-0.0001_0.06.txtには、チェックをパスしたゲノムのリストとコンティグ数、ゲノムサイズ、L904が保存される。

上の例では、４つのゲノムがダウンロードされ、MASH距離で遠かった3つのゲノムが排除と判断された (近すぎるゲノムも排除され得る)。フィルタリングをパスしたファイルだけ別に書き出されるわけではない。

フィルタリングをパスしたゲノムは LSTINFO～を確認する。
> cat LSTINFO-Acetobacter_orleanensis-filtered-0.0001_0.06.txt

refseqに登録されているゲノムだけでなく、genbankに登録されているすべてのゲノムをダウンロードしたい場合は、"-s genbank"オプションを使う。デフォルトは"-s refseq"でrefseqアセンブリはNCBIでのチェックが入っている分信頼性は高いが、最近ゲノムの登録が加速しており、最近数か月か1年以内に登録されたものだと、genbank アセンブリのみあってrefseqアセンブリは利用できないゲノムもかなりあったりするので注意する。

特定のアセンブリレベルのゲノムのみをダウンロードしたい場合は、オプション”-l”を使う（デフォルトは'all'）；ダウンロードしたいアセンブリレベルを'all', 'complete', 'chromosome', 'scaffold', 'contig'の間でコンマで区切って指定する。

ダウンロードは省略して、手持ちのゲノムをフィルタリングするためにprepareモジュールを使用することもできます。Documentを読んで下さい。

手持ちのゲノムをチェックして使うこともできる。

PanACoTA prepare --norefseq -o outdir -d your_genome_dir/

#閾値をゆるくする
PanACoTA  prepare --norefseq -o outdir -d your_genome_dir --min_dist 0 --max_dist 1 --cutn 0 --l90 1000

マニュアルとpreprintより

　ユーザーが同じ種のゲノムを解析していると仮定すると、それらのゲノムはコンティグの数と長さの点で比較的類似した特徴を持っているはずです。そのため、PanACoTAはまず、各ゲノムについて、コンティグの総数とL90（ゲノム全体の少なくとも90%を取得するために必要なコンティグの最小数）という2つの重要な指標を計算します。これら2つの変数の値が非常に高いということは、通常、シーケンシングやアセンブルの質が低いことを示しています。これらの値が高いと、多くの切り捨てられた遺伝子のアノテーションが行われ、パンゲノムのサイズが不自然に大きくなってしまいます（実際の遺伝子は、異なるファミリーに分類される多数のオープンリーディングフレームに分割されているため）。また、このような貧弱に組み立てられたゲノムは、小さなコンティグに支配されているため、遺伝的連鎖を研究することが重要な場合には、比較ゲノムの研究を複雑にします。しきい値はユーザーが指定することができます。デフォルトでは1000コンティグ以下、L90は100以下に設定されています。

　手順の第二部分は、冗長なゲノムや分類ミスのあるゲノムを除去するための専用のフィルターです。これは、Mashによって計算されたゲノムのペア間の遺伝的距離に基づいて行われます。この距離フィルタリングステップにMashを選んだのは、非常に高速に計算でき、密接に関連したゲノムに対しても正確であるからです。Mashは、MinHash技術を用いて、各ゲノム配列を代表的なk-merのスケッチに変換します。次に、完全な配列ではなく、これらのスケッチを比較します。この出力Mash距離Dは、ブラストアルゴリズムを用いた全ゲノム配列比較に基づく平均ヌクレオチド同一性(ANI)のようなアラインメントベースの尺度と強く相関します。D≈1 - ANI。ANIが90-100%の範囲であれば、スケッチサイズを大きくするとMash距離との相関はさらに強くなります。パンゲノムは通常、単一の細菌種に対して計算されるので、ここでは少なくとも94%の同一性を持つゲノムを識別するためにMashを使用しています。最近のプログラムでは、Mashよりもわずかに高い精度を示すものがいくつか発表されていますが、システマティックなフィルタとして使用するには遅すぎることがわかりました。

　細菌種は通常、94%以上の同一性を持つゲノムのグループとして定義され、これは通常、Dの上限値として使用されます(max_mash_distは変更可能なパラメータで、デフォルトでは0.06に固定されています)。もう一方の極端な例として、非常に高い類似性を持つ（低いマッシュ距離に対応する）ゲノムは、非常に類似した情報を提供します。これらを除外することで、解析に必要な計算資源を減らし、特定のクレードでのオーバーサンプリングを減少させる（＊メジャーな病原性細菌など）。PanACoTAでは、デフォルトでmin_mash_distを10^-4に設定していますが、このパラメータはユーザーが指定することもできます。この距離は、平均して10遺伝子ごとの1ポイントの変化を表しており、多くのドラフトゲノムのシーケンシングとアセンブルの精度に近いかもしれません。

２、annotate - ゲノムの均一なアテーション

PanACoTAの２つ目のモジュールであるannotateを使ってゲノムにアノテーションをつける。アノテーションの入力は、(multi)fasta形式のゲノム配列で、<db_path>で指定されたディレクトリに保存されている必要がある。ただし、提供するリストファイルにかかれたファイルのみ対象になるので、<db_path>で指定されたディレクトリにゲノム配列以外のファイルが含まれていても問題ない。

リストファイル（list_genomes.ls）には、アノテートするゲノム名を1行ずつ書く。リストファイルにはファイルのパスを書くわけではないので注意する（ゲノムファイルのパスは-d <pasth>で指定）。

genome1.fst
genome2.fst :: GEN2
genome3-chromo.fst genome3-plasmid.fst :: .1216

リストファイルとリストのファイルのパス、出力ディレクトリ、名前 (4文字) などを指定する。リストファイルは、prepareの出力のフィルタリングをパスしたゲノムのリストファイルである”LSTINFO～”のパス部分だけを消して使うこともできる（2列目以降は無視される）。"

-r annotate"は出力ディレクトリ。デフォルトでは、L90（全長の長さの90%に達した時のコンティグ数）が100以上のゲノムはアノテーションされずに捨てられる。これはそれなりに厳しい閾値なので、連続性の低いドラフトゲノムはかなり捨てられる事になる。例えば200に変えるには”--l90 200”とタイプする。

PanACoTA annotate -l list -d genome_dir -r annotate -n <name> --threads 20 --l90 100

-d DB_PATH Path to folder containing all multifasta genome files
-l LIST_FILE File containing the list of genome filenames to annotate (1 genome per line). Each genome must be in multi-fasta format. You can specify the species name (4 characters) you want to give to some genome, as well as the
download (or any other reason) date of your choice. Format 'genome_name :: name.date'. name and date are optional. See doc for more information on this file format. If you want to run this module from 'prepare_module' results, use '--info' instead.
-r RES_PATH Path to folder where output annotated genomes must be saved
-Q Add this option if you want only to do quality control on your genomes (cut at 5N if asked, calculate L90 and number of contigs and plot their distributions). This allows you to check which genomes would be annotated with the given parameters, and to modify those parameters if you want, before you launch the annotation and formatting steps.
--nbcont Maximum number of contigs allowed to keep a genome. Default is 999.
--l90 Maximum value of L90 allowed to keep a genome. Default is 100.
-n Choose a name for your annotated genomes. This name should contain 4 alphanumeric characters. Generally, they correspond to the 2 first letters of genus, and 2 first letters of species, e.g. ESCO for Escherichia Coli.
--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.
--small If you use Prodigal to annotate genomes, if you sequences are too small (less than 20000 characters), it cannot annotate them with the default options. Add this option to use 'meta' procedure.
--threads Specify how many threads can be used (default=1)

出力

f:id:kazumaxneo:20210908023248p:plain

このステップの出力は、ゲノムごとに5つのファイルで構成されている：オリジナルの配列、遺伝子、タンパク質（すべてfasta形式）、すべてのアノテーションを含むgffファイル、およびサマリー情報ファイル、となる。例えばgff3/には各ゲノムからのgff3形式のアノテーションされたファイルだけが保存されている。

QC_L90-list_genomes.png

f:id:kazumaxneo:20210908023316p:plain

-Qオプションを付けて実行すると、アノテーションは無しでクオリティチェックのみ行われる。

マニュアルとpreprintより

　ゲノムのアノテーションは、PanACoTAのアノテートモジュールによって行われます。入力は、prepareモジュールまたはユーザーから直接提供されたfasta配列のデータベースです。これらのゲノムの品質管理（コンティグ数や L90）に関する情報がない場合は、ここで品質管理を行います（品質管理の詳細については前のセクションを参照してください）。このモジュールの目的は、データセット全体の遺伝子位置（および機能）の均一なアノテーションを提供することであるため、PanACoTA上でProdigalを使用して全てのゲノムの遺伝子位置を特定し、Prokkaを用いて全てのゲノムを均一にアノテーションします。また、その際には、BLAST+を用いてUniprotから取得したタンパク質のデータベース内のホモログを検索したり、HMMER3を用いてTIGRFAMやPFAMから選択したプロファイルに合致するタンパク質を検索したりと、一連のプログラムを用いて機能的なアノテーションも追加します。すべてのアノテーションされた配列は、標準的な配列ヘッダー形式を用いてリネームされています。各遺伝子のヘッダーには20文字の文字が含まれており、遺伝子のゲノムとコンティグ、ゲノム中の相対位置、コンティグの境界にあるかどうかについてのヒトが読める情報を提供しています。

３、pangenome - パンゲノムの計算

PanACoTAのpangenomeモジュールを用いてパンゲノムの計算を行う。入力は全ゲノムの全タンパク質の集合で、annotateモジュールで生成されたProteinsディレクトリの配列ファイル（.prtファイル; multi-fasta）を使う(ヘッダーがPanACoTAの命名規則に則っている)。リストファイルはannotateモジュールで生成されたLSTINFO-.txt（マニュアルの解説）をそのまま使用できる。

PanACoTA pangenome -d annotate/Proteins -l annotate/LSTINFO-.lst -n <pangenome-name> -o outdir --threads 20

-l File containing the list of all genomes to include in the pan-genome, 1 genome per line: it can be the LSTINFO-<list_file>.lst file of 'PanACoTA annotate' module

-n Name of the dataset which will be clustered (for example, SAEN1234 for 1234 Salmonella enterica genomes). This name will be used to name the protein databank, a well as the pangenome files.
-d Path to the folder containing all protein files corresponding to the genomes of the dataset (output directory 'Proteins' of 'PanACoTA annotate' module).
-o Output directory, where all results must be saved (including tmp folder)
-i Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.
-f Use this option if you want to give the name of the pangenome output file (without path).
-c {0,1,2} Choose the clustering mode: 0 for 'set cover', 1 for 'single-linkage', 2 for 'CD-Hit'. Default is 'single-linkage' (1)
-s use this option if you want to save the concatenated protein databank in another directory than the one containing all individual protein files ('Proteins' folder).
--threads add this option if you want to parallelize on several threads. Indicate on how many threads you want to parallelize. By default, it uses 1 thread. Put 0 if you want to use all threads of your computer.

出力

f:id:kazumaxneo:20210908112321p:plain

パンゲノムファイルは、1ファミリーにつき1行で構成されている。最初の列はファミリー番号で、その他はすべてのファミリーメンバーになる。

PanGenome-pangenome.All.prt-clust-0.8-mode1.lst

f:id:kazumaxneo:20210908112853p:plain

各ファミリーごとに1行、全メンバーのリストを含む。

PanACoTAは、各ファミリーの構成の概要を示す表形式のファイルと、パンゲノムの量的および質的マトリックスも提供する。

マニュアルとpreprintより

　pangenomeはゲノム中のすべてのタンパク質ファミリーの集合です。その計算には、すべてのタンパク質のペア間の比較が含まれており、その複雑さは遺伝子の数（つまりゲノムの数）の二乗になります。信頼性の高いパンゲノムを合理的な時間で生成するために、PanACoTAはMMseqs2スイートを呼び出します。mmseqs検索モジュールは、非常に優れたスピードと感度のトレードオフを持っています。時間を短縮するために、感度を上げたり、速度を下げたりしながら、3つの連続した検索ステージを使用しています。すべてが高度に並列化され、複数のレベルで最適化されています。最初のステップでは、高い非類似性、すなわち、少なくとも2つの連続したkmerマッチを持たない配列を排除することで、最大99.9%の配列をフィルタリングします。第2段階では、ギャップなしのアラインメントを使用して残りの99%の配列をフィルタリングします。これにより、完全なアラインメントではなく、スコアのみが計算されるSmith-Watermanアライメントの最適化されたバージョンを使用して処理する配列の量が少なくなります。

　MMseqs2スイートに含まれるmmseqs clusterモジュールを使用し、デフォルトのCascaded clusteringオプションを設定しました。このモジュールは主に2つのステップで動作します。まず、線形時間タンパク質配列クラスタリングアルゴリズムであるlinclustをプレフィルターとして使用してタンパク質をクラスタリングします。次に、この最初のステップの代表的な配列が mmseqs 検索モジュールによって処理され、その結果に応じてクラスタリングされます。この第２ステップを３回繰り返し、その都度mmseqs検索アルゴリズムモジュールで高感度化します。

　クラスタリングの段階では、PanACoTAはConnected componentモードを使用します。このモードでは、相同遺伝子のペアをマージするために遷移的接続を使用します。各ノードはタンパク質であり、2つのタンパク質が類似している（与えられた閾値以上の類似度）場合、2つのタンパク質の間にエッジが存在します。各ノードはタンパク質であり、2つのタンパク質が類似していれば（類似度が与えられた閾値を超えていれば）、2つのタンパク質の間にエッジがあります。必要に応じて、ユーザーは、PanACoTAパンゲノムモジュールを起動している間に、専用のパラメータを使用して、他の2つのクラスタリングモード（Greedy Set cover、またはGreedy Incremental）のいずれかを選択することができます。重要なことは、mmseqs2のオプションを調整することで、配列類似性解析を非常に高速に、または非常に高感度に行うことができるということです。PanACoTAでは、ユーザーは主要なパラメータ-min-seq-idと-cluster-modeを変更し、結果への影響を調べるためにmmseqsクラスタモジュールを再実行することができます。より具体的なmmseqs2パラメータは、当面の間、スタンドアロン版のプログラムで使用されます。

　なお、ドラフトゲノムの場合、ゲノム中の遺伝子間のシンテニーは考慮していませんが、ドラフトゲノムの場合、これは限られた興味しかないと考えています。生成されたファミリーを調べてみると、ほとんど差がなく、有意ではないことがわかりました。しかし、ユーザーが非常によくアセンブルされたゲノムを持っている場合や、シンテニーを考慮する特別な理由がある場合には、panOCT、SynerClust、PANINIなどのツールが開発されています。出力されるパンゲノムのファミリーごとの要素数を示すファイルは、直接TreeWASの入力として使用することができます。

４、corepers - コア・persistent遺伝子ファミリーの計算

corepersモジュールによって、多数の分類群に存在するコア・persistent遺伝子ファミリーの分類を行う。これには、３のpangenomeモジュールで生成されたpangenomeファイルを指定する。"-t"オプションで、ゲノムの何パーセントに含まれている遺伝子をpersistentな遺伝子（*1）とするか定義する（デフォルトは1.0で全てのゲノムに存在するコア遺伝子となっている）。”-X"や"-M"オプションを使うと定義をより緩やかにすることができる。

PanACoTA corepers -p pangenome/PanGenome-pangenome.All.prt-clust-0.8-mode1.lst -o core

#90%
PanACoTA corepers -p pangenome/PanGenome-pangenome.All.prt-clust-0.8-mode1.lst -o core -t 0.9

#90%、複数ヒット許可（全ゲノム）
PanACoTA corepers -p pangenome/PanGenome-pangenome.All.prt-clust-0.8-mode1.lst -o core -t 0.9 -M

-p PanGenome file (1 line per family, first column is fam number)
-o Specify the output directory for your core/persistent genome.
-t min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the minimum number of genomes will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.
-M Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to allow exactly 1 member in 'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not compatible.
-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is not compatible with -M (which is allowing multigenic families: having several members in any number of genomes).
-F When you specify the '-tol' option, with a number lower than 1, you can add this option to use floor('tol'*N) as a minimum number of genomes instead of ceil('tol'*N) which is the default behavior.

出力

f:id:kazumaxneo:20210908114219p:plain

PersGenome_PanGenome-pangenome.All.prt-clust-0.8-mode1.lst-all_1.lst

f:id:kazumaxneo:20210908114337p:plain

マニュアルとpreprintより

初期の研究では、パンゲノムマトリックスを用いて、コアゲノムという単一コピーで全ゲノムに存在する遺伝子ファミリーを同定していました。しかし、データセットのゲノム数の増加に伴い、コアゲノムのサイズは大幅に減少する傾向にあります。これは、シーケンスやアノテーションの誤りや、集団内でのまれな欠失多型が、入力ゲノム数の増加に伴ってコア遺伝子の数を急速に減少させるためです。この問題を克服するために、現在では一般的にpersistentゲノムを同定することが行われています。あるファミリーは、ゲノムの少なくともN%のメンバーを含んでいれば、persistentゲノムに含まれていることになります。Nはユーザーが定義します。デフォルト値は、1000以上のゲノムを持つデータセットでは95%です。persistentゲノムは、まれな（真の、または人工的な）変種に対してよりロバストです。一方、persistentゲノムを計算する目的が系統樹の作成や集団遺伝学データの解析である場合、異なるしきい値を持つセットを作成したいと思うかもしれません。実際、あまりにも高い値を設定すると、ギャップの少ないpersistentな遺伝子のセットが小さくなり、堅牢な系統樹を推定するのに十分な値になるかもしれません。一方で、これは、正の選択や組換えイベントの検出のような他のタイプの解析から多くの遺伝子ファミリーを除外することになります。また、persistentなゲノムの定義も、その後のデータの利用方法によって異なる場合があります。PanACoTAでは、3つのタイプのpersistentゲノムを定義しています（論文図3参照）。

Strict-persistent: 少なくともN%のゲノムに正確に1つのメンバーを含むファミリー（N = 100はコアファミリーであることを意味します）。この定義は、ゲノムごとに複数のコピーが存在することを考慮せずに系統を再構築する場合に特に有効です。

Mixed-persistent: 少なくともN%のゲノムが正確に1つのメンバーを持ち、他のゲノムが0か、複数のメンバーを持つファミリー。この定義は他の2つの間の中間的なもので、厳密-persistentなものを含み、multi-persistentなものも含まれます。

Multi-persistent: N%のゲノムに少なくとも1つのメンバーを持つファミリー。この定義は、ほぼユビキタスなタンパク質ファミリーの多様化のパターンを分析する上で興味深い。

ユーザーが固定したしきい値を使用するのではなく、統計的アプローチを使用してpersistentなゲノムを識別したい場合、アノテートモジュールによって生成されたgffファイルはPPanGGOLiNと互換性がある。PPanGGOLiNは、ここでのmulti-persistentな定義でのpersistentゲノムを生成します。

５、align - persistentな遺伝子ファミリーのアラインメントを行う

alignモジュールによってpersistentな遺伝子ファミリーのアラインメントを行う。”-c”で定義したコア・persistent遺伝子ファミリーファイルを指定する。”-l”で指定するリストファイルには、annotateモジュールで出力されたLSTINFO-.txtが使える。”-d”では、annotateモジュールで出力されたディレクトリを指定する。計算時間がかかるので、”-threads”オプションを使ってスレッド数を増やせるだけ増やしておく。”-n”で指定した名前は、出力される全てのファイルのprefixとして使用される。

PanACoTA align -c core/PersGenome_PanGenome-pangenome.All.prt-clust-0.8-mode1.lst-all_1.lst \
 -l annotate/LSTINFO-.lst -n <name_of_the_dataset> -d annotate/ -o align --threads 20

-c Core or persistent genome whose families must be aligned.
-l File containing the list of all the genomes you want to align from their core/persistent families. 1 genome per line: it can be the LSTINFO-<list_file>.lst file of 'PanACoTA annotate' module. Here, only the first column (genome name without extension) will be used. The final alignment file will contain 1 alignment per genome in this file.
-n Name of the dataset which will be aligned (for example, SAEN1234 for 1234 Salmonella enterica genomes). This name will be used to name the alignment file.
-d Path to the folder containing the directories 'Proteins' and 'Genes', created by 'PanACoTA annotate'.
-o Output directory, where all results must be saved
--threads add this option if you want to parallelize on several threads. Indicate on how many threads you want to parallelize. By default, it uses 1 thread. Put 0 if you want to use all threads of your computer.

出力

f:id:kazumaxneo:20210908144646p:plain

phylo-xxx/にすべてのコア・persistent遺伝子ファミリーのアラインメントを連結したファイルが出力される（xxxは"-n"で指定したprefix）。このファイルをツリー推論に使う。出力ディレクトリには、各コア・persistent遺伝子ファミリーについて、その遺伝子とタンパク質の配列をアラインメントしたファイルなども含まれている。

マニュアルとpreprintより

　厳密なpersistentゲノム（コアゲノム）を使用する場合は、すべての遺伝子がアラインメントされます。他の定義のpersistentゲノムを使用した場合、遺伝子を欠いていたり、複数のコピーを持っていたりするゲノムもありますが、これらのアラインメントから系統樹を作成するためには、あらかじめそのような場合に対応する必要があります。ある遺伝子ファミリーのメンバーが欠落していたり、複数のメンバー（混合またはmulti-persistent）が存在する場合、PanACoTAは他のアラインメントされた遺伝子と同じ長さのギャップ（'-'）を追加します。いくつかの"-"を追加しても、系統再構成にはほとんど影響を与えません。例えば、アラインメントマトリックスに最大60%の欠落データを追加しても、有益なアラインメントが得られることが示されています[ref.42]。著者らの経験では、このアプローチを種内パンゲノムに適用した場合、通常は1%未満の欠落が含まれています。このように、欠損データの影響は、より多くの遺伝子からの系統シグナルを使用する利点に比べれば、無視できる程度のものであるはずです（すなわち、厳密な一貫性のあるゲノムを使用する場合とは対照的です）。アライメントは、タンパク質配列のレベルで行うと、より正確になります。これは、コドンベースのヌクレオチド・アライメントを生成するという付加的な利点があり、コーディング配列に対する選択圧を研究するために使用することができます。そのため、PanACoTAは配列を翻訳し、対応するタンパク質をアラインメントした後、それらをDNAに逆翻訳してヌクレオチドアラインメントを取得します。この最後のステップは、各アミノ酸を元のコドンに置き換えることで構成されています。したがって、このプロセスの最後には、アラインメントされた配列は元の配列と同じになります。

　PanACoTAは、MAFFTを用いた多重配列アラインメントを行います。PanACoTAには、非常に大規模なデータセットを扱うために、ある程度の精度を犠牲にしながらも、はるかに高速なアライメントを行うことができるオプションがあります。この精度の低下は、種内のオーソロガスな遺伝子ファミリーの場合のように、非常に類似した配列の場合では影響は低く、PanACoTAはpersistentなゲノムのアラインメントを非常に迅速に行うことができることを意味します。

６、tree - ツリーの再構成

系統推論は、PanACoTAのtreeモジュールを用いて行う。このモジュールは、AlignモジュールのアラインメントやFastaフォーマットの他のアラインメントを入力として使う。

#iqtree
PanACoTA tree -a align/phylo-xxx/xxx.nucl.grp.aln -o tree --threads 40 -b 1000

#fasttree ("FastTreeMP"しか認識しないので注意)
PanACoTA tree -a align/phylo-xxx/xxx.nucl.grp.aln -o tree -s fasttree

このステップまでくれば、PanACoTAに頼らずにiqtreeなどを走らせても良いと思われます。

ちなみに、系統樹に視覚化した時、アノテーションをかけてリネームされているので、元の種名や株名が分からなくなります。その時はannotate/のLSTINFO-か、prepare/のリストファイルを見て下さい。

マニュアルとpreprintより

系統推論に必要な時間は、データセットのサイズに応じて非常に速く増加するため、この部分がパイプライン全体の中で最も時間がかかる部分です。最尤解析の効率的な実装でさえ、サイト数と分類群数の積でスケーリングしてしまうため、大規模なデータセット（数千の分類群、各分類群のサイト数が1万サイト以上）の場合には問題となります。PanACoTAは系統樹を得るためにいくつかの異なる方法を提案している。IQ-TREE、FastTreeME、fastME、Quicktree です。どのようなソフトウェアを使用しても、ツリーモジュールはFasta形式のヌクレオチドアラインメント（例えば、alignモジュールの出力のようなもの）を入力として受け取り、Newick形式のツリーを返します。ニーズに応じて、ユーザーはこれらの方法のいずれかを選択して系統樹を推論することができます。これらのツリーは、IQ-TREEのオプションを変更するなどして、より厳密な系統樹推論を行うために使用することができます。

感想

　これまでのパンゲノム解析では、ゲノムを集めたりアノテーションをつけるのはユーザーの責任として、ツールとは無関係のステップになっていました（もしくは、おまけ扱いでスクリプトのみ利用できた）。しかし、パンゲノムの計算では、入力データの品質が鍵になります。ドラフトゲノムの品質の悪さやアノテーションの揺らぎに起因するコアゲノムセットの変動をできる限りゼロに近づけるのは極めて重要な事です。このPanACoTAは、モジュール方式を採用することで、その前段階ステップまで踏み込んでコードと環境を整備しており、一定の水準のゲノムだけを再現性良く集めることができます。均一な水準で得られたゲノム情報だけ使う事で、より正確にパンゲノムを定義できるはずです。

　ツールの完成度が高いだけでなく、細部まで説明された完全なマニュアルと、ユーザーの立場に立ったガイドが提供されています。初めて微生物の比較ゲノム解析を行おうとしている人にも利用しやすい環境が整えられているのではないでしょうか。個人的には、このツールは開発の終了したroaryの乗り換え先に使えそうに見えたので、プレプリント発表直後から注目していました。しかしなかなかまとまった時間が無くて、まとめるのに時間がかかってしまいました。正式なDocumentを見た方が良いのは間違いありませんが、よろしければ参考になさって下さい。

引用

PanACoTA: a modular tool for massive microbial comparative genomics
Amandine Perrin, Eduardo P.C. Rocha

NAR Genom Bioinform. 2021 Mar; 3(1): lqaa106.

PanACoTA: A modular tool for massive microbial comparative genomics

Amandine Perrin, Eduardo P.C. Rocha

bioRxiv, Posted September 18, 2020

*１

persistent遺伝子

コア遺伝子を厳密に定義する時の問題点として、解析対象のゲノム数が増えると、gene lossや技術的な問題（アセンブリ、アノテーションなど）によってコアゲノムのサイズが減少することが知られています。よくあるのは、1個か2個だけ品質が低かったり分類が間違って登録されているゲノムが含まれているために、100％保存されているコア遺伝子が大きく減ってしまうパターンです（例）。そのため，大多数のゲノムで保存されているpersistent（持続的な、落ちない、変質しにくい）な遺伝子に注目することが提案されています。