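Using the all command of PanACoTA, a pangenome analysis tool

PanACoTA is a modular pipeline, so you can work through it step by step: genome preparation, quality checking and filtering, annotation, pangenome computation, definition of core and persistent genes, and phylogenetic analysis. The all command (described in the PanACoTA documentation) instead runs the whole process in a single go. Following on from yesterday's post, today I walk through how to use this all command.

First, activate the environment created in yesterday's post.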

conda activate panacota

> PanACoTA all -h

usage: PanACoTA all [-c CONFIGFILE] -o OUTDIR [--threads THREADS] [-T NCBI_SPECIES_TAXID] [-s NCBI_SPECIES] [-l LEVELS] [--cutn CUTN] [--l90 L90] [--nbcont NBCONT] [--prodigal] -n NAME [-i MIN_ID] [--tol TOL] [-Mu] [-X] [--soft {fasttree,fastme,quicktree,iqtree,iqtree2}] [-v] [-q] [-h]

Large scale comparative genomics tools

-------------------------------------------

=> Run all PanACoTA modules

General arguments:

-c CONFIGFILE Path to your configuration file, defining values of parameters.

-o OUTDIR Path to your output folder, where all results from all 6 modules will be saved.

--threads THREADS Specify how many threads can be used (default=1)

'prepare' module arguments:

-T NCBI_SPECIES_TAXID Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. A comma-separated list of taxid can also be provided.

-s NCBI_SPECIES Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")

-l LEVELS, --assembly_level LEVELS Assembly levels of genomes to download (default: all). Possible levels are: 'all', 'complete', 'chromosome', 'scaffold', 'contig'.You can also provide a comma-separated list of assembly levels. For ex: 'complete,chromosome'

Common arguments to 'prepare' and 'annotate' modules:

--cutn CUTN By default, each genome will be cut into new contigs when at least 5 'N' in a row are found in its sequence. If you don't want to cut genomes into new contigs when there are rows of 'N', put 0 to this option. If you want to cut from a different number of 'N' in a row, put this value to this option.

--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.

--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.

'annotate' module arguments:

--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.

-n NAME Choose a name for your annotated genomes. This name should contain 4 alphanumeric characters. Generally, they correspond to the 2 first letters of genus, and 2 first letters of species, e.g. ESCO for Escherichia Coli.

'pangenome' module arguments:

-i MIN_ID Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.

'corepers' module arguments:

--tol TOL min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the minimum number of genomes will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.

-Mu Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to allow exactly 1 member in 'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not compatible.

-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is not compatible with -M (which is allowing multigenic families: having several members in any number of genomes).

'tree' module arguments:

--soft {fasttree,fastme,quicktree,iqtree,iqtree2} Choose with which software you want to infer the phylogenetic tree. Default is IQtree.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.
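
The --tol help text describes the threshold as ceil('tol'*N) by default, or floor('tol'*N) with '-F'. To make the arithmetic concrete, here is a minimal sketch that only reproduces that formula. The function name min_genomes is a hypothetical helper of mine, not part of PanACoTA, and the '-F' option does not appear in the usage line of all shown above, so I have not confirmed whether it can be passed to all.

```shell
# Hypothetical helper (not part of PanACoTA): minimum number of genomes in
# which a family must be present, following the formula in the --tol help text.
#   $1 = tol (between 0 and 1)
#   $2 = N   (total number of genomes)
#   $3 = "ceil" (default) or "floor"
min_genomes() {
  awk -v tol="$1" -v n="$2" -v mode="${3:-ceil}" 'BEGIN {
    x = tol * n
    f = int(x)                          # x >= 0, so truncation equals floor
    if (mode == "ceil" && x > f) f = f + 1
    print f
  }'
}

min_genomes 0.95 50          # ceil(47.5)  -> prints 48
min_genomes 0.95 50 floor    # floor(47.5) -> prints 47
min_genomes 1 50             # tol = 1 (core genome) -> prints 50
```

With 50 genomes and tol = 0.95, a family would therefore have to be present in 48 genomes under the default rounding, and in 47 with floor.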

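Procedure

With the all command you can set parameters through command-line options, but a config file that lists every parameter lets you tune them in finer detail. The source repository contains an Examples directory with sample input files, so clone the repository to get it.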

git clone https://github.com/gem-pasteur/PanACoTA.git

> ls -l PanACoTA/Examples/input_files/

[Screenshot: file listing of PanACoTA/Examples/input_files/]

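configfile.ini lists every parameter that can be set through the config file. Let's take a look.

Open PanACoTA/Examples/input_files/configfile.ini:

[Screenshot: configfile.ini as shipped]

Uncomment a parameter and give it a value (a number or a string) and it will be picked up. Here I activated a few of the parameters.

[Screenshot: configfile.ini with several parameters uncommented]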
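
Before running, it can help to check which lines of the config file are actually active. The sketch below is a generic filter for INI-style files, assuming comments start with '#' or ';'. The function name active_params is a hypothetical helper, and the section and key names in the sample file are placeholders for illustration, not copied from the real configfile.ini.

```shell
# Hypothetical helper: print only the active lines of an INI-style file,
# skipping blank lines and lines whose first non-blank character is '#' or ';'.
active_params() {
  grep -Ev '^[[:space:]]*(#|;|$)' "$1"
}

# Sample file for demonstration only; section and key names are placeholders.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
[pangenome]
# min_id = 0.8

[corepers]
tol = 0.95
EOF

active_params "$sample"
# prints:
# [pangenome]
# [corepers]
# tol = 0.95

rm -f "$sample"
```

To check your own edited file, you would call active_params on it, for example active_params configfile.ini.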

Now run it.

PanACoTA all -c configfile.ini -o outdir -n test
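
The -n help text asks for a name of 4 alphanumeric characters (the name test used here satisfies that). As a small sketch, the check below validates a name before the command is assembled. valid_name is a hypothetical helper, ESCO is the example name from the help text, and only flags listed in the help output are used. The command is printed with echo rather than executed, and how command-line values such as --threads interact with values set in the config file is not covered here, so check the PanACoTA documentation for that.

```shell
# Hypothetical helper: succeed only if the name is exactly 4 alphanumeric characters.
valid_name() {
  case "$1" in
    *[!0-9A-Za-z]*) return 1 ;;   # contains a non-alphanumeric character
  esac
  [ "${#1}" -eq 4 ]
}

name="ESCO"
if valid_name "$name"; then
  # echo prints the command instead of running it; drop echo to actually run.
  echo PanACoTA all -c configfile.ini -o outdir -n "$name" --threads 4
else
  echo "name must be 4 alphanumeric characters: $name" >&2
fi
# prints: PanACoTA all -c configfile.ini -o outdir -n ESCO --threads 4
```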

Output

outdir/

[Screenshot: contents of outdir/]

The run completed successfully.

Reference

PanACoTA: a modular tool for massive microbial comparative genomics

Amandine Perrin, Eduardo P.C. Rocha

NAR Genom Bioinform. 2021 Mar; 3(1): lqaa106.
