Informatics on the Mac (macでインフォマティクス)


A collection of informatics notes related to HTS (NGS).

Using the all command of PanACoTA, a pangenome analysis tool

 

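PanACoTA is a modular pipeline, so you can work through it step by step: preparing genomes, quality checking and filtering, annotation, computing the pangenome, defining core and persistent genes, and phylogenetic analysis. With the all command (documentation), however, the whole process can be run in one go. Following on from yesterday's post, today I walk through how to use the all command.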

Activate the environment created in yesterday's post.

conda activate panacota

> PanACoTA all -h

# PanACoTA all -h

usage: PanACoTA all [-c CONFIGFILE] -o OUTDIR [--threads THREADS] [-T NCBI_SPECIES_TAXID] [-s NCBI_SPECIES] [-l LEVELS] [--cutn CUTN] [--l90 L90] [--nbcont NBCONT] [--prodigal] -n NAME [-i MIN_ID]

                    [--tol TOL] [-Mu] [-X] [--soft {fasttree,fastme,quicktree,iqtree,iqtree2}] [-v] [-q] [-h]

 

 ___                 _____  ___         _____  _____

(  _`\              (  _  )(  _`\      (_   _)(  _  )

| |_) )  _ _   ___  | (_) || ( (_)   _   | |  | (_) |

| ,__/'/'_` )/' _ `\|  _  || |  _  /'_`\ | |  |  _  |

| |   ( (_| || ( ) || | | || (_( )( (_) )| |  | | | |

(_)   `\__,_)(_) (_)(_) (_)(____/'`\___/'(_)  (_) (_)

 

       Large scale comparative genomics tools

 

     -------------------------------------------

 

=> Run all PanACoTA modules

 

General arguments:

  -c CONFIGFILE         Path to your configuration file, defining values of parameters.

  -o OUTDIR             Path to your output folder, where all results from all 6 modules will be saved.

  --threads THREADS     Specify how many threads can be used (default=1)

 

'prepare' module arguments:

  -T NCBI_SPECIES_TAXID

                        Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. A comma-separated list of taxid can also be provided.

  -s NCBI_SPECIES       Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")

  -l LEVELS, --assembly_level LEVELS

                        Assembly levels of genomes to download (default: all). Possible levels are: 'all', 'complete', 'chromosome', 'scaffold', 'contig'.You can also provide a comma-separated

                        list of assembly levels. For ex: 'complete,chromosome'

 

Common arguments to 'prepare' and 'annotate' modules:

  --cutn CUTN           By default, each genome will be cut into new contigs when at least 5 'N' in a row are found in its sequence. If you don't want to cut genomes into new contigs when there

                        are rows of 'N', put 0 to this option. If you want to cut from a different number of 'N' in a row, put this value to this option.

  --l90 L90             Maximum value of L90 allowed to keep a genome. Default is 100.

  --nbcont NBCONT       Maximum number of contigs allowed to keep a genome. Default is 999.

 

'annotate' module arguments:

  --prodigal            Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.

  -n NAME               Choose a name for your annotated genomes. This name should contain 4 alphanumeric characters. Generally, they correspond to the 2 first letters of genus, and 2 first

                        letters of species, e.g. ESCO for Escherichia Coli.

 

'pangenome' module arguments:

  -i MIN_ID             Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.

 

'corepers' module arguments:

  --tol TOL             min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the

                        minimum number of genomes will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.

  -Mu                   Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to

                        allow exactly 1 member in 'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not

                        compatible.

  -X                    Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is

                        not compatible with -M (which is allowing multigenic families: having several members in any number of genomes).

 

'tree' module arguments:

  --soft {fasttree,fastme,quicktree,iqtree,iqtree2}

                        Choose with which software you want to infer the phylogenetic tree. Default is IQtree.

 

Others:

  -v, --verbose         Increase verbosity in stdout/stderr.

  -q, --quiet           Do not display anything to stdout/stderr. log files will still be created.

  -h, --help            show this help message and exit

 

For more details, see PanACoTA documentation.
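As a rough illustration of the help text above, a run that relies on command-line options alone could be assembled as in the sketch below. It uses only flags listed in the help output (`-o`, `-n`, `-T`, `-l`, `--threads`, `-i`, `--tol`). The function name `build_panacota_all`, its argument order, and the example values are assumptions made for illustration. The function only prints the command so it can be checked before anything is downloaded.

```shell
#!/bin/sh
# build_panacota_all TAXID NAME OUTDIR [THREADS]
# Prints (does not run) a 'PanACoTA all' command built from flags shown in the help text.
# The helper name and argument order are hypothetical.
build_panacota_all() {
    taxid=$1          # NCBI species taxid, passed to -T
    name=$2           # 4-character genome name, passed to -n (e.g. ESCO)
    outdir=$3         # output folder, passed to -o
    threads=${4:-1}   # --threads; the help text gives 1 as the default

    # -i 0.8 and --tol 1 repeat the defaults stated in the help text;
    # -l complete restricts the download to complete assemblies.
    printf 'PanACoTA all -o %s -n %s -T %s -l complete --threads %s -i 0.8 --tol 1\n' \
        "$outdir" "$name" "$taxid" "$threads"
}

# Example values only: 562 is the NCBI species taxid of Escherichia coli.
build_panacota_all 562 ESCO outdir 4
```

Once the printed command looks right, it can be pasted into the terminal or run with `eval "$(build_panacota_all 562 ESCO outdir 4)"`.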

 

 

Procedure

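With the all command you can set parameters through command-line options, but a config file listing all of the parameters allows finer tuning. The repository's source tree contains an Examples directory, so clone the repository to get it.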

git clone https://github.com/gem-pasteur/PanACoTA.git

> ls -l PanACoTA/Examples/input_files/


configfile.ini lists every parameter that can be set through the config file. Let's take a look.

Open PanACoTA/Examples/input_files/configfile.ini


 

A parameter is recognized once you remove its comment marker and enter a number or string. I activated a few of the parameters.
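Editing the file by hand works fine, but the same uncommenting step can also be scripted. The sketch below is a minimal example that assumes the commented parameters look like `# key = value`. The helper `enable_param` and the key names in the demo file are hypothetical and are not copied from the real configfile.ini. It writes through a temporary file instead of `sed -i`, because the `-i` option behaves differently in the BSD sed shipped with macOS and in GNU sed.

```shell
#!/bin/sh
# enable_param FILE KEY VALUE
# Rewrites a line such as "# KEY = old" (or "#KEY=old") as "KEY = VALUE".
# Hypothetical helper: KEY must not contain regex metacharacters,
# and neither KEY nor VALUE may contain the '|' character.
enable_param() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    sed "s|^#[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file" > "$tmp" &&
        mv "$tmp" "$file"
}

# Demo on a throwaway file. Section and key names here are illustrative only.
demo=$(mktemp)
printf '%s\n' '[example_section]' '# threads = 1' '# min_id = 0.8' > "$demo"
enable_param "$demo" threads 4
cat "$demo"
```

Lines that are still commented are left untouched, so only the parameters you name become active. For the real file, one would first copy it into the working directory (for example `cp PanACoTA/Examples/input_files/configfile.ini .`) and check the actual key names before calling the helper.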




 

Run it.

PanACoTA all -c configfile.ini -o outdir -n test

Output

outdir/


 

The run completed successfully.
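Because the results of every module are written under the single output folder, a quick per-subdirectory file count is one way to confirm that each step produced something. The sketch below makes no assumption about PanACoTA's actual subdirectory names. The helper `summarize_outdir` is hypothetical and simply walks whatever top-level subdirectories exist.

```shell
#!/bin/sh
# summarize_outdir DIR
# Prints "<file count><TAB><subdirectory>" for each top-level subdirectory of DIR.
# Hypothetical helper; it does not know or check PanACoTA's real output layout.
summarize_outdir() {
    dir=$1
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    for sub in "$dir"/*/; do
        # If DIR has no subdirectories the glob stays literal, so skip it.
        [ -d "$sub" ] || continue
        # tr strips the padding that some wc implementations add.
        count=$(find "$sub" -type f | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$count" "$sub"
    done
}

# Summarize the run above if its output folder is present.
if [ -d outdir ]; then
    summarize_outdir outdir
fi
```

A subdirectory that reports 0 files is a reasonable place to start reading the log files.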

Reference

PanACoTA: a modular tool for massive microbial comparative genomics

Amandine Perrin, Eduardo P.C. Rocha

NAR Genom Bioinform. 2021 Mar; 3(1): lqaa106.