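PanACoTA is a modular pipeline, so you can work through genome preparation, quality checking and filtering, annotation, pangenome computation, definition of core and persistent genes, and phylogenetic analysis one module at a time. Alternatively, the all subcommand (described in the PanACoTA documentation) runs the whole process in one go. Continuing from yesterday's post, today I walk through how to use the all subcommand.
First, activate the environment created in yesterday's post.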
conda activate panacota
> PanACoTA all -h
usage: PanACoTA all [-c CONFIGFILE] -o OUTDIR [--threads THREADS] [-T NCBI_SPECIES_TAXID] [-s NCBI_SPECIES] [-l LEVELS] [--cutn CUTN] [--l90 L90] [--nbcont NBCONT] [--prodigal] -n NAME [-i MIN_ID]
[--tol TOL] [-Mu] [-X] [--soft {fasttree,fastme,quicktree,iqtree,iqtree2}] [-v] [-q] [-h]
 ___                 _____  ___         _____  _____
(  _`\              (  _  )(  _`\      (_   _)(  _  )
| |_) )  _ _   ___  | (_) || ( (_)   _   | |  | (_) |
| ,__/'/'_` )/' _ `\|  _  || |  _  /'_`\ | |  |  _  |
| |   ( (_| || ( ) || | | || (_( )( (_) )| |  | | | |
(_)   `\__,_)(_) (_)(_) (_)(____/'`\___/'(_)  (_) (_)
Large scale comparative genomics tools
-------------------------------------------
=> Run all PanACoTA modules
General arguments:
-c CONFIGFILE Path to your configuration file, defining values of parameters.
-o OUTDIR Path to your output folder, where all results from all 6 modules will be saved.
--threads THREADS Specify how many threads can be used (default=1)
'prepare' module arguments:
-T NCBI_SPECIES_TAXID
Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. A comma-separated list of taxid can also be provided.
-s NCBI_SPECIES Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")
-l LEVELS, --assembly_level LEVELS
Assembly levels of genomes to download (default: all). Possible levels are: 'all', 'complete', 'chromosome', 'scaffold', 'contig'.You can also provide a comma-separated
list of assembly levels. For ex: 'complete,chromosome'
Common arguments to 'prepare' and 'annotate' modules:
--cutn CUTN By default, each genome will be cut into new contigs when at least 5 'N' in a row are found in its sequence. If you don't want to cut genomes into new contigs when there
are rows of 'N', put 0 to this option. If you want to cut from a different number of 'N' in a row, put this value to this option.
--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.
--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.
'annotate' module arguments:
--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.
-n NAME Choose a name for your annotated genomes. This name should contain 4 alphanumeric characters. Generally, they correspond to the 2 first letters of genus, and 2 first
letters of species, e.g. ESCO for Escherichia Coli.
'pangenome' module arguments:
-i MIN_ID Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.
'corepers' module arguments:
--tol TOL min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the
minimum number of genomes will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.
-Mu Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to
allow exactly 1 member in 'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not
compatible.
-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is
not compatible with -M (which is allowing multigenic families: having several members in any number of genomes).
'tree' module arguments:
--soft {fasttree,fastme,quicktree,iqtree,iqtree2}
Choose with which software you want to infer the phylogenetic tree. Default is IQtree.
Others:
-v, --verbose Increase verbosity in stdout/stderr.
-q, --quiet Do not display anything to stdout/stderr. log files will still be created.
-h, --help show this help message and exit
For more details, see PanACoTA documentation.
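A few of the numeric rules in the help text are easier to follow with a worked example. The following is a minimal Python sketch, not PanACoTA's own code. It assumes that L90 means the smallest number of contigs that together cover at least 90% of the genome, that --cutn splits on runs of uppercase 'N' only, and that the --l90 and --nbcont limits are inclusive. All function names are made up for illustration.

```python
import math
import re


def split_on_n(sequence, cutn=5):
    """Split a sequence wherever at least `cutn` consecutive 'N' occur.

    cutn=0 disables splitting, mirroring the --cutn description.
    """
    if cutn == 0:
        return [sequence]
    pieces = re.split("N{%d,}" % cutn, sequence)
    return [p for p in pieces if p]  # drop empty pieces at the ends


def l90(contig_lengths):
    """Smallest number of contigs covering >= 90% of the genome (assumed definition)."""
    total = sum(contig_lengths)
    covered = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        if covered * 10 >= total * 9:  # integer form of covered >= 0.9 * total
            return count
    return 0  # no contigs at all


def keep_genome(contig_lengths, max_l90=100, max_nbcont=999):
    """Apply the --l90 and --nbcont limits (assumed to be inclusive)."""
    return l90(contig_lengths) <= max_l90 and len(contig_lengths) <= max_nbcont


def min_genomes(tol, n_genomes, use_floor=False):
    """Genomes that must carry a family: ceil(tol*N), or floor(tol*N) on request."""
    value = tol * n_genomes
    return math.floor(value) if use_floor else math.ceil(value)


print(split_on_n("ACGTNNNNNNACGTNNAC"))   # ['ACGT', 'ACGTNNAC']
print(l90([50, 30, 10, 5, 5]))            # 3
print(keep_genome([50, 30, 10, 5, 5], max_l90=2))  # False
print(min_genomes(0.9, 25), min_genomes(0.9, 25, use_floor=True))  # 23 22
```

With 25 genomes and a tol of 0.9, the ceil rule asks for 23 genomes and the floor rule for 22, which is the whole difference between the two rounding modes described under --tol.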
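Procedure
With the all subcommand you can set parameters through command-line options, but a config file that lists every parameter allows finer tuning. The source repository contains an Examples directory with a sample config file, so clone the repository to get it.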
git clone https://github.com/gem-pasteur/PanACoTA.git
> ls -l PanACoTA/Examples/input_files/
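configfile.ini lists every parameter that can be set through the config file. Let's take a look.
Open PanACoTA/Examples/input_files/configfile.ini.
Uncomment a line and fill in a number or string, and the parameter takes effect. I activated a few of the parameters.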
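To check which parameters are actually active after editing, the file can be read with Python's standard configparser. This is only a sketch: the section names and keys in the sample text below are assumptions modelled on the option names in the help output, not copied from the real configfile.ini, and it assumes that commented-out lines start with '#'.

```python
import configparser

# Hypothetical excerpt: one parameter still commented out, two activated.
SAMPLE = """\
[prepare]
# cutn = 5
l90 = 80

[corepers]
tol = 0.95
"""


def active_parameters(ini_text):
    """Return {section: {key: value}} for every uncommented setting."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {section: dict(parser[section]) for section in parser.sections()}


print(active_parameters(SAMPLE))
# {'prepare': {'l90': '80'}, 'corepers': {'tol': '0.95'}}
```

The commented-out cutn line does not appear in the result, so only the settings that were explicitly activated show up. To inspect a real file, parser.read(path) can be used in place of read_string.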
Run it.
PanACoTA all -c configfile.ini -o outdir -n test
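If the run is launched from a script, the same command line can be assembled in Python. The sketch below uses only the -c, -o and -n options shown above. The helper names are hypothetical, and the name check simply follows the help text, which says the name should contain 4 alphanumeric characters.

```python
import re
import shutil
import subprocess


def build_all_command(configfile, outdir, name):
    """Build the argument list for 'PanACoTA all' with the options used above."""
    if not re.fullmatch(r"[A-Za-z0-9]{4}", name):
        raise ValueError("name should be 4 alphanumeric characters, got %r" % name)
    return ["PanACoTA", "all", "-c", configfile, "-o", outdir, "-n", name]


def run_all(configfile, outdir, name):
    """Run the pipeline; raises if PanACoTA is missing or exits with an error."""
    command = build_all_command(configfile, outdir, name)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("PanACoTA not found on PATH; activate the conda environment first")
    return subprocess.run(command, check=True)


print(build_all_command("configfile.ini", "outdir", "test"))
# ['PanACoTA', 'all', '-c', 'configfile.ini', '-o', 'outdir', '-n', 'test']
```

Passing the arguments as a list avoids shell quoting problems, and check=True makes a failed run raise an exception instead of passing silently.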
Output
outdir/
The run completed successfully.
Citation
PanACoTA: a modular tool for massive microbial comparative genomics
Amandine Perrin, Eduardo P.C. Rocha
NAR Genom Bioinform. 2021 Mar; 3(1): lqaa106.