2021-09-15

MMseqs2 コマンド其の3 既存のデータベースをダウンロードするmmseqs databasesコマンド

database download

MMseqs2には非常に多くの機能があります。今回はmmseqs databasesコマンドを試します。mmseqs databasesを使うと、UniProtやGTDB、NCBI nr/ntなどからMMseqs2のデータベースとしてビルド済みのデータベースをダウンロードして、MMseqs2によるホモロジーサーチや分類群のアサイン（*1）に使用することができます。

インストール

以前の記事を参照

> mmseqs -h

# mmseqs -h

MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast,

parallelized protein sequence searches and clustering of huge protein sequence data sets.

Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).

MMseqs2 Version: 13.45111

usage: mmseqs <command> [<args>]

Easy workflows for plain text input/output

easy-search Sensitive homology search

easy-linsearch Fast, less sensitive homology search

easy-cluster Slower, sensitive clustering

easy-linclust Fast linear time cluster, less sensitive clustering

easy-taxonomy Taxonomic classification

easy-rbh Find reciprocal best hit

Main workflows for database input/output

search Sensitive homology search

linsearch Fast, less sensitive homology search

map Map nearly identical sequences

rbh Reciprocal best hit search

linclust Fast, less sensitive clustering

cluster Slower, sensitive clustering

clusterupdate Update previous clustering with new sequences

taxonomy Taxonomic classification

Input database creation

databases List and download databases

createdb Convert FASTA/Q file(s) to a sequence DB

createindex Store precomputed index on disk to reduce search overhead

createlinindex Create linsearch index

convertmsa Convert Stockholm/PFAM MSA file to a MSA DB

tsv2db Convert a TSV file to any DB

tar2db Convert content of tar archives to any DB

msa2profile Convert a MSA DB to a profile DB

Handle databases on storage and memory

compress Compress DB entries

decompress Decompress DB entries

rmdb Remove a DB

mvdb Move a DB

cpdb Copy a DB

lndb Symlink a DB

unpackdb Unpack a DB into separate files

touchdb Preload DB into memory (page cache)

Unite and intersect databases

createsubdb Create a subset of a DB from list of DB keys

concatdbs Concatenate two DBs, giving new IDs to entries from 2nd DB

splitdb Split DB into subsets

mergedbs Merge entries from multiple DBs

subtractdbs Remove all entries from first DB occurring in second DB by key

Format conversion for downstream processing

convertalis Convert alignment DB to BLAST-tab, SAM or custom format

createtsv Convert result DB to tab-separated flat file

convert2fasta Convert sequence DB to FASTA format

result2flat Create flat file by adding FASTA headers to DB entries

createseqfiledb Create a DB of unaligned FASTA entries

taxonomyreport Create a taxonomy report in Kraken or Krona format

Sequence manipulation/transformation

extractorfs Six-frame extraction of open reading frames

extractframes Extract frames from a nucleotide sequence DB

orftocontig Write ORF locations in alignment format

reverseseq Reverse (without complement) sequences

translatenucs Translate nucleotides to proteins

translateaa Translate proteins to lexicographically lowest codons

splitsequence Split sequences by length

masksequence Soft mask sequence DB using tantan

extractalignedregion Extract aligned sequence region from query

Result manipulation

swapresults Transpose prefilter/alignment DB

result2rbh Filter a merged result DB to retain only reciprocal best hits

result2msa Compute MSA DB from a result DB

result2dnamsa Compute MSA DB with out insertions in the query for DNA sequences

result2stats Compute statistics for each entry in a DB

filterresult Pairwise alignment result filter

offsetalignment Offset alignment by ORF start position

proteinaln2nucl Transform protein alignments to nucleotide alignments

result2repseq Get representative sequences from result DB

sortresult Sort a result DB in the same order as the prefilter or align module

summarizealis Summarize alignment result to one row (uniq. cov., cov., avg. seq. id.)

summarizeresult Extract annotations from alignment DB

Taxonomy assignment

createtaxdb Add taxonomic labels to sequence DB

createbintaxonomy Create binary taxonomy from NCBI input

addtaxonomy Add taxonomic labels to result DB

taxonomyreport Create a taxonomy report in Kraken or Krona format

filtertaxdb Filter taxonomy result database

filtertaxseqdb Filter taxonomy sequence database

aggregatetax Aggregate multiple taxon labels to a single label

aggregatetaxweights Aggregate multiple taxon labels to a single label

lcaalign Efficient gapped alignment for lca computation

lca Compute the lowest common ancestor

majoritylca Compute the lowest common ancestor using majority voting

Multi-hit search

multihitdb Create sequence DB for multi hit searches

multihitsearch Search with a grouped set of sequences against another grouped set

besthitperset For each set of sequences compute the best element and update p-value

combinepvalperset For each set compute the combined p-value

mergeresultsbyset Merge results from multiple ORFs back to their respective contig

Prefiltering

prefilter Double consecutive diagonal k-mer search

ungappedprefilter Optimal diagonal score search

kmermatcher Find bottom-m-hashed k-mer matches within sequence DB

kmersearch Find bottom-m-hashed k-mer matches between target and query DB

Alignment

align Optimal gapped local alignment

alignall Within-result all-vs-all gapped local alignment

transitivealign Transfer alignments via transitivity

rescorediagonal Compute sequence identity for diagonal

alignbykmer Heuristic gapped local k-mer based alignment

Clustering

clust Cluster result by Set-Cover/Connected-Component/Greedy-Incremental

clusthash Hash-based clustering of equal length sequences

mergeclusters Merge multiple cascaded clustering steps

Profile databases

result2profile Compute profile DB from a result DB

msa2result Convert a MSA DB to a profile DB

msa2profile Convert a MSA DB to a profile DB

profile2pssm Convert a profile DB to a tab-separated PSSM file

profile2consensus Extract consensus sequence DB from a profile DB

profile2repseq Extract representative sequence DB from a profile DB

convertprofiledb Convert a HH-suite HHM DB to a profile DB

Profile-profile databases

enrich Boost diversity of search result

result2pp Merge two profile DBs by shared hits

profile2cs Convert a profile DB into a column state sequence DB

convertca3m Convert a cA3M DB to a result DB

expandaln Expand an alignment result based on another

expand2profile Expand an alignment result based on another and create a profile

Utility modules to manipulate DBs

view Print DB entries given in --id-list to stdout

apply Execute given program on each DB entry

filterdb DB filtering by given conditions

swapdb Transpose DB with integer values in first column

prefixid For each entry in a DB prepend the entry key to the entry itself

suffixid For each entry in a DB append the entry key to the entry itself

renamedbkeys Create a new DB with original keys renamed

Special-purpose utilities

diffseqdbs Compute diff of two sequence DBs

summarizetabs Extract annotations from HHblits BLAST-tab-formatted results

gff2db Extract regions from a sequence database based on a GFF3 file

maskbygff Mask out sequence regions in a sequence DB by features selected from a GFF3 file

convertkb Convert UniProtKB data to a DB

summarizeheaders Summarize FASTA headers of result DB

nrtotaxmapping Create taxonomy mapping for NR database

extractdomains Extract highest scoring alignment regions for each sequence from BLAST-tab file

countkmer Count k-mers

Bash completion for modules and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".

Include the location of the MMseqs2 binary in your "$PATH" environment variable.

#Version: 13.45111

>mmseqs databases

# mmseqs databases

usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]

Name Type Taxonomy Url

- UniRef100 Aminoacid yes https://www.uniprot.org/help/uniref

- UniRef90 Aminoacid yes https://www.uniprot.org/help/uniref

- UniRef50 Aminoacid yes https://www.uniprot.org/help/uniref

- UniProtKB Aminoacid yes https://www.uniprot.org/help/uniprotkb

- UniProtKB/TrEMBL Aminoacid yes https://www.uniprot.org/help/uniprotkb

- UniProtKB/Swiss-Prot Aminoacid yes https://uniprot.org

- NR Aminoacid yes https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA

- NT Nucleotide - https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA

- GTDB Aminoacid yes https://gtdb.ecogenomic.org

- PDB Aminoacid - https://www.rcsb.org

- PDB70 Profile - https://github.com/soedinglab/hh-suite

- Pfam-A.full Profile - https://pfam.xfam.org

- Pfam-A.seed Profile - https://pfam.xfam.org

- Pfam-B Profile - https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released

- CDD Profile - https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

- eggNOG Profile - http://eggnog5.embl.de

- dbCAN2 Profile - http://bcb.unl.edu/dbCAN2

- SILVA Nucleotide yes https://www.arb-silva.de

- Resfinder Nucleotide - https://cge.cbs.dtu.dk/services/ResFinder

- Kalamari Nucleotide yes https://github.com/lskatz/Kalamari

options:

--compressed INT Write compressed output [0]

--threads INT Number of CPU-cores used (all by default) [12]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

- Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, 2020.11.27.401018 (2020)

Show an extended list of options by calling 'mmseqs databases -h'.

#mmseqs databases -hで詳細なヘルプ

Taxonomy欄でyesとなっているデータベースは、MMseqs2のデータベースであるseqTaxDBに完全に対応しており、各データベースのキーとそのtaxon IDのマッピングファイルが含まれている。つまり、対応したデータベースをダウンロードすれば、そのままクエリの配列に分類学のラベルを割り当てることができるために使用できる。MMseqs2のseqTaxDBに対して配列を検索した結果のデータベースは、taxonomyResultと呼ばれる（マニュアルより）。

実行方法

１、データベースのダウンロード

ここではseqTaxDBに対応しており、データベースサイズが小さいSILVAを使う。

mkdir tmp
mmseqs databases SILVA SILVA_database tmp

出力

f:id:kazumaxneo:20210915120557p:plain

２、クエリの配列（ここでは塩基配列）をデータベースに変換する。

mmseqs createdb inout.fasta queryDB

読み込みを高速化するために、追加でtargetDB のインデックスファイルを計算することもできる（mmseqs createindex queryDB tmp search-type 3）。

３、一時ディレクトリの作成（高いI/Oが必要なので大規模な検索ではSSDなどを使う。ただし、tmpには十分な空き容量がないといけない）。

ここではカレントに作成する。

mkdir tmp

４、ここでは塩基配列のクエリを塩基配列のデータベースに対して検索。

mmseqs search queryDB SILVA_database resultDB tmp --serach-type 3

-s sensitivity: 1.0 faster; 4.0 fast default; 7.5 sensitive (default 5.7)

５、結果のデータベースをBLASTタブ形式のファイルに変換する。

mmseqs convertalis queryDB SILVA_database resultDB result.m8

引用

Fast and sensitive taxonomic assignment to metagenomic contigs
M Mirdita, M Steinegger, F Breitwieser, J Söding, E Levy Karin
Bioinformatics, Published: 18 March 2021

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Steinegger M, Söding J

Nat Biotechnol. 2017 Nov;35(11):1026-1028

MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.
Hauser M, Steinegger M, Söding J

Bioinformatics. 2016 May 1;32(9):1323-30.

2021-09-12

（Prokaryotes）ドラフトゲノムのポリッシングを行う Polypolish

polish bacteria repetitive sequences 高速なツール 2021 Preprint

2021 10/21 論文引用

Githubより

Polypolishはショートリードによるゲノムアセンブリを研磨するツールです。このカテゴリーの他のツールとは異なり、Polypolishは各リードが（単一の最適な位置ではなく）すべての可能な位置にアラインメントされたSAMファイルを使用します。これにより、他のアラインメントベースのポリッシャーでは修正できない繰り返し領域のエラーを修正することができます。

wiki

https://github.com/rrwick/Polypolish/wiki

2024/01/17

I've released a new version of Polypolish, my short-read polisher for long-read bacterial genomes: https://t.co/6eWZuWsNkp

Polypolish is now 100% Rust for better performance and easier installation, and I've added --careful for when you REALLY don't want false positives.
— Ryan Wick (@rrwick) 2024年1月17日

When you align short reads to a long-read genome assembly in the 'normal' one-alignment-per-read manner, you often get no coverage over errors in repeats. This is because reads will preferentially align to other error-free instances of the repeat.
(3/12) pic.twitter.com/30b2yII2uP
— Ryan Wick (@rrwick) October 18, 2021

But greedily combining different short-read polishers achieved MUCH lower error counts than were possible with any single tool. Clearly Polypolish can fix errors that other tools cannot and vice-versa.
(6/12) pic.twitter.com/shoM7mSKHX
— Ryan Wick (@rrwick) October 18, 2021

I've just released (during #MicroSeq2021) a new short-read polishing tool for fixing errors in long-read bacterial genome assemblies: Polypolish!https://t.co/6eWZuWJQmp
(1/8) pic.twitter.com/2EvI6s6Wya
— Ryan Wick (@rrwick) 2021年9月9日

ホモポリマーのエラー修正

https://github.com/rrwick/Polypolish/wiki/Alignment-trimming

インストール

Github

git clone https://github.com/rrwick/Polypolish.git
cd Polypolish
cargo build --release
cd target/release/

> ./polypolish

f:id:kazumaxneo:20210912202646p:plain

(version 0.4.3)

実行方法

１、ドラフトゲノムアセンブリのfastaにショートリードをマッピングする。

bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam

２、polypolishの実行

polypolish draft.fasta alignments_1.sam alignments_2.sam > polished.fasta

動作は非常に高速。5-Mbの細菌ゲノムでテストすると10秒程度で終了した。

FAQも用意されていて、真核生物ゲノムに使えるのかなど書かれています。興味ある方はアクセスしてみて下さい。

引用

https://github.com/rrwick/Polypolish

2021 10/21

Polypolish: short-read polishing of long-read bacterial genome assemblies
Ryan R. Wick, Kathryn E. Holt

bioRxiv, Posted October 16, 2021

エキソームのバリアント解析パイプライン EXOME-pipeline

snakemake human exome automated pipeline Variant annotations in VCF format SNV small indel

レポジトリより
このプロジェクトは、エクソームシーケンス用のSnakemakeを使った解析パイプラインです。
Illumina HiSeqからのヒトエクソームシーケンシングで広くテストされていますが、必要なリソースファイルを手動でダウンロードすれば、ほとんどのシステムや他の多くの生物種で動作するはずです。

パイプラインの主な手順は以下の通りです。

bwaによるリファレンスゲノムへのリードのアラインメント
picardによる重複リードのマーキング
GATK IndelRealignerによるindel周辺のリアラインメント
GATK BaseRecalibratorによる塩基の再校正
GATK HaplotypeCallerによるバリアントコーリング
gnomADによるpopulation frequenciesのアノテーション
Variant Effect Predictorによるアノテーション

いつものように試してみます。GATKはv3.7が使用されています。

インストール

condaで付属の.ymlファイルから仮想環境を作って導入した（ubuntu18.04）。

Github

git clone "https://gitlab.univ-nantes.fr/bird_pipeline_registry/exome-pipeline.git"
cd exome-pipeline
mamba env create -f CONDA/envExport.yml
conda activate exome

テストラン

CONFIGに保存されているjsonファイルを指定する。ランが終わるまでには10分程度かかる。

snakemake -C pro="CONFIG/project.test.json" ref="CONFIG/references.test.json" -rp testPipeline -j 4

出力

f:id:kazumaxneo:20210911203705p:plain

BED/

f:id:kazumaxneo:20210911203912p:plain

Samples/Sample1/

f:id:kazumaxneo:20210911203735p:plain

Samples/Sample1/VCF/

f:id:kazumaxneo:20210911203822p:plain

VCF/

f:id:kazumaxneo:20210911203822p:plain

実際にランするには　プロジェクト情報のjsonファイルとサンプル情報のJSONファイルを提供する必要があります。

CONFIG/project.test.json

f:id:kazumaxneo:20210911204223p:plain

CONFIG/references.test.json

f:id:kazumaxneo:20210911204323p:plain

VCFはbgzipで圧縮してindexが付いている必要があります。ゲノムfastaファイルの他に.dictファイルも必要にります（"samtools dict genome.fasta -o genome.dict"）。

サンプル情報のjsonファイルを作成するスクリプトも提供されています。ランするには、パイプライン実行時の出力ディレクトリのパス、プロジェクト名、入力fastqのパスを指定します。

python SCRIPTS/make_exome_project.py -o <path of output directory> -n <project name> -i <path to folder containing fastq files> > project.json

fastqはイルミナの命名則に従っている必要があります。

例；xxx_L001_R{1,2}_001.fastq.gz

このスクリプトをランするとCONFIG/project.jsonのようなJSONファイルが出力されるので、あとはreferences.jsonのテンプレートに自分で用意したリファレンスゲノムとアノテーションソースのパスを追加し、変数名をセットすればランできます。exampleのファイルを確認して下さい。

引用

bird_pipeline_registry / EXOME-pipeline · GitLab

2021-09-10

ゲノムアセンブリの品質、完全性、フェーズ評価を行う Merqury

2020 Genome Biology evaluation tool misassembly Trio binning assembly assembly large genome

　最近のロングリードアセンブリは、利用可能なリファレンスゲノムの品質と完全性を上回ることが多く、その検証は困難を極めている。ここでは、効率的なk-merセット操作に基づいてリファレンスフリーにアセンブリを評価する新しいツール、Merquryを紹介する。Merquryは、de novo アセンブリのk-merを、未アセンブリの高精度リードに含まれるk-merと比較することで、ベースレベルの精度と完全性を推定する。Triosの場合、Merquryはハプロタイプ固有の精度、完全性、フェーズブロックの連続性、スイッチエラーも評価できる。評価のために、k-merスペクトルプロットなどの複数の視覚化を生成することができる。ヒトと植物の両方のゲノムにおいて、Merquryがアセンブリ検証のための高速で堅牢な手法であることを実証する。

"Merqury: reference-free quality and phasing assessment for genome assemblies" from @ArangRhie Everything you need for k-mer-based assembly validation! https://t.co/zq5VZ7j8j7 pic.twitter.com/7TEFvjAgYZ
— Adam Phillippy (@aphillippy) March 17, 2020

Githubより

ゲノムアセンブリプロジェクトでは、アセンブルされた個体のイルミナ全ゲノムシークエンスリードが利用できることが多い。このリードセットからのk-merスペクトラムは、高品質のリファレンスを必要とせずに、アセンブリ品質の独立した評価に使用できる。Merquryは、この目的のために一連のツールを提供しています。

インストール

condaでpython3.9の仮想環境を作って導入した（ubuntu18.04）。

依存

gcc 7.4 or higher (for installing meryl)
meryl v1.3
Java run time environment (JRE)
R with argparse, ggplot2, and scales (recommend R 4.0.3+)
bedtools
samtools

Github

#conda(link)
mamba create -n merqury -c conda-forge -c bioconda merqury openjdk=11
conda activate merqury
mamba install -c conda-forge -c bioconda merqury

> merqury.sh -h

$ merqury.sh -h
Usage: merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
<read-db.meryl> : k-mer counts of the read set
<mat.meryl> : k-mer counts of the maternal haplotype (ex. mat.hapmer.meryl)
<pat.meryl> : k-mer counts of the paternal haplotype (ex. pat.hapmer.meryl)
<asm1.fasta> : Assembly fasta file (ex. pri.fasta, hap1.fasta or maternal.fasta)
[asm2.fasta] : Additional fasta file (ex. alt.fasta, hap2.fasta or paternal.fasta)
*asm1.meryl and asm2.meryl will be generated. Avoid using the same names as the hap-mer dbs
<out> : Output prefix
Arang Rhie, 2020-01-29. arrhie@gmail.com

実行方法

merqury.sh read-db.meryl mat.meryl asm1.fasta

引用

Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies
Arang Rhie, Brian P. Walenz, Sergey Koren & Adam M. Phillippy
Genome Biology volume 21, Article number: 245 (2020)

パンゲノム解析ツール PanACoTAのallコマンドを使う

pan-genome

　PanACoTAはモジュール方式のパイプラインなので、ゲノムの準備、品質チェックとフィルタリング、アノテーション、パンゲノムの計算、コア・persistant遺伝子の定義、系統解析まで順番に進めることができますが、allコマンド（説明）を使えば、全部のプロセスをまとめてランすることもできます。昨日に続いて、今日はこのallコマンドを使う手順を確認してみます。

昨日の手順で作った環境をアクティベートしておきます。

conda activate panacota

> PanACoTA all -h

# PanACoTA all -h

usage: PanACoTA all [-c CONFIGFILE] -o OUTDIR [--threads THREADS] [-T NCBI_SPECIES_TAXID] [-s NCBI_SPECIES] [-l LEVELS] [--cutn CUTN] [--l90 L90] [--nbcont NBCONT] [--prodigal] -n NAME [-i MIN_ID]

[--tol TOL] [-Mu] [-X] [--soft {fasttree,fastme,quicktree,iqtree,iqtree2}] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Run all PanACoTA modules

General arguments:

-c CONFIGFILE Path to your configuration file, defining values of parameters.

-o OUTDIR Path to your output folder, where all results from all 6 modules will be saved.

--threads THREADS Specify how many threads can be used (default=1)

'prepare' module arguments:

-T NCBI_SPECIES_TAXID

Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. A comma-separated list of taxid can also be provided.

-s NCBI_SPECIES Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")

-l LEVELS, --assembly_level LEVELS

Assembly levels of genomes to download (default: all). Possible levels are: 'all', 'complete', 'chromosome', 'scaffold', 'contig'.You can also provide a comma-separated

list of assembly levels. For ex: 'complete,chromosome'

Common arguments to 'prepare' and 'annotate' modules:

--cutn CUTN By default, each genome will be cut into new contigs when at least 5 'N' in a row are found in its sequence. If you don't want to cut genomes into new contigs when there

are rows of 'N', put 0 to this option. If you want to cut from a different number of 'N' in a row, put this value to this option.

--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.

--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.

'annotate' module arguments:

--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.

-n NAME Choose a name for your annotated genomes. This name should contain 4 alphanumeric characters. Generally, they correspond to the 2 first letters of genus, and 2 first

letters of species, e.g. ESCO for Escherichia Coli.

'pangenome' module arguments:

-i MIN_ID Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.

'corepers' module arguments:

--tol TOL min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the

minimum number of genomes will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.

-Mu Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to

allow exactly 1 member in 'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not

compatible.

-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is

not compatible with -M (which is allowing multigenic families: having several members in any number of genomes).

'tree' module arguments:

--soft {fasttree,fastme,quicktree,iqtree,iqtree2}

Choose with which software you want to infer the phylogenetic tree. Default is IQtree.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

手順

allコマンドでは、オプションでパラメータを指定することもできますが、全てのパラメータを記載したconfigファイルを使った方が細かくパラメータを調整できます。レポジトリのソースコード中にexampleディレクトリがあるので、レポジトリをcloneして取得します。

git clone https://github.com/gem-pasteur/PanACoTA.git

> ls -l PanACoTA/Examples/input_files/

f:id:kazumaxneo:20210909003532p:plain

configfile.iniがconfigで指定できる全てのパラメータです。見てみます。

PanACoTA/Examples/input_files/configfile.iniを開く

f:id:kazumaxneo:20210909120804p:plain

コメントを外して数値や文字を記載すれば認識します。パラメターのいくつかをアクティブにしてみました。

f:id:kazumaxneo:20210909120905p:plain

ランします。

PanACoTA all -c configfile.ini -o outdir -n test

出力

outdir/

f:id:kazumaxneo:20210909011334p:plain

上手くラン出来ています。

引用

PanACoTA: a modular tool for massive microbial comparative genomics

Amandine Perrin, Eduardo P.C. Rocha

NAR Genom Bioinform. 2021 Mar; 3(1): lqaa106.

2021-09-08

大規模な微生物の比較ゲノミクスのためのモジュラー式のツール PanACoTA

2021 Preprint docker 2020 NAR Genomics and Bioinformatics download Singularity pan-genome 2003

2021 9/8 修正

2021 9/9 誤字修正

2021 9/10 prokkaのバージョンによるエラー修正 (依存するライブラリの関係でpython3.7の環境に導入するように修正した), --prodigalのオプション消去

2021 10/15 docker imageのコマンド追記

2021 10/18 annotate追記

2021 10/28 pangenomeコマンド追記

　低コストのシークエンシングと数十万のゲノムが利用可能なことから、比較ゲノムは多くの微生物学者、遺伝学者、進化生物学者の基本的なツールキットとなっている。現在、多くの細菌種においてGenBank RefSeqリファレンスデータベースで100以上のゲノムが公開されており、中には1万以上のゲノムが公開されているものもある。この傾向は、シーケンシングにかかるコストの低下、ロングロード技術の利用可能性、診断や疫学のためのクリニックでの全ゲノムシーケンシングの利用などが進むにつれ、さらに増加すると考えられる。その結果、利用可能なデータを利用しようとする研究者は、解析すべき非常に大量のデータに直面している。比較ゲノム解析は、過去20年の間に細菌ゲノムの組織と進化の理解に重要な貢献をしてきた。比較ゲノムは疫学研究の標準的なツールとなっており、一連の菌株に共通する遺伝子（コアゲノムまたはpersistentゲノム(*１　下に補足説明)）を解析することで、関心のあるクローンの拡大を追跡する際に比類のない精度が得られる。診療での日常的なシークエンシングの使用には、単一の種から数千、近い将来には数百万のゲノムを検索するための迅速で信頼性の高い解析ツールがさらに必要になるだろう。集団遺伝学もまた、このような豊富なデータの恩恵を受けている。なぜなら、遺伝的に獲得した突然変異の起源と運命を詳細に追跡して、それが適応過程や突然変異過程の何を明らかにしているかを理解できるからである。最後に、ゲノムワイドな関連性研究は、一塩基多型や遺伝子レパートリーの変異を説明するために、最近、細菌遺伝学にも適用されている。これらの研究は、生物学者が関心のある表現型の遺伝的基盤を特定するのに役立つことが期待されている。細菌ゲノムにおける高い遺伝的連結性を考えると、これらの研究では、小さな影響を検出するために非常に大規模なデータセットを必要とする場合がある。より具体的には、reverse vaccinologyもまた、これらのパンゲノミクス手法の注目すべき応用であり、特定のクレードのコア表面露出タンパク質の中から新規の潜在的な抗原を同定するためのものである。

　大規模なゲノムデータセットの利用可能性は、研究者、特にバイオインフォマティクスの広範なトレーニングを受けていない研究者に大きな負担をかけている。最近では、類似度の高い配列に対して優れた想起率で配列の類似性を迅速に検索するためのツールが数多く開発されている。

　また、他のツールでは、配列類似度の高いファミリーに大量の配列を迅速にクラスタリングしたり、ゲノムのセットに共通するファミリーを取得したり、それらをアラインメントしたり、系統樹を作成したりする方法も提供されており、これは比較ゲノムの礎となっている。最近では、細菌のパンゲノムを計算するためのこれらのツールのいくつかを含む多くのプログラムが発表されている。これらのプログラムの多くは、非常に高速なプログラムを使用して、ファミリーのアラインメントやクラスタを計算する。いくつかは、DIAMOND、USEARCHおよびCD-HITのように、非常に高速であるために精度を犠牲にすることが知られているツールを使用している。後者は、現在パンゲノムを計算するための最もポピュラーなツールであるRoaryや、原核生物ゲノムの誤った自動アノテーションの影響を軽減することを目的とした非常に最近のツールであるPanarooなどによって使用されている。BPGAもまた、USEARCHやCD-HITを用いてタンパク質をクラスタリングし、いくつかの下流の解析を提供している。また、優れたグラフィカルインターフェースを持つPanXは、DIAMONDを用いて遺伝子間の類似性を検索する。

　さらに最近では、SonicParanoidはパンゲノムを構築するために高効率で正確なプログラムmmseqs2の使用を導入し、PPanGGOLiNは同じツールを使用してパンゲノムファミリーを頻度の観点から統計的に分類する方法を提供した。最近のプログラムの中には、PPanGGOLiNやPanarooのように、パンゲノムをさらに洗練させるためにグラフベースのアプローチを使用しているものもある。これに関しては、両ツールによる319個のKlebsiella pneumoniaeゲノムのデータセットの解析で、非常に類似した結果が得られている[ref.16]。PIRATEのような、遠いゲノム間のオルソログをクラスタリングするためのツールも開発された。しかし、これらのプログラムはすべて、ダウンロード、品質管理、アラインメント、系統推論など、比較ゲノムに不可欠な初期ステップと最終ステップの一部またはすべてを欠いている。これがPanACoTA（PANgenome with Annotations, COre identification, Tree and corresponding Alignments）の開発に拍車をかけた。公開されている膨大な量のゲノム情報を活用するためには、6つの主要な操作ブロックが必要である。(1) 特定のクレードのゲノムを自動的に収集する。コンティグの数が多すぎるドラフトゲノムを避けるために、ある程度の品質管理が必要である。また、計算コストや擬似複製によるバイアスを最小限に抑えるために、ゲノムが冗長になりすぎていないかどうかをチェックしておくと便利なことが多い。一方で、細菌種（または関連性のある分類学的組織）の観点から誤分類されたゲノムを排除するために、ゲノムがあまりにも無関係ではないことをチェックすることが重要である。(2) 先験的に統一された命名法とアノテーションを定義しておく。(3) ゲノムアセンブリ中の各遺伝子ファミリーの有無のパターンの行列であるパンゲノムを、正確、簡便、高速な方法で作成する。(4) パンゲノムを用いて、コア遺伝子やpersistentな遺伝子の集合を同定する。(5) コアまたはpersistentなゲノムの遺伝子ファミリーの複数のアラインメントを作成する。(6) 最後に、コア/persistent遺伝子のセットの系統を、合理的に正確に迅速に作成する。これらの4つのデータ、パンゲノム、コアゲノム、アラインメント、系統樹は、ほとんどの微生物比較ゲノム研究の基礎となっている。このプロセスの最後には、研究者は、関心のある問題に特化した、より詳細な解析を行うことができ、多くの場合、分類群の追加/除外、配列類似度の限界の変更、アラインメントの精度の向上、または異なる方法を用いた系統樹の再構築などの変更につながる。このような再定義は、パイプラインがモジュール化されており、プロセスのいくつかの重要なポイントで解析を再開始できるようになっていれば、より効率的に行うことができる。

　微生物比較ゲノミクスのためのパイプラインが現在利用可能であることを考慮して、モジュール化されており、セットアップが容易で、最先端のツールを使用し、中間結果の再利用が簡単にできるパイプラインを構築した。目標は、ある分類群からすべてのゲノムをダウンロードし、基本的な比較ゲノムを自動的に動作させることができるパイプラインを提供することだった。パイプラインは完全に単一言語であるPython v3で構築されており、将来のメンテナンスを容易にし、不要な動作を制限するために最新の方法を使用している。PanACoTAはオープンソースのGNU AGPLライセンスで自由に利用できる。ここでは、Klebsiella pneumoniaeの225個の完全長ゲノムと3980個の完全長ゲノムまたはドラフトゲノムの2つのデータセットの解析を用いて、この方法を説明する。この種は、多くのゲノムが利用可能であり、非常にオープンなパンゲノムであるため、著者らの目的には興味深い種である。最初のデータセットは、通常は配列の品質が高い場合を示しており、2番目のデータセットは、配列の品質が低かったり、完全なアセンブリがないために遺伝子が断片化されていたりするような非常に大規模なデータセットに対して、この方法がどのようにスケールアップしていくかを示している。手法の詳細は「方法」のセクションに、その使用方法と、2つのデータセットの主要なオプションとの関係でどのように変化するかの説明は「結果」のセクションに詳しく書かれている。

　PanACoTaは、以下の6つのセクションで説明する6つのシーケンシャルモジュールで実装され、以前のモジュールを必要とせずにモジュールを使用できるように設計されている。これにより、任意のステップで解析を開始または停止し、他のパラメータで解析を再実行することができる（論文図1参照）。

Documentation

https://aperrin.pages.pasteur.fr/pipeline_annotation/html-doc/

インストール

condaでpython3.9python3.7の仮想環境を作ったあと、pipで導入した（ubuntu18.04）。他のツールはcondaで導入した（注意；アドホックなやり方に過ぎず、インストールできるかどうか再現性がないかもしれません）。

Github

#conda(link)　ここではcondaの代わりにmambaを使用
mamba create -n panacota python=3.7 -y
conda activate panacota
#PanACoTAのインストール
#mamba install -c bioconda panacota -y
#prokka,mmseqs2,mashも必要（prokkaのバージョンによるエラーが報告されているので注意#9）
mamba install -c bioconda prokka==1.14.5 -y
mamba install -c bioconda mmseqs2 -y
mamba install -c bioconda mash -y
#iqtreeはversion1系をインストールする
mamba install -c bioconda iqtree=1.6 -y

#fasttree
mamba install -c bioconda/label/cf201901 fasttree -y

#PanACoTAはpipでもインストール可能（condaとおなじで依存するツールは別に導入する）
pip install panacota

#dockerとsingularityにも対応している
#docker(link)
docker pull gempasteur/panacota

> PanACoTA -h

$ PanACoTA -h

usage: PanACoTA [-h] [-V]

{all,prepare,annotate,pangenome,corepers,align,tree} ...

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

positional arguments:

{all,prepare,annotate,pangenome,corepers,align,tree}

all Run all PanACoTA modules

prepare Prepare draft dataset

annotate Quality control and annotation of genomes

pangenome Generate a pan-genome of your dataset

corepers Compute a Core or Persistent genome of your dataset

align Align Core/Persistent families

tree Infer phylogenetic tree based on core/persistent genome

optional arguments:

-h, --help show this help message and exit

-V, --version Print the version number and exit

For more details, see PanACoTA documentation.

> PanACoTA prepare -h

# PanACoTA prepare

usage: PanACoTA [-h] [-V] {all,prepare,annotate,pangenome,corepers,align,tree} ...

PanACoTA: error: As you did not put the '--norefseq' nor the '-M' option, it means that you want to download refseq (or genbank) genomes. But you did not provide any information, so PanACoTA cannot guess which species you want to download. Specify NCBI_taxid (-t), and/or NCBI species taxid (-T) and/or NCBI_species (-g) to download, or add one of the 2 options (--norefseq or -M) if you want to skip the 'download step'.

(base) root@81a7aab948d6:/home# PanACoTA prepare -h

usage: PanACoTA prepare [-T NCBI_SPECIES_TAXID] [-t NCBI_TAXID] [-g NCBI_SPECIES_NAME] [-s {refseq,genbank}] [-l LEVELS] [-o OUTDIR] [--tmp TMP_DIR] [--cutn CUTN] [--l90 L90] [--nbcont NBCONT] [--min_dist MIN_DIST]

[--max_dist MAX_DIST] [-p PARALLEL] [--norefseq] [-d DB_DIR] [-M] [--info INFO_FILE] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Prepare draft dataset

General arguments:

-T NCBI_SPECIES_TAXID

Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. This will download all sequences of this species and all its sub-species.A comma-separated list of species

taxids can also be provided. (Ex: -T 573 for Klebsiella pneumoniae)

-t NCBI_TAXID Taxid to download. This can be the taxid of a sub-species, or of a specific strain. A taxid of a subspecies will download all strains in this subspecies EXCEPT the ones which have a specific

taxid.A comma-separated list of taxids can also be provided.Ex: '-t 72407' will download all 'general' K. pneumoniae subsp. pneumoniae strains, and '-t 1123862' will download the strain K.

pneumoniae subsp. pneumoniae Kp13 (not included in -t 72407, as it is a strain of the subspecies with a specific taxid).

-g NCBI_SPECIES_NAME Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")

-s {refseq,genbank} NCBI section to download: all genbank, or only refseq (default)

-l LEVELS, --assembly_level LEVELS

Assembly levels of genomes to download (default: all). Possible levels are: 'all', 'complete', 'chromosome', 'scaffold', 'contig'.You can also provide a comma-separated list of assembly levels.

For ex: 'complete,chromosome'

-o OUTDIR Give the path to the directory where you want to save the downloaded database. In the given directory, it will create a folder 'Database_init' containing all fasta files that will be sent to the

control procedure, as well as a folder 'refseq' with all original compressed files downloaded from refseq. By default, this output dir name is the ncbi_species name if given, or ncbi_species_taxid

or ncbi_taxid otherwise.

--tmp TMP_DIR Specify where the temporary files (sequence split by rows of 'N', sequence with new contig names etc.) must be saved. By default, it will be saved in your out_dir/tmp_files.

--cutn CUTN By default, each genome will be cut into new contigs when at least 5 'N' in a row are found in its sequence. If you don't want to cut genomes into new contigs when there are rows of 'N', put 0 to

this option. If you want to cut from a different number of 'N' in a row, put this value to this option.

--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.

--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.

--min_dist MIN_DIST By default, genomes whose distance to the reference is not between 1e-4 and 0.06 are discarded. You can specify your own lower limit (instead of 1e-4) with this option.

--max_dist MAX_DIST By default, genomes whose distance to the reference is not between 1e-4 and 0.06 are discarded. You can specify your own lower limit (instead of 0.06) with this option.

-p PARALLEL, --threads PARALLEL

Run 'N' downloads in parallel (default=1). Put 0 if you want to use all cores of your computer.

Alternatives:

--norefseq If you already downloaded refseq genomes and do not want to check them, add this option to directly go to the next steps:quality control (L90, number of contigs...) and mash filter. Don't forget

to specify the db_dir (-d option) where you already have your genome sequences.

-d DB_DIR If your already downloaded sequences are not in the default directory (outdir/Database_init), you can specify here the path to those fasta files.

-M, --only-mash Add this option if you already downloaded complete and refseq genomes, and ran quality control (you have, a file containing all genome names, as well as their genome size, number of contigs and

L90 values). It will then get information on genomes quality from this file, and run mash steps.

--info INFO_FILE If you already ran the quality control, specify from which file PanACoTA can read this information, in order to proceed to the mash step. This file must contain at least 4 columns, tab separated,

with the following headers: 'to_annotate', 'gsize', 'nb_conts', 'L90'. Any other column will be ignored.

Others:

-v, --verbose Increase verbosity in stdout/stderr and log files.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA annotate -h

# PanACoTA annotate -h

usage: PanACoTA annotate [-d DB_PATH] -r RES_PATH [-l LIST_FILE] [-n NAME] [-Q] [--info FROM_INFO] [--prodigal] [--small] [--l90 L90] [--nbcont NBCONT] [--cutn CUTN] [--date DATE] [--tmp TMPDIR] [--annot_dir ANNOTDIR]

[-F] [--threads THREADS] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Quality control and annotation of genomes

Required arguments:

-d DB_PATH Path to folder containing all multifasta genome files

-r RES_PATH Path to folder where output annotated genomes must be saved

Optional arguments:

-l LIST_FILE File containing the list of genome filenames to annotate (1 genome per line). Each genome must be in multi-fasta format. You can specify the species name (4 characters) you want to give to some

genome, as well as the download (or any other reason) date of your choice. Format 'genome_name :: name.date'. name and date are optional. See doc for more information on this file format. If you

want to run this module from 'prepare_module' results, use '--info' instead.

-n NAME Choose a name for your annotated genomes. This name should contain 4 alphanumeric characters. Generally, they correspond to the 2 first letters of genus, and 2 first letters of species, e.g. ESCO

for Escherichia Coli.

-Q Add this option if you want only to do quality control on your genomes (cut at 5N if asked, calculate L90 and number of contigs and plot their distributions). This allows you to check which

genomes would be annotated with the given parameters, and to modify those parameters if you want, before you launch the annotation and formatting steps.

--info FROM_INFO If you already ran the 'prepare' data module, or already calculated yourself the L90 and number of contigs for each genome, you can give this information, to go directly to annotation and

formatting steps. This file contains at least 4 columns, tab separated, with the following headers: 'to_annotate', 'gsize', 'nb_conts', 'L90'. Any other column will be ignored.

--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.

--small If you use Prodigal to annotate genomes, if you sequences are too small (less than 20000 characters), it cannot annotate them with the default options. Add this option to use 'meta' procedure.

--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.

--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.

this option. If you want to cut from a different number of 'N' occurrences, put this value to this option.

--date DATE Specify the date (MMYY) to give to your annotated genomes. By default, will give today's date. The only requirement on the given date is that it is 4 characters long. You can use letters if you

want. But the common way is to use 4 digits, corresponding to MMYY.

--tmp TMPDIR Specify where the temporary files (sequence split by rows of 'N', sequence with new contig names etc.) must be saved. By default, it will be saved in your result_directory/tmp_files.

--annot_dir ANNOTDIR Specify in which directory the prokka/prodigal output files (1 folder per genome, called <genome_name>-[prokka, Prodigal]Res) must be saved. By default, they are saved in the same directory as

your temporary files (see --tmp option to change it).

-F, --force Force run: Add this option if you want to (re)run annotation and formatting steps for all genomes even if their result folder (for annotation step) or files (for format step) already exist:

override existing results. Without this option, if there already are results in the given result folder, the program stops. If there are no results, but prokka/prodigal folder already exists,

prokka/prodigal won't run again, and the formating step will use the already existing folder if correct, or skip the genome if there are problems in prokka/prodigal folder.

--threads THREADS Specify how many threads can be used (default=1)

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA pangenome -h

# PanACoTA pangenome -h

usage: PanACoTA pangenome -l LSTINFO_FILE -n DATASET_NAME -d DBPATH -o OUTDIR [-i MIN_ID] [-f OUTFILE] [-c {0,1,2}] [-s SPEDIR] [--threads THREADS] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Generate a pan-genome of your dataset

Required arguments:

-l LSTINFO_FILE File containing the list of all genomes to include in the pan-genome, 1 genome per line: it can be the LSTINFO-<list_file>.lst file of 'PanACoTA annotate' module.Here, only the first column (genome

name without extension) will be used. All proteins of all these genomes will be concatenated in a file called <dataset_name>.All.prt. The column header must be 'gembase_name'.

-n DATASET_NAME Name of the dataset which will be clustered (for example, SAEN1234 for 1234 Salmonella enterica genomes). This name will be used to name the protein databank, a well as the pangenome files.

-d DBPATH Path to the folder containing all protein files corresponding to the genomes of the dataset (output directory 'Proteins' of 'PanACoTA annotate' module).

-o OUTDIR Output directory, where all results must be saved (including tmp folder)

Optional arguments:

-i MIN_ID Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.

-f OUTFILE Use this option if you want to give the name of the pangenome output file (without path). Otherwise, by default, it is called PanGenome-

mmseq_<given_dataset_name>.All.prt_<information_on_parameters>.lst

-c {0,1,2} Choose the clustering mode: 0 for 'set cover', 1 for 'single-linkage', 2 for 'CD-Hit'. Default is 'single-linkage' (1)

-s SPEDIR use this option if you want to save the concatenated protein databank in another directory than the one containing all individual protein files ('Proteins' folder).

--threads THREADS add this option if you want to parallelize on several threads. Indicate on how many threads you want to parallelize. By default, it uses 1 thread. Put 0 if you want to use all threads of your

computer.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA corepers -h

# PanACoTA corepers -h

usage: PanACoTA corepers -p PANGENOME -o OUTPUTDIR [-t TOL] [-M] [-X] [-F] [-l LSTINFO_FILE] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Compute a Core or Persistent genome of your dataset

Required arguments:

-p PANGENOME PanGenome file (1 line per family, first column is fam number)

-o OUTPUTDIR Specify the output directory for your core/persistent genome.

Optional arguments:

-t TOL, --tol TOL min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the minimum number of genomes

will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.

-M Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to allow exactly 1 member in

'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not compatible.

-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is not compatible with -M

(which is allowing multigenic families: having several members in any number of genomes).

-F When you specify the '-tol' option, with a number lower than 1, you can add this option to use floor('tol'*N) as a minimum number of genomes instead of ceil('tol'*N) which is the default behavior.

-l LSTINFO_FILE By default, the core/persistent genome will include all genomes found in the given pangenome file. If you want to do a core/persistent genome on a subset of those genomes, give a file with this list

of genomes. This file must have 1 line per genome, only the first column (genome name without extension) will be used.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA align -h

# PanACoTA align -h

usage: PanACoTA align -c COREPERS -l LIST_GENOMES -n DATASET_NAME -d DBPATH -o OUTDIR [--threads THREADS] [-F] [-P] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Align Core/Persistent families

Required arguments:

-c COREPERS Core or persistent genome whose families must be aligned.

-l LIST_GENOMES File containing the list of all the genomes you want to align from their core/persistent families. 1 genome per line: it can be the LSTINFO-<list_file>.lst file of 'PanACoTA annotate' module. Here,

only the first column (genome name without extension) will be used. The final alignment file will contain 1 alignment per genome in this file.

-n DATASET_NAME Name of the dataset which will be aligned (for example, SAEN1234 for 1234 Salmonella enterica genomes). This name will be used to name the alignment file.

-d DBPATH Path to the folder containing the directories 'Proteins' and 'Genes', created by 'PanACoTA annotate'.

-o OUTDIR Output directory, where all results must be saved

Optional arguments:

computer.

-F, --force Force run: Add this option if you want to redo all alignments for all families, even if their result file already exists. Without this option, if an alignment file already exists, it will be used for

the next step. If you want to redo only a given alignment, just delete its file, without using this option.

-P Add this option if you also need the aa alignment of the concatenation of all persistent proteins. By default, PanACoTA only gives the nucleic alignment.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA tree -h

# PanACoTA tree -h

usage: PanACoTA tree -a ALIGNMENT -o OUTDIR [-s {fasttree,fastme,quicktree,iqtree,iqtree2}] [-b BOOT] [--threads THREADS] [-m MODEL] [-B] [--mem MEMORY] [-fast] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Infer phylogenetic tree based on core/persistent genome

Required arguments:

-a ALIGNMENT Alignment file in multi-fasta: each header will be a leaf of the inferred tree.

-o OUTDIR Directory where tree results will be saved.

Choose soft to use (default is IQtree2):

-s {fasttree,fastme,quicktree,iqtree,iqtree2}, --soft {fasttree,fastme,quicktree,iqtree,iqtree2}

Choose with which software you want to infer the phylogenetic tree. Default is IQtree2 (versions 2.x of IQtree). If you want version 1.x of IQtree, use '-s iqtree'

Optional arguments:

-b BOOT, --boot BOOT Indicate how many bootstraps you want to compute. By default, no bootstrap is calculated. For IQtree, it will use ultrafast bootstrap (>=1000).

computer. Not available with quicktree.

-m MODEL, --model MODEL

Choose your DNA substitution model. Default for FastTree and IQtree: GTR. Default for FastME: TN93. For FastTree, the choices are 'GTR' and 'JC'. For FastME, choices are: 'p-distance' (or 'p'),

'RY symmetric' (or 'Y'), 'RY' (or 'R'), 'JC69' (or 'J'), 'K2P' (or 'K'), 'F81' (or '1'), 'F84' (or '4'), 'TN93' (or 'T'), 'LogDet' (or 'L'). For IQtree, choices are HKY, JC, F81, K2P, K3P, K81uf,

TNef, TIM, TIMef, TVM, TVMef, SYM, GTR, TEST. TEST to run standard model selection.

-B Add this option if you want to write all bootstrap pseudo-trees. Only available with FastME and IQtree.

--mem MEMORY Maximal RAM usage in GB | MB. Only available with iqtree.

-fast Use -fast option with iqtree.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

> PanACoTA all -h

# PanACoTA all -h

usage: PanACoTA all [-c CONFIGFILE] -o OUTDIR [--threads THREADS] [-T NCBI_SPECIES_TAXID] [-s NCBI_SPECIES] [-l LEVELS] [--cutn CUTN] [--l90 L90] [--nbcont NBCONT] [--prodigal] -n NAME [-i MIN_ID] [--tol TOL] [-Mu] [-X]

[--soft {fasttree,fastme,quicktree,iqtree,iqtree2}] [-v] [-q] [-h]

___ _____ ___ _____ _____

( _`\ ( _ )( _`\ (_ _)( _ )

| |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

| ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

| | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

(_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

Large scale comparative genomics tools

-------------------------------------------

=> Run all PanACoTA modules

General arguments:

-c CONFIGFILE Path to your configuration file, defining values of parameters.

-o OUTDIR Path to your output folder, where all results from all 6 modules will be saved.

--threads THREADS Specify how many threads can be used (default=1)

'prepare' module arguments:

-T NCBI_SPECIES_TAXID

Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. A comma-separated list of taxid can also be provided.

-s NCBI_SPECIES Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")

-l LEVELS, --assembly_level LEVELS

Assembly levels of genomes to download (default: all). Possible levels are: 'all', 'complete', 'chromosome', 'scaffold', 'contig'.You can also provide a comma-separated list of assembly levels.

For ex: 'complete,chromosome'

Common arguments to 'prepare' and 'annotate' modules:

this option. If you want to cut from a different number of 'N' in a row, put this value to this option.

--l90 L90 Maximum value of L90 allowed to keep a genome. Default is 100.

--nbcont NBCONT Maximum number of contigs allowed to keep a genome. Default is 999.

'annotate' module arguments:

--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.

for Escherichia Coli.

'pangenome' module arguments:

-i MIN_ID Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.

'corepers' module arguments:

--tol TOL min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the minimum number of genomes

will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.

-Mu Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to allow exactly 1 member in

'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not compatible.

-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is not compatible with -M

(which is allowing multigenic families: having several members in any number of genomes).

'tree' module arguments:

--soft {fasttree,fastme,quicktree,iqtree,iqtree2}

Choose with which software you want to infer the phylogenetic tree. Default is IQtree.

Others:

-v, --verbose Increase verbosity in stdout/stderr.

-q, --quiet Do not display anything to stdout/stderr. log files will still be created.

-h, --help show this help message and exit

For more details, see PanACoTA documentation.

実行方法

１、２のPrepareとannotate コマンドを使うと、アセンブリ品質と分類の観点から要件に適合しないゲノムを排除しながらゲノムを集めることができる。

１、prepare - ゲノムのダウンロードとクオリティチェック

PanACoTAの最初のモジュールであるprepareを使うと、与えられたNCBI taxonomy IDから全ゲノムを取得することができる。このコマンドは、refseqやgenbankからゲノムをダウンロードし、Mash遺伝的距離に基づいて、冗長なゲノムや分類ミスしたゲノムをフィルタリングして保存することができる。

PanACoTA prepare -T 104099 -g "Acetobacter orleanensis" -p 1

-T Species taxid to download, corresponding to the 'species taxid' provided by the NCBI. This will download all sequences of this species and all its sub-species.A comma-separated list of species taxids can also be provided. (Ex: -T 573 for Klebsiella pneumoniae)

-t Taxid to download. This can be the taxid of a sub-species, or of a specific strain. A taxid of a subspecies will download all strains in this subspecies EXCEPT the ones which have a specific taxid.A comma-separated list of taxids can also be provided.Ex: '-t 72407' will download all 'general' K. pneumoniae subsp. pneumoniae strains, and '-t 1123862' will download the strain K. pneumoniae subsp. pneumoniae Kp13 (not included in -t 72407, as it is a strain of the subspecies with a specific taxid).
-g Species to download, corresponding to the 'organism name' provided by the NCBI. Give name between quotes (for example "escherichia coli")
-s {refseq, genbank} NCBI section to download: all genbank, or only refseq (default)
-l Assembly levels of genomes to download (default: all).
--cutn By default, each genome will be cut into new contigs when at least 5 'N' in a row are found in its sequence. If you don't want to cut genomes into new contigs when there are rows of 'N', put 0 to this option. If you want to cut from a different number of 'N' in a row, put this value to this option.
--l90 Maximum value of L90 allowed to keep a genome. Default is 100.
--min_dist By default, genomes whose distance to the reference is not between 1e-4 and 0.06 are discarded. You can specify your own lower limit (instead of 1e-4) with this option.
--max_dist By default, genomes whose distance to the reference is not between 1e-4 and 0.06 are discarded. You can specify your own lower limit (instead of 0.06) with this option.
-p Run 'N' downloads in parallel (default=1). Put 0 if you want to use all cores of your computer.

出力

f:id:kazumaxneo:20210908005228p:plain

refseq/bacteria/にはアセンブリごとに1つのフォルダが作られ、fasta.gz形式のアセンブリ配列とMD5SUMSが入っている。 Database_init/にはダウンロードしたすべてのアセンブリがfasta形式で含まれ、tmp_files/にはNが5以上続くストレッチで分割されたアセンブリがfasta形式で含まれている。assembly_summary-Vibrio_diabolicus.txtにはダウンロードした全ゲノムのリストと種名やIDなどが含まれている。PanACoTAは、廃棄されたゲノムとその理由をリストアップしたファイルも提供する。discarded-by-L90_nbcont-Acetobacter_orleanensis.lstファイルは品質管理ステップで廃棄されたゲノムのリストが含まれ、discarded-by-minhash-Acetobacter_orleanensis-0.0001_0.06.txtには、フィルタリングステップで廃棄されたゲノムのリストが含まれている。LSTINFO-Acetobacter_orleanensis-filtered-0.0001_0.06.txtには、チェックをパスしたゲノムのリストとコンティグ数、ゲノムサイズ、L904が保存される。

上の例では、４つのゲノムがダウンロードされ、MASH距離で遠かった3つのゲノムが排除と判断された (近すぎるゲノムも排除され得る)。フィルタリングをパスしたファイルだけ別に書き出されるわけではない。

フィルタリングをパスしたゲノムは LSTINFO～を確認する。
> cat LSTINFO-Acetobacter_orleanensis-filtered-0.0001_0.06.txt

refseqに登録されているゲノムだけでなく、genbankに登録されているすべてのゲノムをダウンロードしたい場合は、"-s genbank"オプションを使う。デフォルトは"-s refseq"でrefseqアセンブリはNCBIでのチェックが入っている分信頼性は高いが、最近ゲノムの登録が加速しており、最近数か月か1年以内に登録されたものだと、genbank アセンブリのみあってrefseqアセンブリは利用できないゲノムもかなりあったりするので注意する。

特定のアセンブリレベルのゲノムのみをダウンロードしたい場合は、オプション”-l”を使う（デフォルトは'all'）；ダウンロードしたいアセンブリレベルを'all', 'complete', 'chromosome', 'scaffold', 'contig'の間でコンマで区切って指定する。

ダウンロードは省略して、手持ちのゲノムをフィルタリングするためにprepareモジュールを使用することもできます。Documentを読んで下さい。

手持ちのゲノムをチェックして使うこともできる。

PanACoTA prepare --norefseq -o outdir -d your_genome_dir/

#閾値をゆるくする
PanACoTA  prepare --norefseq -o outdir -d your_genome_dir --min_dist 0 --max_dist 1 --cutn 0 --l90 1000

マニュアルとpreprintより

　ユーザーが同じ種のゲノムを解析していると仮定すると、それらのゲノムはコンティグの数と長さの点で比較的類似した特徴を持っているはずです。そのため、PanACoTAはまず、各ゲノムについて、コンティグの総数とL90（ゲノム全体の少なくとも90%を取得するために必要なコンティグの最小数）という2つの重要な指標を計算します。これら2つの変数の値が非常に高いということは、通常、シーケンシングやアセンブルの質が低いことを示しています。これらの値が高いと、多くの切り捨てられた遺伝子のアノテーションが行われ、パンゲノムのサイズが不自然に大きくなってしまいます（実際の遺伝子は、異なるファミリーに分類される多数のオープンリーディングフレームに分割されているため）。また、このような貧弱に組み立てられたゲノムは、小さなコンティグに支配されているため、遺伝的連鎖を研究することが重要な場合には、比較ゲノムの研究を複雑にします。しきい値はユーザーが指定することができます。デフォルトでは1000コンティグ以下、L90は100以下に設定されています。

　手順の第二部分は、冗長なゲノムや分類ミスのあるゲノムを除去するための専用のフィルターです。これは、Mashによって計算されたゲノムのペア間の遺伝的距離に基づいて行われます。この距離フィルタリングステップにMashを選んだのは、非常に高速に計算でき、密接に関連したゲノムに対しても正確であるからです。Mashは、MinHash技術を用いて、各ゲノム配列を代表的なk-merのスケッチに変換します。次に、完全な配列ではなく、これらのスケッチを比較します。この出力Mash距離Dは、ブラストアルゴリズムを用いた全ゲノム配列比較に基づく平均ヌクレオチド同一性(ANI)のようなアラインメントベースの尺度と強く相関します。D≈1 - ANI。ANIが90-100%の範囲であれば、スケッチサイズを大きくするとMash距離との相関はさらに強くなります。パンゲノムは通常、単一の細菌種に対して計算されるので、ここでは少なくとも94%の同一性を持つゲノムを識別するためにMashを使用しています。最近のプログラムでは、Mashよりもわずかに高い精度を示すものがいくつか発表されていますが、システマティックなフィルタとして使用するには遅すぎることがわかりました。

　細菌種は通常、94%以上の同一性を持つゲノムのグループとして定義され、これは通常、Dの上限値として使用されます(max_mash_distは変更可能なパラメータで、デフォルトでは0.06に固定されています)。もう一方の極端な例として、非常に高い類似性を持つ（低いマッシュ距離に対応する）ゲノムは、非常に類似した情報を提供します。これらを除外することで、解析に必要な計算資源を減らし、特定のクレードでのオーバーサンプリングを減少させる（＊メジャーな病原性細菌など）。PanACoTAでは、デフォルトでmin_mash_distを10^-4に設定していますが、このパラメータはユーザーが指定することもできます。この距離は、平均して10遺伝子ごとの1ポイントの変化を表しており、多くのドラフトゲノムのシーケンシングとアセンブルの精度に近いかもしれません。

２、annotate - ゲノムの均一なアテーション

PanACoTAの２つ目のモジュールであるannotateを使ってゲノムにアノテーションをつける。アノテーションの入力は、(multi)fasta形式のゲノム配列で、<db_path>で指定されたディレクトリに保存されている必要がある。ただし、提供するリストファイルにかかれたファイルのみ対象になるので、<db_path>で指定されたディレクトリにゲノム配列以外のファイルが含まれていても問題ない。

リストファイル（list_genomes.ls）には、アノテートするゲノム名を1行ずつ書く。リストファイルにはファイルのパスを書くわけではないので注意する（ゲノムファイルのパスは-d <pasth>で指定）。

genome1.fst
genome2.fst :: GEN2
genome3-chromo.fst genome3-plasmid.fst :: .1216

リストファイルとリストのファイルのパス、出力ディレクトリ、名前 (4文字) などを指定する。リストファイルは、prepareの出力のフィルタリングをパスしたゲノムのリストファイルである”LSTINFO～”のパス部分だけを消して使うこともできる（2列目以降は無視される）。"

-r annotate"は出力ディレクトリ。デフォルトでは、L90（全長の長さの90%に達した時のコンティグ数）が100以上のゲノムはアノテーションされずに捨てられる。これはそれなりに厳しい閾値なので、連続性の低いドラフトゲノムはかなり捨てられる事になる。例えば200に変えるには”--l90 200”とタイプする。

PanACoTA annotate -l list -d genome_dir -r annotate -n <name> --threads 20 --l90 100

-d DB_PATH Path to folder containing all multifasta genome files
-l LIST_FILE File containing the list of genome filenames to annotate (1 genome per line). Each genome must be in multi-fasta format. You can specify the species name (4 characters) you want to give to some genome, as well as the
download (or any other reason) date of your choice. Format 'genome_name :: name.date'. name and date are optional. See doc for more information on this file format. If you want to run this module from 'prepare_module' results, use '--info' instead.
-r RES_PATH Path to folder where output annotated genomes must be saved
-Q Add this option if you want only to do quality control on your genomes (cut at 5N if asked, calculate L90 and number of contigs and plot their distributions). This allows you to check which genomes would be annotated with the given parameters, and to modify those parameters if you want, before you launch the annotation and formatting steps.
--nbcont Maximum number of contigs allowed to keep a genome. Default is 999.
--l90 Maximum value of L90 allowed to keep a genome. Default is 100.
-n Choose a name for your annotated genomes. This name should contain 4 alphanumeric characters. Generally, they correspond to the 2 first letters of genus, and 2 first letters of species, e.g. ESCO for Escherichia Coli.
--prodigal Add this option if you only want syntactical annotation, given by prodigal, and not functional annotation requiring prokka and is slower.
--small If you use Prodigal to annotate genomes, if you sequences are too small (less than 20000 characters), it cannot annotate them with the default options. Add this option to use 'meta' procedure.
--threads Specify how many threads can be used (default=1)

出力

f:id:kazumaxneo:20210908023248p:plain

このステップの出力は、ゲノムごとに5つのファイルで構成されている：オリジナルの配列、遺伝子、タンパク質（すべてfasta形式）、すべてのアノテーションを含むgffファイル、およびサマリー情報ファイル、となる。例えばgff3/には各ゲノムからのgff3形式のアノテーションされたファイルだけが保存されている。

QC_L90-list_genomes.png

f:id:kazumaxneo:20210908023316p:plain

-Qオプションを付けて実行すると、アノテーションは無しでクオリティチェックのみ行われる。

マニュアルとpreprintより

　ゲノムのアノテーションは、PanACoTAのアノテートモジュールによって行われます。入力は、prepareモジュールまたはユーザーから直接提供されたfasta配列のデータベースです。これらのゲノムの品質管理（コンティグ数や L90）に関する情報がない場合は、ここで品質管理を行います（品質管理の詳細については前のセクションを参照してください）。このモジュールの目的は、データセット全体の遺伝子位置（および機能）の均一なアノテーションを提供することであるため、PanACoTA上でProdigalを使用して全てのゲノムの遺伝子位置を特定し、Prokkaを用いて全てのゲノムを均一にアノテーションします。また、その際には、BLAST+を用いてUniprotから取得したタンパク質のデータベース内のホモログを検索したり、HMMER3を用いてTIGRFAMやPFAMから選択したプロファイルに合致するタンパク質を検索したりと、一連のプログラムを用いて機能的なアノテーションも追加します。すべてのアノテーションされた配列は、標準的な配列ヘッダー形式を用いてリネームされています。各遺伝子のヘッダーには20文字の文字が含まれており、遺伝子のゲノムとコンティグ、ゲノム中の相対位置、コンティグの境界にあるかどうかについてのヒトが読める情報を提供しています。

３、pangenome - パンゲノムの計算

PanACoTAのpangenomeモジュールを用いてパンゲノムの計算を行う。入力は全ゲノムの全タンパク質の集合で、annotateモジュールで生成されたProteinsディレクトリの配列ファイル（.prtファイル; multi-fasta）を使う(ヘッダーがPanACoTAの命名規則に則っている)。リストファイルはannotateモジュールで生成されたLSTINFO-.txt（マニュアルの解説）をそのまま使用できる。

PanACoTA pangenome -d annotate/Proteins -l annotate/LSTINFO-.lst -n <pangenome-name> -o outdir --threads 20

-l File containing the list of all genomes to include in the pan-genome, 1 genome per line: it can be the LSTINFO-<list_file>.lst file of 'PanACoTA annotate' module

-n Name of the dataset which will be clustered (for example, SAEN1234 for 1234 Salmonella enterica genomes). This name will be used to name the protein databank, a well as the pangenome files.
-d Path to the folder containing all protein files corresponding to the genomes of the dataset (output directory 'Proteins' of 'PanACoTA annotate' module).
-o Output directory, where all results must be saved (including tmp folder)
-i Minimum sequence identity to be considered in the same cluster (float between 0 and 1). Default is 0.8.
-f Use this option if you want to give the name of the pangenome output file (without path).
-c {0,1,2} Choose the clustering mode: 0 for 'set cover', 1 for 'single-linkage', 2 for 'CD-Hit'. Default is 'single-linkage' (1)
-s use this option if you want to save the concatenated protein databank in another directory than the one containing all individual protein files ('Proteins' folder).
--threads add this option if you want to parallelize on several threads. Indicate on how many threads you want to parallelize. By default, it uses 1 thread. Put 0 if you want to use all threads of your computer.

出力

f:id:kazumaxneo:20210908112321p:plain

パンゲノムファイルは、1ファミリーにつき1行で構成されている。最初の列はファミリー番号で、その他はすべてのファミリーメンバーになる。

PanGenome-pangenome.All.prt-clust-0.8-mode1.lst

f:id:kazumaxneo:20210908112853p:plain

各ファミリーごとに1行、全メンバーのリストを含む。

PanACoTAは、各ファミリーの構成の概要を示す表形式のファイルと、パンゲノムの量的および質的マトリックスも提供する。

マニュアルとpreprintより

　pangenomeはゲノム中のすべてのタンパク質ファミリーの集合です。その計算には、すべてのタンパク質のペア間の比較が含まれており、その複雑さは遺伝子の数（つまりゲノムの数）の二乗になります。信頼性の高いパンゲノムを合理的な時間で生成するために、PanACoTAはMMseqs2スイートを呼び出します。mmseqs検索モジュールは、非常に優れたスピードと感度のトレードオフを持っています。時間を短縮するために、感度を上げたり、速度を下げたりしながら、3つの連続した検索ステージを使用しています。すべてが高度に並列化され、複数のレベルで最適化されています。最初のステップでは、高い非類似性、すなわち、少なくとも2つの連続したkmerマッチを持たない配列を排除することで、最大99.9%の配列をフィルタリングします。第2段階では、ギャップなしのアラインメントを使用して残りの99%の配列をフィルタリングします。これにより、完全なアラインメントではなく、スコアのみが計算されるSmith-Watermanアライメントの最適化されたバージョンを使用して処理する配列の量が少なくなります。

　MMseqs2スイートに含まれるmmseqs clusterモジュールを使用し、デフォルトのCascaded clusteringオプションを設定しました。このモジュールは主に2つのステップで動作します。まず、線形時間タンパク質配列クラスタリングアルゴリズムであるlinclustをプレフィルターとして使用してタンパク質をクラスタリングします。次に、この最初のステップの代表的な配列が mmseqs 検索モジュールによって処理され、その結果に応じてクラスタリングされます。この第２ステップを３回繰り返し、その都度mmseqs検索アルゴリズムモジュールで高感度化します。

　クラスタリングの段階では、PanACoTAはConnected componentモードを使用します。このモードでは、相同遺伝子のペアをマージするために遷移的接続を使用します。各ノードはタンパク質であり、2つのタンパク質が類似している（与えられた閾値以上の類似度）場合、2つのタンパク質の間にエッジが存在します。各ノードはタンパク質であり、2つのタンパク質が類似していれば（類似度が与えられた閾値を超えていれば）、2つのタンパク質の間にエッジがあります。必要に応じて、ユーザーは、PanACoTAパンゲノムモジュールを起動している間に、専用のパラメータを使用して、他の2つのクラスタリングモード（Greedy Set cover、またはGreedy Incremental）のいずれかを選択することができます。重要なことは、mmseqs2のオプションを調整することで、配列類似性解析を非常に高速に、または非常に高感度に行うことができるということです。PanACoTAでは、ユーザーは主要なパラメータ-min-seq-idと-cluster-modeを変更し、結果への影響を調べるためにmmseqsクラスタモジュールを再実行することができます。より具体的なmmseqs2パラメータは、当面の間、スタンドアロン版のプログラムで使用されます。

　なお、ドラフトゲノムの場合、ゲノム中の遺伝子間のシンテニーは考慮していませんが、ドラフトゲノムの場合、これは限られた興味しかないと考えています。生成されたファミリーを調べてみると、ほとんど差がなく、有意ではないことがわかりました。しかし、ユーザーが非常によくアセンブルされたゲノムを持っている場合や、シンテニーを考慮する特別な理由がある場合には、panOCT、SynerClust、PANINIなどのツールが開発されています。出力されるパンゲノムのファミリーごとの要素数を示すファイルは、直接TreeWASの入力として使用することができます。

４、corepers - コア・persistent遺伝子ファミリーの計算

corepersモジュールによって、多数の分類群に存在するコア・persistent遺伝子ファミリーの分類を行う。これには、３のpangenomeモジュールで生成されたpangenomeファイルを指定する。"-t"オプションで、ゲノムの何パーセントに含まれている遺伝子をpersistentな遺伝子（*1）とするか定義する（デフォルトは1.0で全てのゲノムに存在するコア遺伝子となっている）。”-X"や"-M"オプションを使うと定義をより緩やかにすることができる。

PanACoTA corepers -p pangenome/PanGenome-pangenome.All.prt-clust-0.8-mode1.lst -o core

#90%
PanACoTA corepers -p pangenome/PanGenome-pangenome.All.prt-clust-0.8-mode1.lst -o core -t 0.9

#90%、複数ヒット許可（全ゲノム）
PanACoTA corepers -p pangenome/PanGenome-pangenome.All.prt-clust-0.8-mode1.lst -o core -t 0.9 -M

-p PanGenome file (1 line per family, first column is fam number)
-o Specify the output directory for your core/persistent genome.
-t min % of genomes having at least 1 member in a family to consider the family as persistent (between 0 and 1, default is 1 = 100% of genomes = Core genome).By default, the minimum number of genomes will be ceil('tol'*N) (N being the total number of genomes). If you want to use floor('tol'*N) instead, add the '-F' option.
-M Add this option if you allow several members in any genome of a family. By default, only 1 (or 0 if tol<1) member per genome are allowed in all genomes. If you want to allow exactly 1 member in 'tol'% of the genomes, and 0, 1 or several members in the '1-tol'% left, use the option -X instead of this one: -M and -X options are not compatible.
-X Add this option if you want to allow families having several members only in '1-tol'% of the genomes. In the other genomes, only 1 member exactly is allowed. This option is not compatible with -M (which is allowing multigenic families: having several members in any number of genomes).
-F When you specify the '-tol' option, with a number lower than 1, you can add this option to use floor('tol'*N) as a minimum number of genomes instead of ceil('tol'*N) which is the default behavior.

出力

f:id:kazumaxneo:20210908114219p:plain

PersGenome_PanGenome-pangenome.All.prt-clust-0.8-mode1.lst-all_1.lst

f:id:kazumaxneo:20210908114337p:plain

マニュアルとpreprintより

初期の研究では、パンゲノムマトリックスを用いて、コアゲノムという単一コピーで全ゲノムに存在する遺伝子ファミリーを同定していました。しかし、データセットのゲノム数の増加に伴い、コアゲノムのサイズは大幅に減少する傾向にあります。これは、シーケンスやアノテーションの誤りや、集団内でのまれな欠失多型が、入力ゲノム数の増加に伴ってコア遺伝子の数を急速に減少させるためです。この問題を克服するために、現在では一般的にpersistentゲノムを同定することが行われています。あるファミリーは、ゲノムの少なくともN%のメンバーを含んでいれば、persistentゲノムに含まれていることになります。Nはユーザーが定義します。デフォルト値は、1000以上のゲノムを持つデータセットでは95%です。persistentゲノムは、まれな（真の、または人工的な）変種に対してよりロバストです。一方、persistentゲノムを計算する目的が系統樹の作成や集団遺伝学データの解析である場合、異なるしきい値を持つセットを作成したいと思うかもしれません。実際、あまりにも高い値を設定すると、ギャップの少ないpersistentな遺伝子のセットが小さくなり、堅牢な系統樹を推定するのに十分な値になるかもしれません。一方で、これは、正の選択や組換えイベントの検出のような他のタイプの解析から多くの遺伝子ファミリーを除外することになります。また、persistentなゲノムの定義も、その後のデータの利用方法によって異なる場合があります。PanACoTAでは、3つのタイプのpersistentゲノムを定義しています（論文図3参照）。

Strict-persistent: 少なくともN%のゲノムに正確に1つのメンバーを含むファミリー（N = 100はコアファミリーであることを意味します）。この定義は、ゲノムごとに複数のコピーが存在することを考慮せずに系統を再構築する場合に特に有効です。

Mixed-persistent: 少なくともN%のゲノムが正確に1つのメンバーを持ち、他のゲノムが0か、複数のメンバーを持つファミリー。この定義は他の2つの間の中間的なもので、厳密-persistentなものを含み、multi-persistentなものも含まれます。

Multi-persistent: N%のゲノムに少なくとも1つのメンバーを持つファミリー。この定義は、ほぼユビキタスなタンパク質ファミリーの多様化のパターンを分析する上で興味深い。

ユーザーが固定したしきい値を使用するのではなく、統計的アプローチを使用してpersistentなゲノムを識別したい場合、アノテートモジュールによって生成されたgffファイルはPPanGGOLiNと互換性がある。PPanGGOLiNは、ここでのmulti-persistentな定義でのpersistentゲノムを生成します。

５、align - persistentな遺伝子ファミリーのアラインメントを行う

alignモジュールによってpersistentな遺伝子ファミリーのアラインメントを行う。”-c”で定義したコア・persistent遺伝子ファミリーファイルを指定する。”-l”で指定するリストファイルには、annotateモジュールで出力されたLSTINFO-.txtが使える。”-d”では、annotateモジュールで出力されたディレクトリを指定する。計算時間がかかるので、”-threads”オプションを使ってスレッド数を増やせるだけ増やしておく。”-n”で指定した名前は、出力される全てのファイルのprefixとして使用される。

PanACoTA align -c core/PersGenome_PanGenome-pangenome.All.prt-clust-0.8-mode1.lst-all_1.lst \
 -l annotate/LSTINFO-.lst -n <name_of_the_dataset> -d annotate/ -o align --threads 20

-c Core or persistent genome whose families must be aligned.
-l File containing the list of all the genomes you want to align from their core/persistent families. 1 genome per line: it can be the LSTINFO-<list_file>.lst file of 'PanACoTA annotate' module. Here, only the first column (genome name without extension) will be used. The final alignment file will contain 1 alignment per genome in this file.
-n Name of the dataset which will be aligned (for example, SAEN1234 for 1234 Salmonella enterica genomes). This name will be used to name the alignment file.
-d Path to the folder containing the directories 'Proteins' and 'Genes', created by 'PanACoTA annotate'.
-o Output directory, where all results must be saved
--threads add this option if you want to parallelize on several threads. Indicate on how many threads you want to parallelize. By default, it uses 1 thread. Put 0 if you want to use all threads of your computer.

出力

f:id:kazumaxneo:20210908144646p:plain

phylo-xxx/にすべてのコア・persistent遺伝子ファミリーのアラインメントを連結したファイルが出力される（xxxは"-n"で指定したprefix）。このファイルをツリー推論に使う。出力ディレクトリには、各コア・persistent遺伝子ファミリーについて、その遺伝子とタンパク質の配列をアラインメントしたファイルなども含まれている。

マニュアルとpreprintより

　厳密なpersistentゲノム（コアゲノム）を使用する場合は、すべての遺伝子がアラインメントされます。他の定義のpersistentゲノムを使用した場合、遺伝子を欠いていたり、複数のコピーを持っていたりするゲノムもありますが、これらのアラインメントから系統樹を作成するためには、あらかじめそのような場合に対応する必要があります。ある遺伝子ファミリーのメンバーが欠落していたり、複数のメンバー（混合またはmulti-persistent）が存在する場合、PanACoTAは他のアラインメントされた遺伝子と同じ長さのギャップ（'-'）を追加します。いくつかの"-"を追加しても、系統再構成にはほとんど影響を与えません。例えば、アラインメントマトリックスに最大60%の欠落データを追加しても、有益なアラインメントが得られることが示されています[ref.42]。著者らの経験では、このアプローチを種内パンゲノムに適用した場合、通常は1%未満の欠落が含まれています。このように、欠損データの影響は、より多くの遺伝子からの系統シグナルを使用する利点に比べれば、無視できる程度のものであるはずです（すなわち、厳密な一貫性のあるゲノムを使用する場合とは対照的です）。アライメントは、タンパク質配列のレベルで行うと、より正確になります。これは、コドンベースのヌクレオチド・アライメントを生成するという付加的な利点があり、コーディング配列に対する選択圧を研究するために使用することができます。そのため、PanACoTAは配列を翻訳し、対応するタンパク質をアラインメントした後、それらをDNAに逆翻訳してヌクレオチドアラインメントを取得します。この最後のステップは、各アミノ酸を元のコドンに置き換えることで構成されています。したがって、このプロセスの最後には、アラインメントされた配列は元の配列と同じになります。

　PanACoTAは、MAFFTを用いた多重配列アラインメントを行います。PanACoTAには、非常に大規模なデータセットを扱うために、ある程度の精度を犠牲にしながらも、はるかに高速なアライメントを行うことができるオプションがあります。この精度の低下は、種内のオーソロガスな遺伝子ファミリーの場合のように、非常に類似した配列の場合では影響は低く、PanACoTAはpersistentなゲノムのアラインメントを非常に迅速に行うことができることを意味します。

６、tree - ツリーの再構成

系統推論は、PanACoTAのtreeモジュールを用いて行う。このモジュールは、AlignモジュールのアラインメントやFastaフォーマットの他のアラインメントを入力として使う。

#iqtree
PanACoTA tree -a align/phylo-xxx/xxx.nucl.grp.aln -o tree --threads 40 -b 1000

#fasttree ("FastTreeMP"しか認識しないので注意)
PanACoTA tree -a align/phylo-xxx/xxx.nucl.grp.aln -o tree -s fasttree

このステップまでくれば、PanACoTAに頼らずにiqtreeなどを走らせても良いと思われます。

ちなみに、系統樹に視覚化した時、アノテーションをかけてリネームされているので、元の種名や株名が分からなくなります。その時はannotate/のLSTINFO-か、prepare/のリストファイルを見て下さい。

マニュアルとpreprintより

系統推論に必要な時間は、データセットのサイズに応じて非常に速く増加するため、この部分がパイプライン全体の中で最も時間がかかる部分です。最尤解析の効率的な実装でさえ、サイト数と分類群数の積でスケーリングしてしまうため、大規模なデータセット（数千の分類群、各分類群のサイト数が1万サイト以上）の場合には問題となります。PanACoTAは系統樹を得るためにいくつかの異なる方法を提案している。IQ-TREE、FastTreeME、fastME、Quicktree です。どのようなソフトウェアを使用しても、ツリーモジュールはFasta形式のヌクレオチドアラインメント（例えば、alignモジュールの出力のようなもの）を入力として受け取り、Newick形式のツリーを返します。ニーズに応じて、ユーザーはこれらの方法のいずれかを選択して系統樹を推論することができます。これらのツリーは、IQ-TREEのオプションを変更するなどして、より厳密な系統樹推論を行うために使用することができます。

感想

　これまでのパンゲノム解析では、ゲノムを集めたりアノテーションをつけるのはユーザーの責任として、ツールとは無関係のステップになっていました（もしくは、おまけ扱いでスクリプトのみ利用できた）。しかし、パンゲノムの計算では、入力データの品質が鍵になります。ドラフトゲノムの品質の悪さやアノテーションの揺らぎに起因するコアゲノムセットの変動をできる限りゼロに近づけるのは極めて重要な事です。このPanACoTAは、モジュール方式を採用することで、その前段階ステップまで踏み込んでコードと環境を整備しており、一定の水準のゲノムだけを再現性良く集めることができます。均一な水準で得られたゲノム情報だけ使う事で、より正確にパンゲノムを定義できるはずです。

　ツールの完成度が高いだけでなく、細部まで説明された完全なマニュアルと、ユーザーの立場に立ったガイドが提供されています。初めて微生物の比較ゲノム解析を行おうとしている人にも利用しやすい環境が整えられているのではないでしょうか。個人的には、このツールは開発の終了したroaryの乗り換え先に使えそうに見えたので、プレプリント発表直後から注目していました。しかしなかなかまとまった時間が無くて、まとめるのに時間がかかってしまいました。正式なDocumentを見た方が良いのは間違いありませんが、よろしければ参考になさって下さい。

引用

PanACoTA: a modular tool for massive microbial comparative genomics
Amandine Perrin, Eduardo P.C. Rocha

NAR Genom Bioinform. 2021 Mar; 3(1): lqaa106.

PanACoTA: A modular tool for massive microbial comparative genomics

Amandine Perrin, Eduardo P.C. Rocha

bioRxiv, Posted September 18, 2020

*１

persistent遺伝子

コア遺伝子を厳密に定義する時の問題点として、解析対象のゲノム数が増えると、gene lossや技術的な問題（アセンブリ、アノテーションなど）によってコアゲノムのサイズが減少することが知られています。よくあるのは、1個か2個だけ品質が低かったり分類が間違って登録されているゲノムが含まれているために、100％保存されているコア遺伝子が大きく減ってしまうパターンです（例）。そのため，大多数のゲノムで保存されているpersistent（持続的な、落ちない、変質しにくい）な遺伝子に注目することが提案されています。

パンゲノム解析によってアノテーション情報の改善を試みる panaroo

2020 Preprint pan-genome bacteria Cytoscape assembly graph quality check evaluation tool ゲノム比較 (comparative genomics) 結果の視覚化 (visualization) PAV (presence/absence variations) Genome Biology

Preprintより

　原核生物のゲノム進化は、親から子への遺伝物質の垂直伝達と生物間の水平遺伝子伝達の両方によって引き起こされる（ref.1）。細菌の大規模なシーケンシング研究から、種内ゲノム含有量に大規模な違いが生じることが確認されている（ref.2）。これにより、種全体で発見されたすべての遺伝子のセットであるパンゲノム研究に至った（ref.3）。パンゲノム内では、遺伝子は「コア」ゲノムの一部、種のすべてのメンバーに存在する遺伝子のセット、または非コアゲノム（「アクセサリ」）に分けられる。細菌ゲノムデータセットのパンゲノムを推定する際の一般的な問題は、共有配列の同一性によって定義される相同遺伝子を、オルソログまたはパラログのクラスターに分類することである。オルソログは、共通の祖先の同じ祖先配列から割り当てられた相同な遺伝子であり、遺伝子の重複や獲得によるものではない。バクテリアのパンゲノムを分析するとき、遺伝子またはタンパク質の機能だけでなく、その場所にも関心があることがよくある。2つのほぼ同一の遺伝子がゲノムの異なる場所で異なる調節を受けている可能性があるからである。したがって、パンゲノム解析の多くのプログラムは、位置情報を使用してパラログをさらに識別する。これは、2つの遺伝子が遺伝子の重複により同じ祖先配列から外れた場合、またはホモログが水平に取得された場合に発生する。
　パンゲノムを推定するこれまでのアプローチには、Roary、OrthoMCL、PanOCT、PIRATE、PanX、PGAP、COG-soft、MultiParanoidが含まれる（ref.4〜10）。パンゲノムを決定する方法の大半は、2つの同様のアプローチのいずれかを使用する傾向がある（論文補足図1）。ほとんどの場合、CD-HIT、BLAST、DIAMONDなどの相同性検索ツールを使用して、定義済みの遺伝子配列間の類似性を推測することから始まる（ref.11–13）。この出力を使用して、ペアワイズ距離行列が作成され、一般的なマルコフクラスタリングアルゴリズム（MCL）を使用するか、ペアワイズベストヒット（BeTs）のトライアングルを調べることにより、遺伝子が直交グループにクラスター化される（ref.14、15）。これらの方法のサブセットは、各遺伝子の近傍またはゲノムコンテキストを使用して、オルソロガスクラスターをパラログにさらに分割することによって継続する。
　細菌ゲノム集団の研究が大きくなるにつれて、ゲノムアノテーションの正確さやゲノムアセンブリの連続性の対応する増加はなかった。したがって、これらのデータベースが成長するにつれて、誤った遺伝子アノテーションの数も増加した。これは、結果として生じるパンゲノムの推定値に大きな意味を持ち、それにより、より多くのゲノムがより多くのエラーにつながる（ref.16、17）。このようなエラーは、アクセサリーゲノムの遺伝子座を介して作用するNFDSのモデリングなど、パンゲノムの下流モデリングに問題を引き起こす可能性がある（ref.18、19）。断片化されたアセンブリ、誤ったアノテーション、汚染、および誤ったアセンブリにより、エラーがパンゲノム解析に導入される可能性がある。デントンらは、断片化されたアセンブリが真核生物のドラフトゲノムにおける遺伝子数の増加の主な原因であることを示した（ref.17）。エラーは、しばしばアクセサリーゲノムのサイズの推定にインフレーションをもたらすが、アノテーションソフトウェアが遺伝子を特定できない場合、または遺伝子がアセンブリの中断によって断片化された場合、推定サイズを縮小する場合、コアゲノム遺伝子の欠落につながる可能性がある。現在のパンゲノム推論アルゴリズムの多くは、シミュレートされたデータを使用した厳密な検証の対象ではない。その結果、初期のゲノムアノテーションで発生するエラーに対処する能力は限られた注目を集めた。ここでは、パンゲノムを推定する別のアプローチ、Panarooを提示する。これは、グラフベースのアルゴリズムを使用してゲノム間で情報を共有し、アノテーションエラーの多くの原因を修正できるようにする。panarooは、データセット内の各ゲノムによって提供される追加情報を活用して、アノテーションコールを改善し、その結果、パンゲノム内のオルソログとパラログのクラスタリングを改善する。また、提供する分析パッケージをさらに充実させ、統合されたデータ品質管理、遺伝子関連分析を可能にし、種間のパンゲノムの比較を可能にする多数の分析前および分析後スクリプトを提供する。 Panarooはパンゲノムの完全なグラフ表現を構築するので、結果のグラフ内に構造的変動を調査することができ、構造的変動と表現型との関連付けを呼び出すことができる。 Infinitely Many Genesモデル（ref.20）を使用した広範なシミュレーションと、Global Pneumococcal Sequencing（GPS）プロジェクトの主要なクレードを含む大規模な細菌ゲノムデータセットの多様な分析（ref.21）により、アルゴリズムの成功を実証する。 Panarooの出力を、パンゲノムを分析するための従来のゴールド標準法と比較し、Panarooが優れたオルソログクラスターを生成し、多くの場合、推定されるアクセサリーゲノムのサイズの大幅な減少とコアゲノムサイズの増加の両方をもたらすことを示す。
　Panarooはパンゲノムの完全なグラフィック表現を構築する。ノードはオルソロガス遺伝子（COG）のクラスターであり、2つのノードは母集団のサンプルのコンティグ上で隣接している場合、エッジで接続される。このグラフィカルな表現を使用して、Panarooは、アノテーション中に導入されたエラーを修正する。多様な遺伝子ファミリーの崩壊、汚染のフィルタリング、断片化された遺伝子セグメントのマージ、欠損遺伝子の再発見。 Panarooは、CD-HITを使用して初期遺伝子クラスターを生成し、すべてのサンプルのすべての遺伝子配列のコレクションをクラスター化する。パラログは、各クラスターに各ゲノムを1回だけ存在させることで分割される。断片化または誤って翻訳された遺伝子は、各ノードの近傍情報に基づいて識別およびマージされる。多様な遺伝子ファミリーは、緩和されたアライメントのしきい値と、グラフから得られた近傍情報。最後に、1つまたは複数のサンプルから潜在的に欠落している遺伝子がグラフで識別され、遺伝子の存在を確認するために隣接ノード近くのコンティグシーケンスが検索される。
　Panarooは入力としてGFF3形式のアノテーション付きアセンブリを取得し、Cytoscapeまたは他のグラフ視覚化ソフトウェアで表示するためのGML形式の完全なアノテーション付きグラフだけでなく、遺伝子存在/不在のマトリックス（Roaryなど）を含むさまざまな出力形式を生成する。 Panarooパッケージには、初期の品質管理や、パンゲノムサイズ、遺伝子の増加率と損失率の決定、および一致する遺伝子の特定に使用できる多数の前処理および後処理スクリプトが含まれている。 Panarooは、最新バージョンのpyseerを含む他の多くのパンゲノム解析パッケージと簡単に連携して、表現型と遺伝子の有無の関連付け、および調査対象のグラフの構造的変化を可能にする（ref.24）。パッケージはpythonで記述されており、https://github.com/gtonkinhill/panaroo/からオープンソースMITライセンスの下で入手できる。

Happy to see Panaroo published! Thanks to everyone involved and to the reviewers for managing to find time amongst all the chaos. https://t.co/IEesvqhiS9
— Gerry Tonkin-Hill (@gerrythill) 2020年7月24日

Happy to introduce our new prokaryotic pangenome clustering tool Panaroo! https://t.co/cNIjm6xkK1. We show that annotation errors can have a large impact on pangenome inference and that by sharing information between genomes we can correct for this.
— Gerry Tonkin-Hill (@gerrythill) 2020年1月29日

manual

https://gtonkinhill.github.io/panaroo/#/gettingstarted/installation

インストール

Github

#bioconda（link）
conda create -n panaroo -y
conda activate panaroo
conda install -c bioconda panaroo

> panaroo -h

$ panaroo -h

usage: panaroo [-h] -i INPUT_FILES [INPUT_FILES ...] -o OUTPUT_DIR [-c ID]

[-f FAMILY_THRESHOLD] [--len_dif_percent LEN_DIF_PERCENT]

[--merge_paralogs] [--search_radius SEARCH_RADIUS]

[--refind_prop_match REFIND_PROP_MATCH]

[--mode {strict,moderate,relaxed}]

[--min_trailing_support MIN_TRAILING_SUPPORT]

[--trailing_recursive TRAILING_RECURSIVE]

[--edge_support_threshold EDGE_SUPPORT_THRESHOLD]

[--remove_by_consensus {True,False}]

[--high_var_flag CYCLE_THRESHOLD_MIN]

[--min_edge_support_sv MIN_EDGE_SUPPORT_SV]

[--all_seq_in_graph] [--no_clean_edges] [-a {core,pan}]

[--aligner {prank,clustal,mafft}] [--core_threshold CORE]

[-t N_CPU] [--verbose] [--version]

panaroo: an updated pipeline for pan-genome investigation

optional arguments:

-h, --help show this help message and exit

-t N_CPU, --threads N_CPU

number of threads to use (default=1)

--verbose print additional output

--version show program's version number and exit

Input/output:

-i INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]

input GFF3 files (usually output from running Prokka).

Can also take a file listing each gff file line by

line.

-o OUTPUT_DIR, --out_dir OUTPUT_DIR

location of an output directory

Matching:

-c ID, --threshold ID

sequence identity threshold (default=0.95)

-f FAMILY_THRESHOLD, --family_threshold FAMILY_THRESHOLD

protein family sequence identity threshold

(default=0.7)

--len_dif_percent LEN_DIF_PERCENT

length difference cutoff (default=0.95)

--merge_paralogs don't split paralogs

Refind:

--search_radius SEARCH_RADIUS

the distance in nucleotides surronding the neighbour

of an accessory gene in which to search for it

--refind_prop_match REFIND_PROP_MATCH

the proportion of an accessory gene that must be found

in order to consider it a match

Graph correction:

--mode {strict,moderate,relaxed}

the stringency mode at which to run panaroo. One of

'strict', 'moderate' or 'relaxed' (default='strict')

--min_trailing_support MIN_TRAILING_SUPPORT

minimum cluster size to keep a gene called at the end

of a contig

--trailing_recursive TRAILING_RECURSIVE

number of times to perform recursive trimming of low

support nodes near the end of contigs

--edge_support_threshold EDGE_SUPPORT_THRESHOLD

minimum support required to keep and edge that has

been flagged as a possible mis-assembly

--remove_by_consensus {True,False}

if a gene is called in the same region with similar

sequence a minority of the time, remove it. One of

'True' or 'False'

--high_var_flag CYCLE_THRESHOLD_MIN

minimum number of nested cycles to call a highly

variable gene region (default = 5).

--min_edge_support_sv MIN_EDGE_SUPPORT_SV

minimum edge support required to call structural

variants in the presence/absence sv file

--all_seq_in_graph Retains all DNA sequence for each gene cluster in the

graph output. Off by default as it uses a large amount

of space.

--no_clean_edges Turn off edge filtering in the final output graph.

Gene alignment:

-a {core,pan}, --alignment {core,pan}

Output alignments of core genes or all genes. Options

are 'core' and 'pan'. Default: 'None'

--aligner {prank,clustal,mafft}

Specify an aligner. Options:'prank', 'clustal', and

default: 'mafft'

--core_threshold CORE

Core-genome sample threshold (default=0.95)

実行方法

　１、アノテーションの GFFファイルを作る。ユーティリティスクリプトを使うか、prokkaをループで回す(roary参照)。

mkdir reannotated
run_prokka -i *.fasta -o reannotated

２、pre-processing

任意でmash screenを使った入力配列のクオリティチェックを行う。

mkdir qcresults
panaroo-qc -t 3 --graph_type all -i *.gff --ref_db refseq.genomes.k21.s1000.msh -o qcresults

３、panarooの実行

mkdir results
panaroo -i *.gff -o results -t 8

出力

f:id:kazumaxneo:20200307151907p:plain

struct_presence_absence.Rtab

f:id:kazumaxneo:20200307152000p:plain

この後、独立した結果をマージしたり、ゲノムのリアレンジメントを探したり、潜在的な偽遺伝子、異常な長さの遺伝子、アセンブリやアノテーションの過程で断片化された遺伝子をフィルタリングしたりすることができます。また、特定のゲノムにだけ見つかる構造を視覚化したり、パンゲノムのサイズや遺伝子の獲得・ロス率を推定するスクリプトも提供されています。マニュアルを確認して下さい。ここでも、余裕があれば追記したいと思います。

引用

Producing Polished Prokaryotic Pangenomes with the Panaroo Pipeline

Gerry Tonkin-Hill, Neil MacAlasdair, Christopher Ruis, Aaron Weimann, Gal Horesh, John A. Lees, Rebecca A Gladstone, Stephanie Lo, Christopher Beaudoin, R Andrés Floto, Simon D.W. Frost, Jukka Corander, Stephen D. Bentley1, Julian Parkhill

bioRxiv preprint first posted online Jan. 28, 2020

追記

Producing polished prokaryotic pangenomes with the Panaroo pipeline

Gerry Tonkin-Hill, Neil MacAlasdair, Christopher Ruis, Aaron Weimann, Gal Horesh, John A. Lees, Rebecca A. Gladstone, Stephanie Lo, Christopher Beaudoin, R. Andres Floto, Simon D.W. Frost, Jukka Corander, Stephen D. Bentley, Julian Parkhill
Genome Biology volume 21, Article number: 180 (2020)

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

MMseqs2 コマンド其の3 既存のデータベースをダウンロードするmmseqs databasesコマンド

（Prokaryotes）ドラフトゲノムのポリッシングを行う Polypolish

エキソームのバリアント解析パイプライン EXOME-pipeline

ゲノムアセンブリの品質、完全性、フェーズ評価を行う Merqury

パンゲノム解析ツール PanACoTAのallコマンドを使う

大規模な微生物の比較ゲノミクスのためのモジュラー式のツール PanACoTA

パンゲノム解析によってアノテーション情報の改善を試みる panaroo