POCP-nf: a Nextflow pipeline for calculating the percentage of conserved proteins in bacteria

Updated 2024/05/08

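Advances in sequencing technology have led to a rapid increase in the number of bacterial genomes, creating a need for robust taxonomic classification methods. The Percentage Of Conserved Proteins (POCP), first proposed by Qin et al. (2014), is a valuable metric for assessing prokaryotic genus boundaries. Here, a computational pipeline for automated POCP calculation is presented, with the aim of improving reproducibility and ease of use in taxonomic studies. The POCP-nf pipeline uses DIAMOND to speed up protein alignment while achieving sensitivity comparable to BLASTP. The pipeline is implemented in Nextflow, supports Conda and Docker, and is available on GitHub at https://github.com/hoelzer/pocp. The open-source code can be easily adapted to a variety of prokaryotic genome and protein datasets. Detailed documentation and usage instructions are in the repository.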

From the repository

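Advances in sequencing technology have led to a rapid increase in the number of bacterial genomes, creating a need for robust taxonomic classification methods. POCP (Percentage Of Conserved Proteins), first proposed by Qin, Xie et al. 2014, is a valuable metric for assessing prokaryotic genus boundaries. A prokaryotic genus can be defined as a group of species in which all pairwise POCP values are higher than 50%. Here, a computational pipeline for automated POCP calculation is presented, with the aim of improving reproducibility and ease of use in taxonomic studies.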
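The genus criterion above (all pairwise POCP values higher than 50%) can be illustrated with a minimal Python sketch. It is not part of POCP-nf: the function name same_genus_by_pocp, the dictionary layout keyed by genome pairs, and the example values are all assumptions made for illustration.

```python
from itertools import combinations

def same_genus_by_pocp(pocp_values, threshold=50.0):
    """Return True if every genome pair has a POCP value above the threshold.

    pocp_values maps frozenset({genome_a, genome_b}) -> POCP in percent.
    This layout is hypothetical and is not the pipeline's output format.
    """
    genomes = sorted({g for pair in pocp_values for g in pair})
    for a, b in combinations(genomes, 2):
        value = pocp_values.get(frozenset((a, b)))
        # A missing pair is treated as not supporting the same genus.
        if value is None or value <= threshold:
            return False
    return True

# Made-up example values, for illustration only.
example = {
    frozenset(("A", "B")): 62.1,
    frozenset(("A", "C")): 55.0,
    frozenset(("B", "C")): 48.3,
}
print(same_genus_by_pocp(example))
```

In this made-up example the B/C pair falls below 50%, so the three genomes would not all be placed in one genus under this criterion.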

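As input, the pipeline takes either one amino-acid FASTA file per genome, such as those produced by Prokka or Bakta, or genome FASTA files. It uses DIAMOND in blastp mode to compute all-vs-all pairwise alignments between all protein sequences, and uses this information to calculate POCP according to the original formula of Qin, Xie et al. 2014. One-vs-all comparisons are also possible.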
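For reference, the formula of Qin, Xie et al. 2014 can be written as follows. Here C_1 and C_2 are the numbers of conserved proteins in the two genomes being compared, and T_1 and T_2 are the total numbers of proteins in each genome. The symbol names follow the usual presentation of the formula and do not appear elsewhere in this post.

```latex
\mathrm{POCP} = \frac{C_1 + C_2}{T_1 + T_2} \times 100\%
```

As a made-up worked example, two genomes with 3,000 and 3,200 proteins, of which 1,800 and 1,900 are conserved, give POCP = (1800 + 1900) / (3000 + 3200) × 100% ≈ 59.7%.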

Installation

Dependencies

nextflow
To install the dependencies (such as Prokka and DIAMOND), you can choose between conda, mamba, docker or singularity. The author recommends using docker.

GitHub

nextflow pull hoelzer/pocp

> nextflow run hoelzer/pocp -r 2.3.0 --help

N E X T F L O W ~ version 22.10.4

Launching `https://github.com/hoelzer/pocp` [disturbed_jones] DSL2 - revision: 092166e822 [2.3.0]

Profile: standard

Current User: kazu

Nextflow-version: 22.10.4

Starting time: 09-12-2022 09:58 UTC

Workdir location:

/home/kazu/work

____________________________________________________________________________________________

P.O.C.P - calculate percentage of conserved proteins.

A prokaryotic genus can be defined as a group of species with all pairwise POCP values higher than 50%.

Usage example:

nextflow run hoelzer/pocp -r 2.3.0 --genomes '*.fasta'

nextflow run hoelzer/pocp -r 2.3.0 --proteins '*.faa'

Use the following commands to check for latest pipeline versions:

nextflow pull hoelzer/pocp

nextflow info hoelzer/pocp

Input

All-vs-all comparisons (default):

--genomes '*.fasta' -> one genome per file

--proteins '*.faa' -> one protein multi-FASTA per file

..change above input to csv: --list

Perform one-vs-all comparison against the additionally defined genome or protein FASTA (optional):

--genome genome.fasta -> one genome FASTA

--protein proteins.faa -> one protein multi-FASTA

General Options:

--gcode Genetic code for Prokka annotation [default: 0]

--cores Max cores per process for local use [default: 32]

--max_cores Max cores (in total) for local use [default: 128]

--memory Max memory for local use [default: 4 GB]

--output Name of the result folder [default: results]

Special Options (Danger Zone!):

ATTENTION: changing these parameters will lead to different POCP values.

If you have good reasons to do that, you must report the changed parameters together with the used pipeline version.

--evalue Evalue for DIAMOND protein search [default: 1e-5]

--seqidentity Sequence identity for DIAMOD alignments [default: 0.4]

--alnlength Alignment length for DIAMOND hits [default: 0.5]

--blastp Use BLASTP instead of DIAMOND for protein alignment (slower, as in the original 2014 publication) [default: false]

Nextflow options:

-with-report rep.html cpu / ram usage (may cause errors)

-with-dag chart.html generates a flowchart for the process tree

-with-timeline time.html timeline (may cause errors)

-resume resume a previous calculation w/o recalculating everything (needs the same run command and work dir!)

Caching:

--condaCacheDir Location for storing the conda environments [default: conda]

--singularityCacheDir Location for storing the Singularity images [default: conda]

-w Working directory for all intermediate results [default: work]

Execution/Engine profiles:

The pipeline supports profiles to run via different Executers and Engines e.g.: -profile local,conda

Executer (choose one):

local

slurm

Engines (choose one):

conda

mamba

docker

singularity

Per default: -profile local,conda is executed.

Test run

#genome with local and docker
nextflow run hoelzer/pocp -r 2.3.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' -profile local,docker

#protein with SLURM execution and conda
nextflow run hoelzer/pocp -r 2.3.0 --proteins $HOME'/.nextflow/assets/hoelzer/pocp/example/*.faa' -profile slurm,conda

By default, intermediate files are stored in the working directory work, and the output is saved to results.

Example output

> ls -lth

> column pocp-matrix.tsv

Usage

Specify FASTA files of either genomes or proteins obtained by gene prediction.

#genome, local and docker
nextflow run hoelzer/pocp -r 2.3.0 --genomes '<path>/<to>/*.fasta' -profile local,docker

#protein, local and mamba
nextflow run hoelzer/pocp -r 2.3.0 --proteins '<path>/<to>/*.faa' -profile local,mamba

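From the repository

This pipeline uses DIAMOND in blastp mode to identify orthologous proteins between species. The original POCP paper computed alignments with BLASTP. However, DIAMOND is faster, which is an advantage when calculating POCP values for large input datasets, and it can also achieve the sensitivity of BLASTP, particularly when using the --ultra-sensitive mode, which is enabled by default in the pipeline (Buchfink (2021)).
Another study comparing different alignment programs found that DIAMOND offers the best compromise between speed, sensitivity, and quality when a sensitivity option other than the default setting is chosen (Hernández-Salmerón and Moreno-Hagelsieb (2020)). POCP-nf therefore uses DIAMOND instead of BLASTP as a more modern solution for computing alignments.
The original parameters defined by Qin, Xie et al. 2014 are used as defaults.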

--evalue 1e-5
--seqidentity 0.4
--alnlength 0.5
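To show how these three thresholds combine, here is a minimal Python sketch that filters tab-separated alignment hits and computes POCP from the resulting counts. It is not the pipeline's own code. It assumes the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, ..., evalue, bitscore), assumes the alignable region is the alignment-length column divided by the query protein length, and uses hypothetical names (conserved_queries, pocp, query_lengths) and made-up hit lines.

```python
def conserved_queries(hit_lines, query_lengths,
                      evalue=1e-5, seqidentity=0.4, alnlength=0.5):
    """Return the set of query proteins with at least one hit passing all cutoffs.

    hit_lines: iterable of tab-separated lines in 12-column BLAST tabular layout.
    query_lengths: dict mapping query ID -> protein length (hypothetical helper
    input, needed because the 12-column layout has no query-length column).
    """
    conserved = set()
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid = fields[0]
        pident = float(fields[2])    # percent identity, 0-100
        aln_len = int(fields[3])     # alignment length
        hit_evalue = float(fields[10])
        # Assumption: alignable fraction = alignment length / query length.
        aln_fraction = aln_len / query_lengths[qseqid]
        if (hit_evalue < evalue
                and pident / 100.0 > seqidentity
                and aln_fraction > alnlength):
            conserved.add(qseqid)
    return conserved

def pocp(c1, c2, t1, t2):
    """POCP in percent from conserved (c1, c2) and total (t1, t2) protein counts."""
    return 100.0 * (c1 + c2) / (t1 + t2)

# Made-up hits: only p1 passes all three cutoffs.
hits = [
    "p1\tq1\t85.0\t90\t0\t0\t1\t90\t1\t90\t1e-30\t200",
    "p2\tq2\t35.0\t90\t0\t0\t1\t90\t1\t90\t1e-30\t200",  # identity too low
    "p3\tq3\t85.0\t40\t0\t0\t1\t40\t1\t40\t1e-30\t200",  # alignment too short
    "p4\tq4\t85.0\t90\t0\t0\t1\t90\t1\t90\t0.01\t30",    # e-value too high
]
lengths = {"p1": 100, "p2": 100, "p3": 100, "p4": 100}
print(sorted(conserved_queries(hits, lengths)))
```

Note that --seqidentity is given as a fraction (0.4), whereas the identity column in tabular output is a percentage, hence the division by 100 in the sketch.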

Citation

POCP-nf: an automatic Nextflow pipeline for calculating the percentage of conserved proteins in bacterial taxonomy

Martin Hölzer

Bioinformatics, Volume 40, Issue 4, April 2024