パンゲノム解析を行う PEPPA - macでインフォマティクス

2020 5/24 インストール手順修正、ヘルプ更新

　細菌ゲノムは、広範な相同組換え、水平遺伝子導入、遺伝子損失、遺伝子重複などの複雑な進化の歴史によって形作られている。細菌ゲノムの定義されたセット内のすべての遺伝子で構成されるパンゲノムは、系統学的推論および集団研究の基礎を提供できる。ここでは、遺伝的に多様化された数千の細菌ゲノムからパンゲノムを構築できるパイプラインであるPEPPAを紹介する。 PEPPAは、ツリーベースおよびシンテニーベースのアプローチを組み合わせて実装し、パラロガス遺伝子を特定および除外する。これにより、全ゲノムおよびコアゲノムMLSTタイピングスキームの構築が容易になる。 PEPPAは、個々のゲノム内の遺伝子および偽遺伝子の一貫したアノテーションをサポートする類似性ベースの遺伝子予測も実装する。これは、専門の手動キュレーションとほぼ同じくらい正確である。 PEPPAパッケージには、PEPPA_parserが含まれている。これは、アクセサリー遺伝子の内容とMLSTコアゲノムの対立遺伝子の違いに基づいてツリーを計算する。また、細菌のパンゲノムの進化をシミュレートするための新しいパイプラインであるSimPanも紹介する。 PEPPAは、経験的およびシミュレートされたデータセットの両方で、4つの最先端のパンゲノムパイプラインと比較された。それは他のどのパイプラインよりも高い精度と特異性を示し、パンゲノムの計算ではそれらとほぼ同じ速さであった。その能力の実証として、PEPPAを使用して、少なくとも80種にわたる3,170の代表的なゲノムから40,000を超える遺伝子の連鎖球菌パンゲノムを構築した。結果として生じる遺伝子と対立遺伝子のツリーは、この属のゲノムの多様性の前例のない概要を提供する。

インストール

ubuntu18.04のdocker環境でテストした（ホストOSはmacos10.14）。

ビルド依存

Required dependencies

mmseqs2
ncbi-blast+
diamond
rapidnj

Optional dependencies

fasttree2 (For --orthology ml)

本体　Github

conda create -n pepra -y
conda activate pepra
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install mmseqs2 blast diamond rapidnj fasttree -y
#check
command -v mmseqs blastn rapidnj diamond fasttree

#PEPRA
pip install bio-peppa

> PEPPA -h

$ PEPPA -h

2020-05-24 12:15:16.684479 COMMAND: /Users/kazu/anaconda3/envs/pepra/bin/PEPPA -h

usage: PEPPA [-h] [-p PREFIX] [-g GENES] [-P PRIORITY] [-t N_THREAD] [-o nj,ml,sbh] [-n] [--min_cds MIN_CDS] [--incompleteCDS INCOMPLETECDS] [--gtable GTABLE] [--clust_identity CLUST_IDENTITY] [--clust_match_prop CLUST_MATCH_PROP] [--nucl] [--match_identity MATCH_IDENTITY]

[--match_prop MATCH_PROP] [--match_len MATCH_LEN] [--match_prop1 MATCH_PROP1] [--match_len1 MATCH_LEN1] [--match_prop2 MATCH_PROP2] [--match_len2 MATCH_LEN2] [--match_frag_prop MATCH_FRAG_PROP] [--match_frag_len MATCH_FRAG_LEN] [--link_gap LINK_GAP] [--link_diff LINK_DIFF]

[--allowed_sigma ALLOWED_SIGMA] [--pseudogene PSEUDOGENE] [--untrusted UNTRUSTED] [--continue] [--feature FEATURE] [--noncoding] [--metagenome] [--testunit]

[GFF [GFF ...]]

PEPPA.py

(1) Retieve genes and genomic sequences from GFF files and FASTA files.

(2) Group genes into clusters using mmseq.

(3) Map gene clusters back to genomes.

(4) Discard paralogous alignments.

(5) Discard orthologous clusters if they had regions which overlapped with the regions within other sets that had greater scores.

(6) Re-annotate genomes using the remained of orthologs.

positional arguments:

GFF [REQUIRED] GFF files containing both annotations and sequences.

If you have sequences and GFF annotations in separate files,

they can also be defined as: <GFF>,<fasta>

optional arguments:

-h, --help show this help message and exit

-p PREFIX, --prefix PREFIX

[Default: PEPPA] prefix for the outputs.

-g GENES, --genes GENES

[optional] comma delimited filenames of fasta files containing additional genes.

-P PRIORITY, --priority PRIORITY

[optional] comma delimited, ordered list of GFFs or gene fasta files that are more reliable than others.

Genes contained in these files are preferred in all stages.

-t N_THREAD, --n_thread N_THREAD

[Default: 8] Number of threads to use. Default: 8

-o nj,ml,sbh, --orthology nj,ml,sbh

[Default: nj] Method to define orthologous groups.

nj [default], ml (for small dataset) or sbh (extremely large datasets)

-n, --noNeighborCheck

[Default: False] Flag to disable checking of neighborhood for paralog splitting.

--min_cds MIN_CDS [Default: 120] Minimum length for a gene to be used in similarity searches.

--incompleteCDS INCOMPLETECDS

[Default: ''] Allowed types of imperfection for reference genes.

's': allows unrecognized start codon.

'e': allows unrecognized stop codon.

'i': allows stop codons in the coding region.

'f': allows frameshift in the coding region.

Multiple keywords can be used together. e.g., use 'sife' to allow random sequences.

--gtable GTABLE [Default: 11] Translate table to Use. Only support 11 and 4 (for Mycoplasma)

--clust_identity CLUST_IDENTITY

minimum identities of mmseqs clusters. Default: 0.9

--clust_match_prop CLUST_MATCH_PROP

minimum matches in mmseqs clusters. Default: 0.8

--nucl disable Diamond search. Fast but less sensitive when nucleotide identities < 0.9

--match_identity MATCH_IDENTITY

minimum identities in BLAST search. Default: 0.65

--match_prop MATCH_PROP

minimum match proportion for normal genes in BLAST search. Default: 0.5

--match_len MATCH_LEN

minimum match length for normal genes in BLAST search. Default: 250

--match_prop1 MATCH_PROP1

minimum match proportion for short genes in BLAST search. Default: 0.8

--match_len1 MATCH_LEN1

minimum match length for short genes in BLAST search. Default: 100

--match_prop2 MATCH_PROP2

minimum match proportion for long genes in BLAST search. Default: 0.4

--match_len2 MATCH_LEN2

minimum match length for long genes in BLAST search. Default: 400

--match_frag_prop MATCH_FRAG_PROP

minimum proportion of each fragment for fragmented matches. Default: 0.25

--match_frag_len MATCH_FRAG_LEN

minimum length of each fragment for fragmented matches. Default: 50

--link_gap LINK_GAP consider two fragmented matches within N bases as a linked block. Default: 600

--link_diff LINK_DIFF

form a linked block when the covered regions in the reference gene

and the queried genome differed by no more than this value. Default: 1.5

--allowed_sigma ALLOWED_SIGMA

Allowed number of sigma for paralogous splitting.

The larger, the more variations are kept as inparalogs. Default: 3.

--pseudogene PSEUDOGENE

A match is reported as a pseudogene if its coding region is less than a proportion of the reference gene. Default: 0.8

--untrusted UNTRUSTED

FORMAT: l,p; A gene is not reported if it is not greater than "l" and present in less than "p" of GFF files. Default: 450,0.35

--continue continue from a previously stopped run.

--feature FEATURE feature to extract. Be cautious to change this value. DEFAULT: CDS

--noncoding Set to noncoding mode. This is still under development. Equals to

"--nucl --incompleteCDS sife"

--metagenome Set to metagenome mode. This is still under development. Equals to

"--nucl --incompleteCDS sife --clust_identity 0.99 --clust_match_prop 0.8 --match_identity 0.98 --orthology sbh"

--testunit download four E. coli ST131 genomes for testing of PEPPA.

テストラン

#exampleディレクトリができ、４つのGFFがダウンロードされる。
PEPPA --testunit
#run
PEPPA -p examples/ST131 -P examples/GCF_000010485.combined.gff.gz examples/*.gff.gz

スクリプトは以下の３ステップを実行している。

１、PEPPAの実行。オプション"-P"で信頼できるゲノムのgffファイルを指定し、コマンドの最後に他のゲノムのgffファイルを指定する。

PEPPA -P examples/GCF_000010485.combined.gff.gz \
 --min_cds 60 \
 --incompleteCDS s \
 -p examples/ST131 \
 examples/*.gff.gz

-P Comma delimited, ordered list of GFFs or gene fasta files that are more reliable than others. Genes contained in these files are preferred in all stages.
-p prefix for the outputs [Default: PEPPA].
--min_cds Minimum length for a gene to be used in similarity searches [Default: 150].
--incompleteCDS Allowed types of imperfection for reference genes [Default: ''] .
's': allows unrecognized start codon.
'e': allows unrecognized stop codon.
'i': allows stop codons in the coding region.
'f': allows frameshift in the coding region.
Multiple keywords can be used together. e.g., use 'sife' to allow random sequences.

出力

PEPPA.gff（PEPRAによって予測された全てのパン遺伝子がGFF3で記述されている）とalleles.fna。

２、サマリーの出力

#for PEPPA predicted CDSs and pseudogenes
PEPPA_parser -g examples/ST131.PEPPA.gff -s examples/PEPPA_out -m -t -c -a 95

#for PEPPA predicted CDSs only
PEPPA_parser -g examples/ST131.PEPPA.gff -s examples/PEPPA_out -m -t -c -a 95 -P

全ゲノムの遺伝子のpresence/absenceの行列ファイルとこれに基づくツリーファイル、コア遺伝子とアクセサリ遺伝子のrare fraction（wiki） curveファイルなどが出力される。

引用

Accurate reconstruction of the pan- and core- genomes of bacteria with PEPPA

Zhemin Zhou* and Mark Achtman

bioRxiv preprint, Posted January 03, 2020