SimPan: simulating bacterial populations for pan-genome analysis

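Bacterial genomes are shaped by complex evolutionary histories, including extensive homologous recombination, horizontal gene transfer, gene loss, and gene duplication. A pan-genome, consisting of all genes within a defined set of bacterial genomes, can provide the basis for phylogenetic inference and population studies. Here we introduce PEPPA, a pipeline that can construct a pan-genome from thousands of genetically diversified bacterial genomes. PEPPA implements a combination of tree-based and synteny-based approaches to identify and exclude paralogous genes, which facilitates the construction of whole-genome and core-genome MLST typing schemes. PEPPA also implements similarity-based gene prediction that supports consistent annotation of genes and pseudogenes in individual genomes, and which is nearly as accurate as expert manual curation. The PEPPA package includes PEPPA_parser, which calculates trees based on accessory gene content and on allelic differences in the core-genome MLST. We also introduce SimPan, a novel pipeline for simulating the evolution of bacterial pan-genomes. PEPPA was compared with four state-of-the-art pan-genome pipelines on both empirical and simulated datasets. It showed higher accuracy and specificity than any of the other pipelines, and was almost as fast as they were at calculating a pan-genome. To demonstrate its abilities, PEPPA was used to construct a Streptococcus pan-genome of more than 40,000 genes from 3,170 representative genomes spanning at least 80 species. The resulting gene and allelic trees provide an unprecedented overview of the genomic diversity of this genus.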

PEPPA was covered in an earlier post. This post introduces SimPan.

http://kazumaxneo.hatenablog.com/entry/2020/02/16/073000

Installation

Tested in a Python 3.7 virtual environment (host OS: Ubuntu 18.04 LTS).

Dependencies

SimPan runs in Python with versions >= 3.5 and requires two libraries:

numpy
ete3
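Before cloning, it can help to confirm that the environment meets the requirements listed above. The following is a minimal sketch, not part of SimPan; the helper names missing_modules and python_is_recent_enough are made up for this post.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 5)               # minimum version stated above
REQUIRED_MODULES = ("numpy", "ete3")   # libraries listed above


def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent_enough(minimum=REQUIRED_PYTHON):
    """True when the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    if not python_is_recent_enough():
        print("Python >= 3.5 is required")
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can be installed with pip or conda before running SimPan.py.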

Source code (GitHub)

git clone https://github.com/zheminzhou/SimPan.git
cd SimPan/

> python SimPan.py -h


usage: SimPan.py [-h] [-p PREFIX] [--genomeNum GENOMENUM] [--geneLen GENELEN]
                 [--igrLen IGRLEN] [--backboneBlock BACKBONEBLOCK]
                 [--mobileBlock MOBILEBLOCK] [--operonBlock OPERONBLOCK]
                 [--aveSize AVESIZE] [--nBackbone NBACKBONE] [--nCore NCORE]
                 [--nMobile NMOBILE] [--pBackbone PBACKBONE]
                 [--pMobile PMOBILE] [--tipAccelerate TIPACCELERATE]
                 [--rec REC] [--recLen RECLEN] [--seqRec SEQREC]
                 [--insRec INSREC] [--delRec DELREC] [--noSeq]
                 [--idenOrtholog IDENORTHOLOG] [--idenParalog IDENPARALOG]
                 [--idenDuplication IDENDUPLICATION] [--indelRate INDELRATE]
                 [--indelLen INDELLEN] [--freqStart FREQSTART]
                 [--freqStop FREQSTOP]

SimPan is a simulator for bacterial pan-genome.

Global phylogeny and tree distortions are derived from SimBac and the gene and intergenic sequences are simulated using INDELile.

optional arguments:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        prefix for all intermediate files and outputs. {DEFAULT: SimPan]
  --genomeNum GENOMENUM
                        No of genome in population [DEFAULT: 20]
  --geneLen GENELEN     [negative bionomial with r=2] mean,min,max sizes of genes [DEFAULT: 900,150,6000]
  --igrLen IGRLEN       [negative bionomial] mean,min,max sizes of intergenic regions [DEFAULT: 50,0,300]
  --backboneBlock BACKBONEBLOCK
                        [geometric] mean,min,max number of backbone genes per block [DEFAULT: 3,0,30]
  --mobileBlock MOBILEBLOCK
                        [geometric] mean,min,max number of mobile genes per block [DEFAULT: 10,0,100]
  --operonBlock OPERONBLOCK
                        [geometric] mean,min,max number of continuous genes that share the same coding strand [DEFAULT: 3,0,15]
  --aveSize AVESIZE     average gene number per genome (greater than nBackbone). [DEFAULT: 4500]
  --nBackbone NBACKBONE
                        number of backbone genes (present in common ancestor) per genome. [DEFAULT: 4000]
  --nCore NCORE         sizea of core gene (smaller than the size of backbone genes). [DEFAULT: 3500]
  --nMobile NMOBILE     size of mobile gene pool for accessory genome. [DEFAULT: 20000]
  --pBackbone PBACKBONE
                        propotion of paralogs in backbone (core) genes. [DEFAULT: 0.05]
  --pMobile PMOBILE     propotion of paralogs in mobile (accessory) genes. [DEFAULT: 0.4]
  --tipAccelerate TIPACCELERATE
                        grandient increasing of gene indels in recent times. [DEFAULT: 100]
  --rec REC             expected coverage of homoplastic events in pairwise comparisons. [DEFAULT: 0.05]
  --recLen RECLEN       expected size of homoplastic events. [DEFAULT: 1000]
  --seqRec SEQREC       Use homoplastic events to infer sequences. Use 0 to disable [DEFAULT: 1]
  --insRec INSREC       Use homoplastic events to infer gene insertions. Use 0 to disable [DEFAULT: 1]
  --delRec DELREC       Use homoplastic events to infer gene deletions. Use 0 to disable [DEFAULT: 1]
  --noSeq               Do not infer sequence but only the gene presence/absence. [DEFAULT: False]
  --idenOrtholog IDENORTHOLOG
                        average nucleotide identities for orthologous genes. [DEFAULT: 0.98]
  --idenParalog IDENPARALOG
                        average nucleotide identities for paralogous genes. [DEFAULT: 0.6]
  --idenDuplication IDENDUPLICATION
                        average nucleotide identities for recent gene duplications. [DEFAULT: 0.995]
  --indelRate INDELRATE
                        average frequency of indel events relative to mutation rates. [DEFAULT: 0.01]
  --indelLen INDELLEN   average size of short indel events within each gene. [DEFAULT: 10]
  --freqStart FREQSTART
                        frequencies of start codons of ATG,GTG,TTG. DEFAULT: 0.83,0.14,0.03
  --freqStop FREQSTOP   frequencies of stop codons of TAA,TAG,TGA. DEFAULT: 0.63,0.08,0.29
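Several options above take a "mean,min,max" triple together with a named distribution. To get a feel for what a setting such as --geneLen 900,150,6000 might produce, here is a rough sketch that draws bounded negative binomial values with r=2. It is only an illustration: the function names are hypothetical, and redrawing out-of-range values is an assumption of this sketch, so SimPan itself may handle the bounds differently.

```python
import math
import random


def parse_triple(text):
    """Split a 'mean,min,max' option value into three numbers."""
    mean, low, high = (float(part) for part in text.split(","))
    if not low <= mean <= high:
        raise ValueError("expected min <= mean <= max, got %r" % text)
    return mean, low, high


def geometric(rng, p):
    """Number of failures before the first success (success probability p)."""
    u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    return int(math.log(u) / math.log(1.0 - p))


def bounded_neg_binomial(rng, mean, low, high, r=2):
    """Draw a negative binomial value and redraw until it lies in [low, high].

    Assumes mean > 0. Redrawing is an assumption of this sketch, and it
    shifts the realised mean slightly away from `mean`.
    """
    p = r / (r + mean)  # unbounded mean is r * (1 - p) / p == mean
    while True:
        # A negative binomial with integer r is a sum of r geometric draws.
        value = sum(geometric(rng, p) for _ in range(r))
        if low <= value <= high:
            return value


rng = random.Random(1)
mean, low, high = parse_triple("900,150,6000")  # default of --geneLen
lengths = [bounded_neg_binomial(rng, mean, low, high) for _ in range(5)]
print(lengths)
```

Only the standard library is used here so the snippet runs without numpy; with numpy available, numpy.random.Generator.negative_binomial could replace the hand-written sampler.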

Usage

Simulate 10 genomes, each containing about 50 genes on average.

python SimPan.py --aveSize 50 --nBackbone 30 --nMobile 1000 -p test --genomeNum 10
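When several parameter sets need to be simulated, the same command can be assembled from Python. This is a small sketch under the assumption that it is run from inside the cloned SimPan/ directory. build_simpan_command and run_simpan are names made up for this post. Only flags shown in the help output above are used.

```python
import subprocess
import sys


def build_simpan_command(options, script="SimPan.py"):
    """Turn {'aveSize': 50, ...} into an argument list for subprocess."""
    command = [sys.executable, script]
    for name, value in options.items():
        # The help output lists -p as the short form of --prefix.
        flag = "-p" if name == "prefix" else "--" + name
        command.extend([flag, str(value)])
    return command


def run_simpan(options, cwd="."):
    """Run SimPan.py in `cwd` and raise if it exits with an error."""
    subprocess.run(build_simpan_command(options), cwd=cwd, check=True)


# Same settings as the command shown above.
options = {
    "aveSize": 50,
    "nBackbone": 30,
    "nMobile": 1000,
    "prefix": "test",
    "genomeNum": 10,
}
print(" ".join(build_simpan_command(options)[1:]))
```

Calling run_simpan(options) then starts the simulation. Looping over different prefix values keeps the outputs of each run apart.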

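First, SimBac is used to simulate the global phylogeny and the recombination events of the bacterial genomes. Next, random indel events generate the gene content of both the core and the accessory genomes. Finally, INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/) is used to fill in the sequences of these pan genes.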
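The second step, in which random gene gains and losses produce core and accessory gene content, can be pictured with a toy model. The sketch below is not SimPan's algorithm. It ignores the phylogeny and recombination, treats every genome as independent, and all names and parameters (toy_gene_content, loss_prob, gain_count) are invented for illustration.

```python
import random


def toy_gene_content(n_genomes, n_backbone, n_mobile, loss_prob, gain_count, seed=0):
    """Return one set of gene names per genome.

    Each genome starts from the same backbone, loses each backbone gene
    independently with probability `loss_prob`, and gains `gain_count`
    genes drawn from a shared mobile pool.
    """
    rng = random.Random(seed)
    backbone = ["B%d" % i for i in range(n_backbone)]
    mobile = ["M%d" % i for i in range(n_mobile)]
    genomes = []
    for _ in range(n_genomes):
        kept = {gene for gene in backbone if rng.random() >= loss_prob}
        gained = set(rng.sample(mobile, gain_count))
        genomes.append(kept | gained)
    return genomes


def pan_and_core(genomes):
    """Pan-genome = genes seen in any genome; core = genes seen in all."""
    pan = set().union(*genomes)
    core = set.intersection(*genomes)
    return pan, core


genomes = toy_gene_content(n_genomes=10, n_backbone=30, n_mobile=1000,
                           loss_prob=0.05, gain_count=20)
pan, core = pan_and_core(genomes)
print(len(pan), "pan genes,", len(core), "core genes")
```

Even in this simplified form, the pan-genome grows with every added genome while the core shrinks, which is the behaviour a pan-genome pipeline has to recover.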

For details of the output files, see the GitHub repository.

Citation

Accurate reconstruction of the pan- and core- genomes of bacteria with PEPPA

Zhemin Zhou* and Mark Achtman

bioRxiv preprint, Posted January 03, 2020