BiosyntheticSPAdes: reconstructing secondary-metabolite biosynthetic gene clusters from assembly graphs

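Many tools exist for assembling microbial genomes or metagenomes (Simpson et al., 2009; Li et al., 2015; Nurk et al., 2017), but all of them have limitations when it comes to contigs containing long genes that encode proteins with repetitive domains. Because long genes are often scattered across multiple contigs in a fragmented assembly, existing gene prediction tools (Besemer et al., 2005; Delcher et al., 2007; Pati et al., 2010; Hyatt et al., 2010) cannot predict them. The challenge of assembling long genes into a single contig is illustrated by non-ribosomal peptide synthetases (NRPSs), polyketide synthases (PKSs), and other genes that form part of the biosynthetic gene clusters (BGCs) encoding the production of antibiotics and other natural products. A BGC typically contains multiple contiguous genes that take part in a single metabolic pathway responsible for synthesizing a natural product. NRPS BGCs encode non-ribosomal peptides (NRPs), which are built from amino acids, and PKS BGCs encode polyketides (PKs), which are built from ketide groups. Mixed NRPS/PKS BGCs contain both NRPS-specific and PKS-specific domains, and their natural products are peptide-polyketide hybrids (Cane et al., 1999). Klassen and Currie (2012) showed that long, highly repetitive NRPSs and PKSs account for a large share of the fragmentation in microbial assemblies.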
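This paper focuses on NRPSs (Kersten et al., 2011; Mohimani et al., 2014; Medema et al., 2014) because NRPs are an important class of natural medicines (Newman and Cragg, 2016) that, compared with other classes of natural products, are the best suited to downstream peptidogenomics analysis. According to the antiSMASH database (https://antismash-db.secondarymetabolites.org/#!/stats), NRPS BGCs make up 34% of all BGCs in publicly available genomes. Because NRPSs are so common, and because downstream peptidogenomics analysis of NRPs is severely impaired by fragmented assemblies, most examples in this paper concern NRPs. In addition to NRPS BGCs, biosyntheticSPAdes can also be applied to PKS BGCs and mixed NRPS-PKS BGCs (NRPS, PKS, and mixed NRPS-PKS BGCs make up the majority of BGCs in the MIBiG database). Klassen and Currie (2012) showed that fragmented ORFs in genome assemblies are highly enriched in NRPSs and PKSs, which are therefore a prominent source of breakpoints in (meta)genome assemblies. Most genomes contain NRPS, PKS, or mixed NRPS-PKS BGCs (in some species more than 30% of the genome is devoted to these BGCs), and these BGCs are of direct interest to a large research community, so there is good reason to provide a dedicated assembler for them.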
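NRPSs are large modular protein complexes containing multiple highly similar adenylation domains (A-domains), each of which recruits an amino acid into the NRP according to its substrate specificity (Stachelhaus and Marahiel, 1999). NRPSs are often accompanied by adjacent genes that contribute to NRP synthesis, transport, and regulation, and together they form an NRP BGC. NRP BGCs are typically long: the average length is about 60 kb, and some exceed 100 kb. Assembling an NRP BGC into a single contig is a key step in natural product discovery by genome mining (Weber et al., 2015) and peptidogenomics (Mohimani et al., 2014; Mohimani and Pevzner, 2016; Mohimani et al., 2017; Gurevich et al., 2018).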

(Two paragraphs omitted)

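Because non-ribosomal peptide synthetases from different species across the microbial world often share similar domains, this challenge is amplified further in metagenomic assembly. Multiple domains collapse onto a single edge of the assembly graph, which makes them hard to assemble into a single contig (Coates et al., 2014). As a result, although metagenomes are a gold mine for antibiotic discovery, only a limited number of antibiotics have been discovered from metagenomic datasets so far (Freeman et al., 2012; Donia et al., 2014; Donia and Fischbach, 2015).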
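To make the collapse concrete, here is a toy sketch (it is not BiosyntheticSPAdes code). A made-up sequence carries two copies of a domain-like repeat; the k-mers inside the repeat occur twice, so a de Bruijn graph stores them only once, and the node at the end of the repeat has two possible continuations. The sequence, the repeat, and k=6 are illustrative assumptions.

```python
from collections import Counter, defaultdict


def kmer_counts(sequence, k):
    """Count how often each k-mer occurs in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repeated_kmers(sequence, k):
    """K-mers seen more than once; a de Bruijn graph keeps a single copy of each."""
    return {kmer: n for kmer, n in kmer_counts(sequence, k).items() if n > 1}


def successors(sequence, k):
    """Map each (k-1)-mer node to the set of nodes that follow it in the sequence."""
    graph = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical toy "gene": the same repeat twice, separated by unique linkers.
REPEAT = "ACGTTGCA"
TOY_GENE = "GGATC" + REPEAT + "TTAGG" + REPEAT + "CCGAT"

repeated = repeated_kmers(TOY_GENE, k=6)
# Nodes with more than one successor are where the assembler cannot tell
# which copy of the repeat it is leaving.
branches = {node: nxt for node, nxt in successors(TOY_GENE, k=6).items() if len(nxt) > 1}

print(sorted(repeated))
print(branches)
```

In a real metagenome the repeats are whole A-domains shared between species rather than 8-base strings, but the ambiguity at the branching node is the same kind of problem.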
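Although reconstructing a long NRPS BGC from individual contigs is difficult, the structure of the assembly graph often provides clues about how to combine different contigs into an intact BGC. The authors describe BiosyntheticSPAdes, a tool for assembling NRPS BGCs in the assembly graphs built by the SPAdes (Bankevich et al., 2012) and metaSPAdes (Nurk et al., 2017) assemblers. The paper shows how BiosyntheticSPAdes contributes to NRPS BGC discovery in a variety of genomes and metagenomes.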
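The idea of reading candidate clusters off the graph can be sketched as path enumeration. The snippet below is a deliberately simplified illustration and not the authors' algorithm: the graph, the contig names, and the rule that a path never visits the same contig twice are all assumptions made for the example.

```python
def simple_paths(graph, start, end, path=None):
    """Yield every path from start to end that visits no contig twice."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for nxt in graph.get(start, ()):
        if nxt not in path:  # assumption: forbid revisiting a contig
            yield from simple_paths(graph, nxt, end, path)


# Hypothetical assembly graph: a shared repeat edge fans out into two
# alternative contigs that both rejoin at contig_D.
toy_graph = {
    "contig_A": ["repeat_R"],
    "repeat_R": ["contig_B", "contig_C"],
    "contig_B": ["contig_D"],
    "contig_C": ["contig_D"],
}

candidates = list(simple_paths(toy_graph, "contig_A", "contig_D"))
for candidate in candidates:
    print(" -> ".join(candidate))
```

Each printed path is one way of stitching the fragmented contigs into a putative full-length cluster; choosing among such candidates is the part that needs the domain annotations and ranking described in the paper.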

I’m already very excited to give a keynote speech at the @metasub meeting in Istanbul later this month. I will discuss two of our new algorithms Minerva (development led by @dcdanko) and BiosyntheticSPAdes (development led by @meleshko_da) and their applications #metasub2019
— Iman Hajirasouliha (@hajirasouliha) August 2, 2019

BiosyntheticSPAdes: a new SPAdes flavour to reconstruct putative full-length BGCs from otherwise fragmented draft genome or metagenome assemblies. Was great to collaborate with the Pevzner group! @VTracanna https://t.co/BMzJskh91z
— Marnix Medema (@marnixmedema) June 4, 2019

Added January 6, 2020

Finally, SPAdes 3.14 is out! Hybrid transcriptome assembly, bgcSPAdes for identifying Biosynthetic Gene Clusters, plasmid assembly from metagenomic data, new supplementary tools and many more @ https://t.co/NanTNnCQoz
— SPAdes assembler (@spadesassembler) December 30, 2019

Warning (from the authors)
We recommend you further validate the BGCs produced by the ranking pipeline with wetlab experiments. This is a reference based method and is not indicated for completely novel BGCs.

Installation

Dependencies

The HMMER suite (3.1b2+)
The (processed) Pfam database. For this, download the latest Pfam-A.hmm.gz file from the Pfam website (direct download), uncompress it and process it using the hmmpress command.
biopython
numpy
scipy
antiSMASH
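Before running anything, it can help to confirm that the Pfam database really was pressed. The sketch below is my own helper, not part of the pipeline: it checks that hmmpress is on the PATH and that the four index files hmmpress writes next to the .hmm file (.h3f, .h3i, .h3m, .h3p) exist. The function names and the example path are hypothetical.

```python
import shutil
from pathlib import Path

# Index files that hmmpress writes alongside the input .hmm file.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_executables(names=("hmmpress",)):
    """Return the executables from `names` that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_pressed_files(pfam_hmm):
    """Return the names of hmmpress index files that do not exist yet."""
    pfam_hmm = Path(pfam_hmm)
    return [
        pfam_hmm.name + suffix
        for suffix in PRESSED_SUFFIXES
        if not Path(str(pfam_hmm) + suffix).exists()
    ]


if __name__ == "__main__":
    # Hypothetical location of the uncompressed Pfam file; adjust as needed.
    pfam = "Pfam-A.hmm"
    print("missing executables:", missing_executables())
    print("missing index files:", missing_pressed_files(pfam))
```

If the second list is not empty, run hmmpress on the uncompressed Pfam-A.hmm again.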

#Create an environment and install everything in one go
conda create -n biosyntheticspades python=3.7 numpy scipy biopython hmmer antismash -c bioconda -y 

#Download the prebuilt antismash-DB database
#login: bioinfoshare03
#passwd: H9U66Vn&
sftp bioinfoshare03@sftp.ab.wur.nl:/data/preprocessed_antismash-db.tar.gz .
tar -zxvf preprocessed_antismash-db.tar.gz
mv mnt/scratch/old_scratch/traca001/scripts/genespades/data/preprocessed_antismash-DB/ antismash-DB/
#Point the run at this antismash-DB/ directory

biosyntheticSpades Figshare

https://figshare.com/articles/Code_snapshot_of_biosyntheticSPAdes_for_review/6948260

Download it and extract the archive.

This one is the biosyntheticSpadesRankingPipeline.

> python biosyntheticSPAdes_ranking_pipeline.py -h

$ python biosyntheticSpadesRankingPipeline/biosyntheticSPAdes_ranking_pipeline.py -h
usage: biosyntheticSPAdes_ranking_pipeline.py [-h] -i INPUTDIR [-o OUTPUTDIR]
                                              -asdb ANTISMASHDB -bss
                                              BIOSYNTHETICSPADESSTATISTICS
                                              [-v] [-r]

Main wrapper, runs all the different steps.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTDIR, --inputDir INPUTDIR
                        Path to Biosynthetic-SPAdes output
  -o OUTPUTDIR, --outputDir OUTPUTDIR
                        Path to the output directory
  -asdb ANTISMASHDB, --antismashdb ANTISMASHDB
                        Path to the antismashdb folder
  -bss , --biosyntheticSpadesStatistics
                        Full path to the biosyntheticSpades
                        [bgc_statistics.txt] output file.
  -v, --verbose         Sets the amount of output on screen
  -r, --rerun           rerun when new version of antismashdb is available.
                        Prepares the genbank files from antismashDB for the
                        comparison with the queries putative BGCs

Usage

1. Run the assembly with spades or metaspades from the SPAdes-3.13.0-dev/ directory that the authors uploaded to Figshare (compile SPAdes first with the bundled spades_compile.sh).

2. Pass the antiSMASH database prepared above with -asdb, the BiosyntheticSPAdes output directory with -i, and the bgc_statistics.txt file written to that output directory with -bss.

python biosyntheticSPAdes_ranking_pipeline.py -asdb processed_antismashdb/ -bss assembly/bgc_statistics.txt -i assembly/
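Since a missing bgc_statistics.txt is the easiest way for this step to fail, a small pre-flight wrapper can help. The sketch below is my own, not part of the pipeline: it only assembles the command from the flags shown in the help text (-i, -asdb, -bss, -o) and reports which inputs are missing. The helper names are hypothetical, and the default location of bgc_statistics.txt inside the input directory is an assumption based on step 2 above.

```python
import sys
from pathlib import Path

SCRIPT = "biosyntheticSPAdes_ranking_pipeline.py"


def build_ranking_command(input_dir, antismash_db, stats_file=None, output_dir=None):
    """Assemble the argument list for the ranking pipeline."""
    input_dir = Path(input_dir)
    # Assumption: bgc_statistics.txt sits in the assembler output directory.
    stats_file = Path(stats_file) if stats_file else input_dir / "bgc_statistics.txt"
    cmd = [
        sys.executable, SCRIPT,
        "-i", str(input_dir),
        "-asdb", str(Path(antismash_db)),
        "-bss", str(stats_file),
    ]
    if output_dir:
        cmd += ["-o", str(Path(output_dir))]
    return cmd


def missing_inputs(cmd):
    """Return the -i / -asdb / -bss paths in `cmd` that do not exist on disk."""
    flags = ("-i", "-asdb", "-bss")
    paths = [cmd[cmd.index(flag) + 1] for flag in flags]
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    command = build_ranking_command("assembly/", "processed_antismashdb/")
    problems = missing_inputs(command)
    if problems:
        print("missing inputs:", ", ".join(problems))
    else:
        print("ready to run:", " ".join(command))
```

To actually launch the pipeline, the returned list can be handed to subprocess.run(command, check=True) once nothing is reported missing.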

I have not yet worked out how to produce bgc_statistics.txt, so I have not been able to complete a run. I will update this post once I can.

Citation

BiosyntheticSPAdes: Reconstructing Biosynthetic Gene Clusters From Assembly Graphs
Dmitry Meleshko, Hosein Mohimani, Vittorio Tracanna, Iman Hajirasouliha, Marnix H Medema, Anton Korobeynikov, Pavel A Pevzner

Genome Res. 2019 Jun 3. pii: gr.243477.118