VIBRANT: identifying and annotating viral sequences in genomic sequences

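Viruses that infect bacteria and archaea are globally abundant and outnumber their hosts in most environments [ref.1,2,3]. Viruses are obligate intracellular pathogenic genetic elements that can reprogram the metabolic state of host cells during infection, and they may lyse 20-40% of microorganisms per day in diverse environments [ref.4, 5]. Owing to their abundance and widespread activity, viruses contribute to the cycling of essential nutrients such as carbon, nitrogen, phosphorus, and sulfur, which makes them a key component of microbial communities [ref.6,7,8,9,10]. In human systems, viruses have been implicated in dysbiosis that can lead to various diseases such as inflammatory bowel disease, and they may even play a symbiotic role with the immune system [ref.11,12,13].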

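Viruses harbor vast potential for diversity in gene content, sequence, and encoded functions [ref.14,15,16,17]. In recognition of this genetic diversity, there is substantial interest in "mining" viral sequences for novel antimicrobial drug candidates, enzymes for biotechnological applications, and bioremediation [ref.18,19,20,21,22]. More recently, it has become apparent that viruses may directly link the biogeochemical cycling of nutrients by specifically driving metabolic processes [ref.23,24,25,26,27]. For example, during infection, viruses can acquire 40-90% of the nutrients they require from the surrounding environment by taking over and directing host metabolism [ref.28,29,30]. To manipulate the host's metabolic framework, some viruses selectively "steal" metabolic genes from their hosts. These host-derived genes, collectively termed auxiliary metabolic genes (AMGs), can be actively expressed during infection and give the virus a fitness advantage [ref.31,32,33,34]. The need to study the roles of viruses in microbiomes and to incorporate viruses into models of ecosystem function has made it very important to determine which sequences within a whole microbial community are of viral origin. These sequences include free viruses, active intracellular infections (which may affect as many as 30% of all bacteria at any given time [ref.35]), particle-attached or host-attached viruses [ref.36], and host-integrated or episomal viral genomes (i.e., proviruses).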

(Part of the text is omitted here.)

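More recent tools have been developed as alternatives or supplements to VirSorter. VirFinder [ref.41] was the first tool to implement machine learning and to be completely independent of annotation databases for virus prediction, a platform that was later implemented in PPR-Meta [ref.42]. VirFinder was built on the observation that viruses tend to display distinctive patterns of 8-nucleotide frequencies (called 8-mers), an approach that was proposed despite the knowledge that viruses can share very similar nucleotide patterns with their hosts [ref.43]. These 8-mer patterns are used to quickly classify sequences as short as 500 bp and to generate model-derived scores, but it is up to the user to define the score cutoffs. VirFinder was shown to greatly improve virus recovery compared with VirSorter, but it was also shown to exhibit considerable host and source-environment bias when predicting diverse viruses, probably because of reference database-associated biases introduced while training the machine learning model [ref.41]. In addition, low recovery of viruses from certain environments has been reported [ref.44].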
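As a rough illustration of the 8-mer frequency features described above, the following sketch counts overlapping k-mers in a nucleotide sequence and normalizes the counts to relative frequencies. It is not VirFinder's implementation: the function name `kmer_frequencies` and the decision to skip windows that contain ambiguous bases are assumptions made for this example.

```python
from collections import Counter


def kmer_frequencies(sequence, k=8):
    """Return {k-mer: relative frequency} over all overlapping k-mers.

    Illustrative only; this is NOT VirFinder's code. Windows containing
    characters other than A, C, G, T are skipped (an assumption).
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k, or every window had an ambiguous base.
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence, not real data.
    for kmer, freq in sorted(kmer_frequencies("ACGTACGTACGTTTGA").items()):
        print(kmer, round(freq, 3))
```

A classifier would then be trained on such frequency vectors from known viral and host sequences; that step is outside the scope of this sketch.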

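To date, VirSorter remains the most efficient tool for identifying integrated proviruses within metagenomic assemblies. Other tools, mainly PHASTER [ref.47] and Prophage Hunter [ref.48], are specialized for identifying integrated proviruses in whole genomes rather than in scaffolds generated by metagenomic assembly. Like VirSorter, these two provirus predictors rely on reference homology and viral sequence signatures with sliding windows to identify regions of a host genome that belong to a virus. Although they are useful for whole genomes, they cannot identify scaffolds belonging to lytic (i.e., non-integrated) viruses, and they run slowly on large datasets. Furthermore, both PHASTER and Prophage Hunter are available only as web-based servers; no stand-alone command-line tools are provided.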
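The sliding-window idea can be sketched as follows. Assuming every gene on a host genome has already been flagged as having a viral hit (1) or not (0), the code slides a fixed-size window along the genes and merges overlapping windows that are enriched in viral hits into candidate regions. The function name, window size, and threshold are hypothetical, and this is not the algorithm used by PHASTER, Prophage Hunter, or VirSorter.

```python
def viral_windows(flags, window=5, min_fraction=0.6):
    """Return merged (start, end) gene-index ranges, end exclusive.

    flags: list of 0/1 values, one per gene in genome order (assumed to be
    precomputed, e.g. from homology searches). A window is kept when at
    least `min_fraction` of its genes are flagged; kept windows that
    overlap or touch are merged into one region.
    """
    regions = []
    for start in range(len(flags) - window + 1):
        end = start + window
        if sum(flags[start:end]) / window >= min_fraction:
            if regions and start <= regions[-1][1]:
                # Extend the previous region instead of opening a new one.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Toy flags: four consecutive genes with viral hits inside a host genome.
    toy_flags = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    print(viral_windows(toy_flags, window=4, min_fraction=0.75))
```

Because a window only needs to be mostly viral, the reported region can include a few flanking host genes, which is one reason real tools refine the boundaries afterwards.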

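Here the authors developed VIBRANT (Virus Identification By iteRative ANnoTation), a tool for the automated recovery, annotation, and curation of both free and integrated viruses from metagenomic assemblies and genome sequences. VIBRANT can identify diverse dsDNA, ssDNA, and RNA viruses that infect both bacteria and archaea. To maximize the identification of diverse and novel viruses, VIBRANT uses neural networks of protein annotation signatures obtained from non-reference-based similarity searches with hidden Markov models (HMMs), together with a unique "v-score" metric. After identifying viruses, VIBRANT performs a curation step to validate its predictions. VIBRANT further characterizes the function of the viral community by highlighting AMGs and assesses the metabolic pathways present in the viral community. All viral genomes, proteins, annotations, and metabolic profiles are compiled into formats suited to user-friendly downstream analysis and visualization. When applied to reference viruses, non-reference viral datasets, and various metagenomes, VIBRANT outperformed VirFinder, VirSorter, and MARVEL in maximizing virus recovery while minimizing false discoveries. Compared with PHASTER, Prophage Hunter, and VirSorter, VIBRANT showed comparable performance in extracting integrated provirus regions from host scaffolds. VIBRANT was also used to identify differences in metabolic capability between viruses originating from various environments. Applying VIBRANT to three cohorts of Crohn's disease patients revealed not only that the types of viruses differed from those in healthy individuals, but also virally encoded genes that may influence the disease state. VIBRANT is freely available for download at https://github.com/AnantharamanLab/VIBRANT. VIBRANT is also available as a user-friendly, web-based application through the CyVerse Discovery Environment (https://de.cyverse).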

Flowchart (from GitHub)

My first first author paper is out! VIBRANT, software to identify and characterize microbial viruses from genomic data. @AnantharamanLab @KarthikGeomicro @ZhichaoZhou_CHN https://t.co/bX22DBve6p
— Kristopher Kieft (@KrisKieft) June 11, 2020

Interested in viruses in metagenomes/genomes? Our new software VIBRANT built by Ph.D student @KrisKieft will identify and annotate viruses, predict metabolic genes, genome quality, and provides visual summaries. Preprint and docker image coming soon. https://t.co/imLPyGagNd
— Karthik Anantharaman (@KarthikGeomicro) October 16, 2019

Installation

I installed the dependencies with conda, cloned the VIBRANT code, and tested it. I have not tested it, but VIBRANT can now also be installed from Bioconda.

Dependencies

VIBRANT has been tested and successfully run on Mac, Linux and Ubuntu systems.

Python3: https://www.python.org (version >= 3.5)
Prodigal: https://github.com/hyattpd/Prodigal
HMMER3: https://github.com/EddyRivasLab/hmmer
gzip: http://www.gzip.org/
tar: https://www.gnu.org/software/tar/
wget: https://www.gnu.org/software/wget/

Python Dependencies

BioPython: https://biopython.org/wiki/Download
Pandas: https://pandas.pydata.org/pandas-docs/stable/install.html
Matplotlib: https://matplotlib.org/
Seaborn: https://seaborn.pydata.org/
Numpy (version >= 1.17.0): https://numpy.org/
Scikit-learn: https://scikit-learn.org/stable/
Pickle: https://docs.python.org/3/library/pickle.html
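Before installing VIBRANT itself, it can help to confirm that the programs and Python packages listed above are actually visible in the active environment. The following is a minimal sketch using only the standard library. The executable names (`prodigal`, `hmmsearch`) and the import names (`Bio` for BioPython, `sklearn` for Scikit-learn) are assumptions about a typical installation, not something taken from the VIBRANT documentation.

```python
import importlib.util
import shutil

# Assumed executable names for the programs listed above; adjust them if
# your installation uses different names.
REQUIRED_TOOLS = ["prodigal", "hmmsearch", "gzip", "tar", "wget"]

# Import names, which can differ from package names
# (BioPython -> Bio, Scikit-learn -> sklearn).
REQUIRED_MODULES = ["Bio", "pandas", "matplotlib", "seaborn",
                    "numpy", "sklearn", "pickle"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules):
    """Return the top-level modules that cannot be located for import."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing tools:  ", missing_tools(REQUIRED_TOOLS))
    print("missing modules:", missing_modules(REQUIRED_MODULES))
```

Two empty lists suggest the environment is ready; the check does not verify version requirements such as Numpy >= 1.17.0.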

Main program (GitHub)

# create a Python 3.7 virtual environment and install the dependencies into it
conda create -n Vibrant -c bioconda -y hmmer prodigal biopython python=3.7
conda activate Vibrant
conda install -c anaconda -y pandas matplotlib numpy seaborn scikit-learn

git clone https://github.com/AnantharamanLab/VIBRANT.git
chmod -R 777 VIBRANT
cd VIBRANT/

# or install from Bioconda
conda install -c bioconda vibrant -y

$ python3 VIBRANT_run.py -h
usage: VIBRANT_run.py [-h] [--version] -i I [-f {prot,nucl}] [-t T] [-l L]
                      [-o O] [-virome] [-no_plot] [-k K] [-p P] [-v V] [-e E]
                      [-a A] [-c C] [-n N] [-s S] [-m M] [-g G]

Usage: python3 VIBRANT_run.py -i <input_file> [options]. VIBRANT identifies
bacterial and archaeal viruses (phages) from assembled metagenomic scaffolds
or whole genomes, including the excision of integrated proviruses. VIBRANT
also performs curation of identified viral scaffolds, estimation of viral
genome completeness and analysis of viral metabolic capabilities.

optional arguments:
  -h, --help      show this help message and exit
  --version       show program's version number and exit
  -i I            input fasta file
  -f {prot,nucl}  format of input [default="nucl"]
  -t T            number of parallel VIBRANT runs, each occupies 1 CPU
                  [default=1, max of 1 CPU per scaffold]
  -l L            length in basepairs to limit input sequences [default=3000,
                  can increase but not decrease]
  -o O            number of ORFs per scaffold to limit input sequences
                  [default=4, can increase but not decrease]
  -virome         use this setting if dataset is known to be comprised mainly
                  of viruses. More sensitive to viruses, less sensitive to
                  false identifications [default=off]
  -no_plot        suppress the generation of summary plots [default=off]
  -k K            path to KEGG HMMs (if moved from default location)
  -p P            path to Pfam HMMs (if moved from default location)
  -v V            path to VOG HMMs (if moved from default location)
  -e E            path to plasmid HMMs (if moved from default location)
  -a A            path to viral-subset Pfam HMMs (if moved from default
                  location)
  -c C            path to VIBRANT categories file (if moved from default
                  location)
  -n N            path to VIBRANT annotation to name file (if moved from
                  default location)
  -s S            path to VIBRANT summary of KEGG metabolism file (if moved
                  from default location)
  -m M            path to VIBRANT neural network machine learning model (if
                  moved from default location)
  -g G            path to VIBRANT AMGs file (if moved from default location)

$ python3 scripts/VIBRANT_annotation.py -h
usage: VIBRANT_annotation.py [-h] [--version] -i I [-f {prot,nucl}] [-l L]
                             [-o O] [-virome] [-k K] [-p P] [-v V] [-e E]
                             [-a A] [-c C] [-n N] [-m M] [-g G]

See main wrapper script: VIBRANT_run.py. This script performs the bulk of the
work but is not callable on its own.

optional arguments:
  -h, --help      show this help message and exit
  --version       show program's version number and exit
  -i I            input fasta file
  -f {prot,nucl}  format of input [default="nucl"]
  -l L            length in basepairs to limit input sequences [default=3000,
                  can increase but not decrease]
  -o O            number of ORFs per scaffold to limit input sequences
                  [default=4, can increase but not decrease]
  -virome         use this setting if dataset is known to be comprised mainly
                  of viruses. More sensitive to viruses, less sensitive to
                  false identifications [default=off]
  -k K            path to KEGG HMMs (if moved from default location)
  -p P            path to Pfam HMMs (if moved from default location)
  -v V            path to VOG HMMs (if moved from default location)
  -e E            path to plasmid HMMs (if moved from default location)
  -a A            path to viral-subset Pfam HMMs (if moved from default
                  location)
  -c C            path to VIBRANT categories file (if moved from default
                  location)
  -n N            path to VIBRANT annotation to name file (if moved from
                  default location)
  -m M            path to VIBRANT neural network machine learning model (if
                  moved from default location)
  -g G            path to VIBRANT AMGs file (if moved from default location)

Database preparation

Download the databases. About 11 GB of disk space is required.

# the setup scripts live in the databases/ folder of the cloned repository
cd databases
python3 VIBRANT_setup.py

#verify
python3 VIBRANT_test_setup.py

Done. Several new databases are now in this folder.

VIBRANT should be ready to go. You can verify this by running VIBRANT_test_setup.py within this folder (databases/)

# verify

$ python3 VIBRANT_test_setup.py

VIBRANT v1.0.0 is good to go!
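Because the databases need roughly 11 GB, it can be worth checking the free disk space before running the setup script. The following is a small sketch using the standard library; the helper names are hypothetical, and the 11 GB default simply mirrors the figure quoted above rather than an exact requirement.

```python
import shutil


def free_gigabytes(path="."):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_databases(path=".", required_gb=11):
    """True if at least `required_gb` GB are free at `path`.

    The default of 11 GB is the approximate figure quoted in this post.
    """
    return free_gigabytes(path) >= required_gb


if __name__ == "__main__":
    print(f"free space: {free_gigabytes('.'):.1f} GB")
    print("enough for the VIBRANT databases:", has_room_for_databases("."))
```

Run it from inside the databases/ folder so that the check applies to the filesystem where the files will be written.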

Test run

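Specify the assembly FASTA file (the headers must not contain the string "fragment"). Amino acid FASTA input is also supported, but there are rules about how it must be sorted. See GitHub for details.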

git clone https://github.com/AnantharamanLab/VIBRANT.git
cd VIBRANT/example_data/
python3 ../VIBRANT_run.py -i ../example_data/mixed_example.fasta -t 12

-t number of parallel VIBRANT runs, each occupies 1 CPU [default=1, max of 1 CPU per scaffold]
-i input fasta file
-f format of input [default="nucl"]
-l length in basepairs to limit input sequences [default=3000, can increase but not decrease]
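The input constraints mentioned above can be checked before launching a run. The sketch below parses a nucleotide FASTA file with plain Python and reports headers that contain "fragment" as well as sequences shorter than the `-l` limit (default 3000). The function names are hypothetical; the case-insensitive match on "fragment" and the reading of `-l` as a minimum length are assumptions made here, not documented VIBRANT behavior.

```python
import sys


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def preflight(path, min_length=3000):
    """Return (headers containing 'fragment', headers of short sequences).

    The match is case-insensitive to be conservative (an assumption).
    min_length mirrors the default of the -l option shown above.
    """
    bad_headers, too_short = [], []
    for header, seq in read_fasta(path):
        if "fragment" in header.lower():
            bad_headers.append(header)
        if len(seq) < min_length:
            too_short.append(header)
    return bad_headers, too_short


if __name__ == "__main__" and len(sys.argv) > 1:
    bad, short = preflight(sys.argv[1])
    print("headers containing 'fragment':", bad)
    print("sequences shorter than 3000 bp:", len(short))
```

Saved under a name of your choice, the script takes the FASTA path as its only argument, for example the mixed_example.fasta file used in the test run.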