VAPiD: a viral annotation pipeline

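As sequencing technology becomes cheaper and more accessible, genome sequencing is becoming increasingly widespread. Small groups now generate more sequence data than they can analyze on their own. To extract the greatest scientific and public health value from these data, sharing assembled consensus genomes and raw sequencing data is essential: "The democratization of genomics takes a village." This is especially true for infectious diseases, where searchable sequence databases enable real-time tracking of circulating viruses and the resolution of foodborne bacterial outbreaks [ref. 1, 2, 3]. Recent high-profile cases attract the most attention, but the fact remains that almost every infectious disease exists as part of an ongoing outbreak [ref. 1, 4]. In the clinic, metagenomic analysis pipelines depend on the availability of infectious disease genomes for faster and more accurate alignment [ref. 5, 6, 7]. In basic science, a graduate student studying the function of a protein is greatly helped by being able to pull up the history of that protein's sequence diversity before designing experiments [ref. 8].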

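Many people foresee a world in which the genome of nearly every infectious disease case is sequenced and archived in a publicly searchable database [ref. 9, 10]. State and federal public health laboratories are building the capacity to sequence more than 6000 influenza virus genomes and more than 5000 enteric pathogenic bacterial genomes every year [ref. 11]. Major efforts to streamline the workflow from nucleic acid extraction to data deposition and analysis have allowed this throughput to grow rapidly [ref. 12, 13, 14]. With these tools, public health laboratories and research laboratories can focus on the epidemiological or scientific insights gained from the sequences rather than on cumbersome data-deposition protocols. In genome annotation specifically, NCBI has created the Prokaryotic Genome Annotation Pipeline, the Eukaryotic Genome Annotation Pipeline, and the Influenza Virus Sequence Annotation Tool [ref. 15, 16].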

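Surprisingly, NCBI GenBank currently has no automated viral genome annotation pipeline other than the one for influenza virus. The enormous diversity of DNA and RNA viruses makes developing a universal annotator a challenge [ref. 17]. Complex viral life cycles, including RNA editing, ribosomal slippage, and overlapping reading frames, together with non-standard nomenclature for viral gene products, cause further annotation problems [ref. 18, 19, 20].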

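To accept submitted viral genome data, NCBI GenBank requires 1) viral sequences with at least one protein annotation, 2) author/depositor metadata, and 3) viral sequence metadata such as strain, collection date, collection location, and coverage. Manual annotation of nucleotide sequences is feasible for a small number of viruses, but it is very time-consuming and labor-intensive. Even once correct annotations have been obtained for every viral sequence, manually merging the author and sample metadata to create the submission files is just as time-consuming and is not a workable solution even for groups that sequence only a few viruses.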
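The sequence metadata listed above can be collected in a small CSV before running any annotation tool. The following is a minimal Python 3 sketch, not part of VAPiD itself. The column names (`strain`, `collection-date`, `country`, `coverage`) and the sample values are assumptions for illustration and should be checked against the example_metadata.csv shipped in the VAPiD repository.

```python
import csv
import io

# Column names are assumptions for illustration; verify them against the
# example_metadata.csv in the VAPiD repository before real use.
FIELDS = ["strain", "collection-date", "country", "coverage"]


def write_metadata(rows, handle):
    """Write one metadata row per sequence; reject rows with empty fields."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        missing = [field for field in FIELDS if not row.get(field)]
        if missing:
            raise ValueError(
                "missing fields for %r: %s" % (row.get("strain"), ", ".join(missing))
            )
        writer.writerow(row)


# Hypothetical sample values, written to an in-memory buffer for the demo.
rows = [
    {"strain": "Sample1", "collection-date": "2016", "country": "USA", "coverage": "100"},
]
buffer = io.StringIO()
write_metadata(rows, buffer)
print(buffer.getvalue())
```

Failing early on an empty field is a design choice of this sketch: a gap is easier to fix in the spreadsheet than after the submission files have been generated.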

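To date, existing viral annotation tools have focused mainly on bulk submission of a single viral species. This may be a holdover from the time when specific PCR-based methods were required to faithfully recover whole viral genomes, or from researchers concentrating on one virus at a time. With the rise of metagenomic or shotgun next-generation sequencing and growing sequencing capacity, researchers can now confidently batch many different RNA or DNA viruses in a single sequencing run.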

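To make viral genome annotation easier, the authors developed VAPiD, a lightweight and user-friendly command-line tool that takes a FASTA file of complete or near-complete viral genomes as input, annotates them automatically, and outputs the files required for submission to GenBank by email. VAPiD runs with a simple one-line command, handles bulk submission of multiple viruses of different types without prior knowledge of the viral species, correctly annotates RNA editing and ribosomal slippage, spell-checks the annotations, handles metadata supplied in bulk or entered individually, and produces annotated viral sequence files for GenBank deposition.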
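Because the input is a multi-FASTA file of complete or near-complete genomes, it can help to list what the file contains before annotating it. The sketch below is a hypothetical pre-flight helper in Python 3, not part of VAPiD. It takes the sample name to be the first whitespace-delimited token of the header, following the `Sample1 (Human/USA/2016/A) Complete CDS` example in the tool's help text, and reports the sequence length and the fraction of N bases.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(lines):
    """Return (sample name, length, fraction of N bases) for each record."""
    summary = []
    for header, seq in read_fasta(lines):
        tokens = header.split()
        name = tokens[0] if tokens else ""
        n_fraction = seq.upper().count("N") / len(seq) if seq else 0.0
        summary.append((name, len(seq), n_fraction))
    return summary


# Small in-memory example; a real file can be passed as an open file handle.
example = [">Sample1 (Human/USA/2016/A) Complete CDS", "ACGT", "NNAC", ">Sample2", "AC"]
for name, length, n_fraction in summarize(example):
    print(name, length, n_fraction)
```

For a file on disk, `summarize(open("example.fasta"))` works the same way, since a file handle is an iterable of lines.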

Figure: General design and information flow of VAPiD (reproduced from the paper).

VAPiD can perform three different types of viral annotation (from GitHub):

Rapid search of a local compressed viral database for any set of viruses with a reference genome (recommended)
Annotation of a single viral species based on a preferred reference genome using a single Genbank accession number.
Comprehensive web NT database search for all viral sequences (much slower)

Dependencies (from GitHub)

Python - tested almost exclusively on python 2.7.14. Python 3 and above have syntax issues and actually break when you try to manually enter metadata.

Ensure you have python with numpy and biopython, mafft, and blast+ installed locally and on your path. Then you'll need to install tbl2asn and put it on your path.
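Whether the external programs are actually on the path can be checked before the first run. The following Python 3 sketch is a hypothetical helper, not part of VAPiD. The executable names `mafft`, `blastn`, and `tbl2asn` are assumptions derived from the dependency list above. It does not check numpy or biopython, because those must be importable from the Python 2.7 interpreter that runs vapid.py, not from the interpreter running this check.

```python
import shutil

# Executable names are assumptions based on the dependency list above.
REQUIRED_TOOLS = ["mafft", "blastn", "tbl2asn"]


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executables that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


for tool, path in find_tools().items():
    print("%-8s %s" % (tool, path or "NOT FOUND"))
```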

Source code: GitHub

Database

https://github.com/rcs333/VAPiD/releases

$ python vapid.py -h
usage: vapid.py [-h] [--metadata_loc METADATA_LOC] [--r R] [--f F] [--db DB]
                [--online] [--spell_check] [--all] [--slashes] [--dna]
                fasta_file author_template_file_loc

Version v1.6.3 Prepares FASTA file for NCBI Genbank submission through local
or online blastn-based annotation of viral sequences. In default mode, VAPiD
searches this folder for our viral databases.

positional arguments:
  fasta_file            Input file in .fasta format containing complete or
                        near complete genomes for all the viruses that you
                        want to have annotated
  author_template_file_loc
                        File path for the NCBI-provided sequence author
                        template file (should have a .sbt extension)
                        https://submit.ncbi.nlm.nih.gov/genbank/template/submission/

optional arguments:
  -h, --help            show this help message and exit
  --metadata_loc METADATA_LOC
                        If you've input the metadata in the provided csv,
                        specify the location with this optional argument.
                        Otherwise all metadata will be manually prompted for.
  --r R                 If you want to specify a specific NCBI reference, put
                        the accession number here - must be the exact
                        accession number - note: feature forces all sequences
                        in FASTA to be this viral species.
  --f F                 specify a custom gbf file that you would like to
                        annotate off of
  --db DB               specify the local blast database name. You MUST have
                        blast+ with blastninstalled correctly on your system
                        path for this to work.
  --online              Force VAPiD to blast against online database. This is
                        good for machines that don't have blast+ installed or
                        if the virus is really strange.Warning: this can be
                        EXTREMELY slow, up to ~5-25 minutes a virus
  --spell_check         Turn on spellchecking for protein annoations
  --all                 Use this flag to transfer ALL annotations from
                        reference, this is largely untested
  --slashes             Use this flag to allow any characters in the name of
                        your virus - This allows you to submit with a fasta
                        file formated like >Sample1 (Human/USA/2016/A)
                        Complete CDS make sure that your metadata file only
                        contains the first part of your name 'Sample1' in the
                        example above. You can also submit names with slashes
                        by specifying in the metadata sheet under the header
                        full_name, if you do that you do not need to use this
                        flag
  --dna                 Make all files annotated by this run be marked as DNA
                        instead of the default (RNA)
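The flags in the help text map onto the three annotation modes: the default local database search, a single reference chosen with `--r`, and the online search forced with `--online`. The following Python 3 sketch only assembles the argument list and runs nothing. The function `build_vapid_command` and its parameter names are hypothetical; only the flags themselves come from the help output.

```python
def build_vapid_command(fasta, sbt, metadata=None, reference=None,
                        online=False, dna=False, spell_check=False):
    """Return the argument list for one vapid.py run (nothing is executed)."""
    cmd = ["python", "vapid.py", fasta, sbt]
    if metadata:
        # Without --metadata_loc, VAPiD prompts for all metadata manually.
        cmd += ["--metadata_loc", metadata]
    if reference:
        # Forces every sequence in the FASTA to be treated as this species.
        cmd += ["--r", reference]
    if online:
        cmd.append("--online")
    if spell_check:
        cmd.append("--spell_check")
    if dna:
        cmd.append("--dna")
    return cmd


# Default mode: search the local viral database, metadata supplied as CSV.
local_cmd = build_vapid_command("example.fasta", "example.sbt",
                                metadata="example_metadata.csv")
print(" ".join(local_cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the VAPiD directory, using the Python interpreter the tool was tested with.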

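1. The .sbt file. It can be created on the NCBI submission template page (https://submit.ncbi.nlm.nih.gov/genbank/template/submission/): fill in the form and press the generate button. The GitHub repository includes example.sbt (do not use it for a real submission).

2. The metadata file. An example FASTA file and its metadata are included in the GitHub repository.

3. The database files (a BLAST+ database of viruses). Download all_virus.zip from the releases page listed above, unzip it, and place all_virus.fasta.nhr, all_virus.fasta.nin, and all_virus.fasta.nsq in VAPiD/.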
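A quick check that all three database files ended up in the VAPiD directory can save a failed first run. The following Python 3 sketch is a hypothetical helper, not part of VAPiD. It uses the file names given above, and the demo runs in a scratch directory that deliberately lacks the .nsq file.

```python
import tempfile
from pathlib import Path

DB_PREFIX = "all_virus.fasta"
DB_SUFFIXES = (".nhr", ".nin", ".nsq")


def missing_db_files(vapid_dir, prefix=DB_PREFIX):
    """Return the names of BLAST database files absent from vapid_dir."""
    root = Path(vapid_dir)
    return [prefix + suffix for suffix in DB_SUFFIXES
            if not (root / (prefix + suffix)).is_file()]


# Demo: a scratch directory holding only two of the three files.
with tempfile.TemporaryDirectory() as tmp:
    for suffix in (".nhr", ".nin"):
        (Path(tmp) / (DB_PREFIX + suffix)).touch()
    demo_missing = missing_db_files(tmp)
print(demo_missing)
```

In practice the call would be `missing_db_files("VAPiD-master")`, or whatever directory holds vapid.py. An empty list means all three files are present.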

Test run

cd VAPiD-master/
python vapid.py example.fasta example.sbt --metadata_loc example_metadata.csv

Citation

VAPiD: a lightweight cross-platform viral annotation pipeline and identification tool to facilitate virus genome submissions to NCBI GenBank

Ryan C. Shean, Negar Makhsous, Graham D. Stoddard, Michelle J. Lin, Alexander L. Greninger
BMC Bioinformatics 2019 20:48
