2023-06-11

zDB: comparative genome analysis of bacteria made easy

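　Genome analysis and comparison rely on a variety of tools for tasks such as annotation, orthology prediction, and phylogenetic inference. Most tools, however, specialize in a single task, and additional effort is needed to integrate and visualize the results. To fill this gap, the authors developed zDB, an application that integrates an analysis pipeline with a visualization platform. Starting from GenBank annotation files, zDB identifies orthologs and infers a phylogeny for each orthogroup. It also builds a species phylogeny from shared single-copy orthologs. The results can be enriched with Pfam protein domain predictions, COG and KEGG annotations, and Swissprot homologs. The web application lets users search for specific genes or annotations, run Blast queries, and compare genomic regions or whole genomes. The metabolic capabilities of organisms can be compared at the module or pathway level. Users can also run queries to examine the conservation of specific genes or annotations, with results displayed as gene lists, Venn diagrams, or heatmaps. zDB is well suited to processing datasets of tens to hundreds of genomes on a desktop machine.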

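　Genome comparison and analysis rely on many independent tools, leaving scientists with the burden of integrating, visualizing, and interpreting the results. To lighten this burden, the authors developed zDB, a comparative genomics tool with both an analysis pipeline and a visualization platform. The analysis pipeline automates gene annotation, orthology prediction, and phylogenetic inference, and the visualization platform is designed so that scientists can easily explore the results in a web browser. The interface supports visual comparison of whole genomes and regions of interest, assessment of the conservation of genes and metabolic pathways, Blast searches, and searches for specific annotations. The tool is useful for a broad range of applications in comparative studies of 2 to 100 genomes. It is also designed so that datasets can be shared easily at a local or international scale, allowing non-bioinformaticians to carry out exploratory analyses of the genomes of their favorite organisms.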

Documentation

https://zdb.readthedocs.io/en/latest/

Installation

I tested it in an environment created with conda.

GitHub

#conda; tested v1.1.2
mamba create -n zdb python=3
conda activate zdb
mamba install -c metagenlab -c bioconda zdb=1.1.2

> zdb

zDB (v1.1.2)

Available commands:

setup - download and prepare the reference databases

webapp - start the webapp

run - run the analysis pipeline

export - exports the results of a previous run in an archive

import - unpack an archive that was prepared with the export command in the current directory

- so that the results can be used to start the webapp

list_runs - lists the completed runs available to start the website in a given directory

help - print this message

> zdb setup --help

Downloads and sets up the reference database used by the analysis pipeline.

The following options can be used:

--cog: downloads the CDD profiles used for COG annotations

--ko: downloads and setups the hmm profiles of the ko database

--pfam: downloads and setups up the hmm profiles of the PFAM protein domains

--swissprot: downloads and indexes the swissprot database

Other parameters:

--dir: directory where to store the reference databases (defaults zdb_ref in the current directory)

--resume: resume a previously failed execution

Environments (by default, singularity containers are used):

--conda: uses conda environment to prepare the databases

--docker: uses docker containers to prepare the databases

--singularity_dir: the directory where the singularity images are downloaded (default singularity in current directory)

> zdb run --help

Run the analysis pipeline (some analysis may not be available depending on which reference databases were setup)

The following options can be used

--resume: wrapper for nextflow resume. Add this flag to the command that failed to resume the execution where it stopped.

--out: directory where the files necessary for the webapp will be stored

--input: CSV file containing the path to the genbank files to include in the analsysis

--docker: use docker containers instead of singularity

--conda: use conda environments instead of singularity

--name: run name (defaults to the name given by nextflow)

the latest completed run is also named latest

--cog: perform cog annotation

--ko: perform ko (metabolism) annotation

--pfam: peform PFAM domain annotation

--swissprot: search for homologs in the swissprot database

--ref_dir: directory where the reference databases were setup up (default zdb_ref)

--cpu: number of parallel processes allowed (default 8)

--mem: max memory usage allowed (default 8GB)

--num_missing: allows for missing genomes for the determination of core orthogroups (default 0)

--singularity_dir: the directory where the singularity images are downloaded (default singularity in current directory)

> zdb webapp --help

This script starts the web application, using by default the database

that was generated by the latest run of analysis

Arguments:

--dir: the directory where the results (should contain a zdb subfolder) can be found

(defaults to the current directory)

--name=NAME: tells the web application to use a different database than the latest

--port=PORT: the web server will listen on a different port (default: 8080)

--allowed_host=HOSTS: coma separated list that will be passed as argument

to django ALLOWED_HOSTS. If none specified, will try

to guess with the hostname command. This is basically

the URL or IP adress you will using to access the web page.

By default, the web application will be run in a singularity container.

This can be changed with either one of the two following options:

--docker: the web application will be run in a docker instead of

a singularity container

--conda: the web application will be run in a conda environment

--singularity_dir: the directory where the singularity images are downloaded (default singularity in current directory)

> zdb import --help

Unpack an archive that was prepared by the export command

You can alternatively manually unpack it.

The following options can be used

--outdir: specify where the archive will be unpacked

--archive: specify the archive to be unpacked

> zdb export --help

Exports the results of a given run name into an archive to make sharing easier

The following options can be used

--dir: specify the directory where the analysis was run

--name: specify the run to be exported

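Databases

The databases are not needed for a minimal run, but annotation can be performed against several of them. With annotations in place, once the web app is launched you can check whether a given annotation is present in the lineages of interest. Download the databases beforehand if you need them. Four databases are supported.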

zdb setup --pfam --swissprot --cog --ko --conda

Test run

A test dataset consisting of GenBank files for five genomes can be downloaded.

wget https://github.com/metagenlab/zDB/raw/master/test_dataset.tar.gz
tar xvf test_dataset.tar.gz

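Extracting the archive yields a ref directory and a CSV file that lists the relative paths of the genome files.

input.csv

The first row is the header "file"; a "name" column can optionally be added. If the name column is filled in, those names are used in the web app.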
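For your own genomes, the input file can be generated rather than typed by hand. The following is a minimal sketch, not part of zDB: the helper name `write_zdb_input`, the `.gbk` extension, the column order, and the choice of the file stem as display name are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def write_zdb_input(genbank_dir, csv_path, with_names=True):
    """Write a CSV with a "file" column (and optionally "name") for zdb run.

    Assumptions (not taken from the zDB docs): GenBank files end in .gbk and
    the display name is the file name without its extension.
    """
    paths = sorted(Path(genbank_dir).glob("*.gbk"))
    header = ["file", "name"] if with_names else ["file"]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for path in paths:
            row = [str(path), path.stem] if with_names else [str(path)]
            writer.writerow(row)
    return len(paths)  # number of genomes listed


if __name__ == "__main__":
    count = write_zdb_input("ref", "input.csv")
    print(f"listed {count} GenBank files")
```

Run from the directory that contains the GenBank folder so that the paths written to the CSV stay relative, as in the test dataset.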

Once everything is ready, run the pipeline. By default, Singularity images are downloaded and used (Singularity v3.8.3 or later must already be installed). --docker and --conda can be used instead.

zdb run --input=input.csv --name=simple_run

--resume wrapper for nextflow resume. Add this flag to the command that failed to resume the execution where it stopped.
--out directory where the files necessary for the webapp will be stored
--input CSV file containing the path to the genbank files to include in the analsysis
--docker use docker containers instead of singularity
--conda use conda environments instead of singularity
--name run name (defaults to the name given by nextflow) the latest completed run is also named latest
--cpu number of parallel processes allowed (default 8)
--mem max memory usage allowed (default 8GB)

(Some of the options were not recognized.)

The computation took about 3 minutes.

Output

zdb/

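Next, start the web app from the path where the run command was executed. The default is Singularity, but conda and Docker are also supported. Use "--name" to pass the name given to the zdb run command. If you do not remember the name, check it with the "zdb list_runs" command.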

#start webapp
zdb webapp --name=simple_run

--dir the directory where the results (should contain a zdb subfolder) can be found (defaults to the current directory)
--name=NAME tells the web application to use a different database than the latest
--port=PORT the web server will listen on a different port (default: 8080)
--allowed_host=HOSTS comma separated list that will be passed as argument to django ALLOWED_HOSTS. If none specified, will try to guess with the hostname command. This is basically the URL or IP adress you will using to access the web page.
--docker the web application will be run in a docker instead of a singularity container
--conda the web application will be run in a conda environment
--singularity_dir the directory where the singularity images are downloaded (default singularity in current directory)

With Singularity, the image is downloaded into the current directory at startup, so it takes a while before the app comes up.

Example run

$ zdb webapp --name=simple_run --port=8090 --allowed_host=127.0.0.1

Starting web server. The application will be accessible @127.0.0.1 on port 8090

The app is up. Here it is served on localhost at a free port: "http://127.0.0.1:8090".

Let's go through the available analyses from the top (the screenshots show four genomes that I annotated myself).

Genomes

The genomes in the database are listed together with a summary of their contents.

Clicking a genome shows its annotations (taken from the user's GenBank file).

Phylogeny

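The tree inferred from the concatenated single-copy orthologs is shown together with metadata such as genome size, GC content, coding density, and genome completeness/contamination (shown as percentages).

The figure can be downloaded in SVG format.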
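To make the idea of "shared single-copy orthologs" concrete, here is a small sketch of how such groups could be selected, including a tolerance for missing genomes in the spirit of the --num_missing option of zdb run. This is my reading of the option, not zDB's actual implementation; the function name and the toy data are hypothetical.

```python
def core_single_copy_orthogroups(orthogroups, genomes, num_missing=0):
    """Return the orthogroup ids usable for a species tree.

    orthogroups maps an orthogroup id to {genome: number of member genes}.
    A group is kept when no genome carries more than one copy and at most
    num_missing genomes lack it entirely.
    """
    kept = []
    for og_id, counts in orthogroups.items():
        copies = [counts.get(genome, 0) for genome in genomes]
        if max(copies) > 1:
            continue  # a paralog exists somewhere, so not single copy
        if copies.count(0) <= num_missing:
            kept.append(og_id)
    return sorted(kept)


# Toy data: OG2 has a duplicate in genome B, OG3 is missing from genome B.
example = {
    "OG1": {"A": 1, "B": 1, "C": 1},
    "OG2": {"A": 1, "B": 2, "C": 1},
    "OG3": {"A": 1, "C": 1},
}
print(core_single_copy_orthogroups(example, ["A", "B", "C"]))
# ['OG1']
print(core_single_copy_orthogroups(example, ["A", "B", "C"], num_missing=1))
# ['OG1', 'OG3']
```

Allowing a few missing genomes keeps more markers for the concatenated alignment when some assemblies are incomplete.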

Homology search

BLAST searches can be run against the genomes in the database. blastn, blastp, blastx, and tblastn are available.

Orthology

This section explores protein orthogroups (orthologous groups). It is split into five analyses.

Orthologous groups: orthogroups are identified with OrthoFinder, which derives them by MCL clustering of BLASTp results (parameter: -evalue 0.001).

Comparisons

Select the genomes in which an orthogroup must be present and the genomes in which it must be absent.

The orthogroups shared among the selected genomes are then listed in a table.
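The selection logic amounts to set operations on presence/absence data. A minimal sketch follows; it is an illustration of the idea, not code from zDB, and the function name and toy data are hypothetical.

```python
def shared_orthogroups(presence, include, exclude=()):
    """Orthogroups found in every genome of include and in none of exclude.

    presence maps a genome name to the set of orthogroup ids it contains.
    """
    include = list(include)
    if not include:
        return set()
    # Intersection over the "present in" genomes (returns a new set).
    shared = set.intersection(*(presence[genome] for genome in include))
    # Remove anything seen in the "absent from" genomes.
    for genome in exclude:
        shared -= presence[genome]
    return shared


presence = {
    "A": {"OG1", "OG2", "OG3"},
    "B": {"OG1", "OG2"},
    "C": {"OG1", "OG4"},
}
print(shared_orthogroups(presence, include=["A", "B"], exclude=["C"]))
# {'OG2'}
```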

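The number of occurrences of each annotation is shown both for the whole database and for the selected genomes. The annotation of an orthogroup is the consensus of the annotations of all its members, and only the two most frequent annotations are reported as the consensus. For each orthogroup, the gene name, product, and COG category are shown (only when the annotation options were specified) (from the manual).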
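The "two most frequent annotations" rule can be pictured as a simple frequency count. This is a sketch of the idea only, under the assumption that empty annotations are ignored; the function name and the example labels are hypothetical.

```python
from collections import Counter


def consensus_annotations(member_annotations, top_n=2):
    """Return the top_n most frequent annotations with their counts.

    Empty strings and None are skipped (an assumption of this sketch).
    """
    counts = Counter(a for a in member_annotations if a)
    return counts.most_common(top_n)


members = [
    "DNA gyrase subunit A",
    "DNA gyrase subunit A",
    "DNA gyrase subunit A",
    "hypothetical protein",
    "hypothetical protein",
    "DNA topoisomerase",
    "",
]
print(consensus_annotations(members))
# [('DNA gyrase subunit A', 3), ('hypothetical protein', 2)]
```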

Clicking an orthogroup shows its details.

Homologs

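The phylogenetic tree of that orthogroup can also be viewed.

(From the manual) The trees are reconstructed with FastTree using default parameters. The node support values are not traditional bootstrap support values. FastTree quickly estimates the reliability of each split in the tree with the Shimodaira-Hasegawa test on 1,000 bootstrap replicates. Values above 0.95 can be considered "strongly supported".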

Venn diagram

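An interactive Venn diagram shows the number of orthogroups shared among all or some of the selected genomes, as well as the number of orthologous groups unique to each genome. Below the Venn diagram, a bar chart shows the total number of orthologs identified in each genome. The last plot shows the number of unique or shared orthologs (from the manual).

Clicking a number in the diagram lists the corresponding orthogroups below it.

Each orthogroup in the list can be clicked to view its details.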
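Each number in a Venn diagram counts the orthogroups found in exactly one combination of genomes. The sketch below shows one way to compute those regions from presence data; it is illustrative only, and the names and toy data are hypothetical.

```python
from collections import defaultdict


def venn_regions(presence):
    """Group orthogroups by the exact set of genomes that contain them.

    Returns {frozenset of genome names: set of orthogroup ids}; each key
    corresponds to one region of the Venn diagram.
    """
    genomes_by_group = defaultdict(set)
    for genome, groups in presence.items():
        for og_id in groups:
            genomes_by_group[og_id].add(genome)
    regions = defaultdict(set)
    for og_id, genomes in genomes_by_group.items():
        regions[frozenset(genomes)].add(og_id)
    return dict(regions)


presence = {
    "A": {"OG1", "OG2", "OG3"},
    "B": {"OG1", "OG2"},
    "C": {"OG1", "OG4"},
}
for genomes, groups in venn_regions(presence).items():
    print(sorted(genomes), len(groups), sorted(groups))
```

Here OG3 lands in the region unique to A, OG2 in the region shared by A and B only, and OG1 in the center shared by all three.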

Presence / absence table

Heatmap

Accumulation plot

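(From the manual) This plot shows the relationship between the number of orthologs and the number of genomes considered. The green and blue lines are each based on a single random permutation of the selected genomes. The green line is the total number of orthologs present in the first n genomes of the permutation. The blue line is the number of orthologs shared by the first n genomes of the permutation. The red line is the number of orthologs present in exactly n genomes. Because this last curve does not depend on the permutation, its points can be clicked to get more detailed information.

(* When I tried it, only the green plot was displayed.)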
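The three curves described in the manual can be reproduced on toy data as follows. This is a sketch under my own assumptions (a seeded shuffle for the permutation, orthogroups represented as sets of ids), not zDB's code; the function name is hypothetical.

```python
import random


def accumulation_curves(presence, seed=0):
    """Compute the green (total), blue (shared), and red (exactly n) curves.

    presence maps a genome name to the set of orthogroup ids it contains.
    """
    genomes = list(presence)
    random.Random(seed).shuffle(genomes)  # one random permutation

    total, shared = set(), None
    total_sizes, shared_sizes = [], []
    for genome in genomes:
        groups = presence[genome]
        total |= groups  # seen in at least one of the first n genomes
        shared = set(groups) if shared is None else shared & groups
        total_sizes.append(len(total))
        shared_sizes.append(len(shared))

    # Red curve: how many orthogroups occur in exactly n genomes.
    occurrences = {}
    for groups in presence.values():
        for og_id in groups:
            occurrences[og_id] = occurrences.get(og_id, 0) + 1
    exactly_n = [
        sum(1 for count in occurrences.values() if count == n)
        for n in range(1, len(genomes) + 1)
    ]
    return total_sizes, shared_sizes, exactly_n


presence = {
    "A": {"OG1", "OG2", "OG3"},
    "B": {"OG1", "OG2"},
    "C": {"OG1", "OG4"},
}
total_sizes, shared_sizes, exactly_n = accumulation_curves(presence)
print(total_sizes, shared_sizes, exactly_n)
```

The intermediate values of the first two lists change with the permutation, while the last list (2 orthogroups in exactly one genome, 1 in two, 1 in three for this toy data) does not.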

Genome alignments

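Using Circos, you can visualize the alignment of a reference genome against one or more other genomes.

Specify the reference genome and one or more genomes to compare against it.

The outermost bar chart (red) shows the frequency of each orthogroup in the selected genomes. The thin light-blue/orange ring inside it shows the contigs of the reference genome. The third and fourth rings (gray) show the open reading frames on the forward and reverse strands of the reference genome; rRNAs are shown in pink. The colorful rings inside them show the presence (burgundy scale) or absence (light blue) of homologous proteins in one or more other genomes. The innermost ring (green) shows the GC content of each open reading frame of the reference genome.

Clicking an ORF in the third (gray) ring shows detailed information on the corresponding locus.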
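For reference, the per-ORF GC content plotted in the innermost ring is simply the fraction of G and C bases in the sequence. The sketch below shows the calculation; skipping ambiguous bases such as N is an assumption of this example, and how zDB handles them is not stated in the source.

```python
def gc_content(sequence):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of a sequence."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0  # nothing to count (empty or fully ambiguous sequence)
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)


print(gc_content("ATGC"))    # 0.5
print(gc_content("ggccNN"))  # 1.0
```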

sequence

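Scrolling down the top page confirms that the analyses based on database annotations are not available for this run.

Next, let's look at the results when the run module is executed with the annotation options. First, run zdb run to analyze the genomes.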

zdb run --input=input.csv --name=more_complete_run --cog --pfam --ko --swissprot

--cog perform cog annotation
--ko perform ko (metabolism) annotation
--pfam peform PFAM domain annotation
--swissprot search for homologs in the swissprot database
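When scripting several runs, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of zDB; it only uses flags that appear in the `zdb run --help` output shown earlier and builds the argument list without executing anything.

```python
import shlex

ANNOTATION_FLAGS = ("cog", "ko", "pfam", "swissprot")
ENVIRONMENTS = ("conda", "docker")  # Singularity is the default, no flag


def build_run_command(input_csv, name, annotations=(), environment=None):
    """Return the argv list for a zdb run invocation."""
    unknown = set(annotations) - set(ANNOTATION_FLAGS)
    if unknown:
        raise ValueError(f"unknown annotation(s): {sorted(unknown)}")
    if environment is not None and environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    argv = ["zdb", "run", f"--input={input_csv}", f"--name={name}"]
    argv += [f"--{flag}" for flag in ANNOTATION_FLAGS if flag in annotations]
    if environment is not None:
        argv.append(f"--{environment}")
    return argv


argv = build_run_command(
    "input.csv", "more_complete_run", ["cog", "pfam", "ko", "swissprot"]
)
print(shlex.join(argv))
# zdb run --input=input.csv --name=more_complete_run --cog --ko --pfam --swissprot
```

The returned list can be passed to `subprocess.run(argv, check=True)` once the reference databases have been set up.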

Then start the web app. If the first app is still running, the port numbers in two files in the output directory, zdb/gunicorn/gunicorn.py and zdb/nginx/nginx.config, must first be changed from the default of 8000. When ready, run it.

zdb webapp --name=more_complete_run

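Output

ANNOTATIONS and METABOLISM are now selectable. ANNOTATIONS lets you examine lists of orthogroups, COGs, KEGG orthologs, and Pfam domains that are shared by the selected genomes and absent from the others. (Put simply, it offers the same kinds of analyses as ORTHOLOGY, focused on COG, KEGG, and so on.)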

COGs

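Several analyses are provided. Some are the same as those introduced above.

Let's look at the Venn diagram.

The difference from ORTHOLOGY is that the results are centered on COGs (in ORTHOLOGY, the annotations were those of the proteins clustered into the same orthogroup).

Below is a quick look at the analyses that were not covered in the first run.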

Distribution of COGs within COG categories

Clicking shows the list of COGs.

Heatmap of frequency of genes identifed in each COG category
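The data behind such a heatmap is a genome-by-category count matrix. A minimal sketch of how it could be tabulated follows; the input layout (one COG category letter per annotated gene) and the function name are assumptions for illustration, not zDB's internal format.

```python
from collections import Counter


def cog_category_matrix(genome_categories):
    """Count genes per COG category for each genome.

    genome_categories maps a genome name to a list of category letters,
    one entry per annotated gene. Returns (sorted categories, {genome: row}).
    """
    counts = {g: Counter(cats) for g, cats in genome_categories.items()}
    categories = sorted(set().union(*counts.values()))
    matrix = {g: [counts[g][c] for c in categories] for g in counts}
    return categories, matrix


# Toy input using the COG letters J, K, and L.
categories, matrix = cog_category_matrix({
    "genome_A": ["J", "J", "K"],
    "genome_B": ["J", "L"],
})
print(categories)  # ['J', 'K', 'L']
print(matrix)      # {'genome_A': [2, 1, 0], 'genome_B': [1, 0, 1]}
```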

METABOLISM

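This section examines the metabolic pathways present in the genomes. The KEGG categories view shows which KEGG entries are present in each KEGG module. The entries can be displayed as a list or used to generate an annotated phylogenetic tree. In addition, the KEGG maps view visualizes whether the genes of a given pathway are present in the genomes of interest.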
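As a rough picture of what "which entries of a module are present" means, the sketch below reports, for each module, the fraction of its KEGG orthologs found in a genome. This is a deliberate simplification: real KEGG module definitions allow alternative enzymes and are not flat lists, and nothing here reflects how zDB scores modules. The module and ortholog identifiers are placeholders, not real KEGG ids.

```python
def module_coverage(module_kos, genome_kos):
    """For each module, return (fraction of its KOs present, sorted present KOs)."""
    genome_kos = set(genome_kos)
    results = {}
    for module, kos in module_kos.items():
        required = set(kos)
        present = sorted(required & genome_kos)
        fraction = len(present) / len(required) if required else 0.0
        results[module] = (fraction, present)
    return results


# Placeholder identifiers for illustration only.
modules = {
    "M_demo1": ["K_a", "K_b", "K_c", "K_d"],
    "M_demo2": ["K_e"],
}
genome = {"K_a", "K_b", "K_x"}
for module, (fraction, present) in module_coverage(modules, genome).items():
    print(module, fraction, present)
# M_demo1 0.5 ['K_a', 'K_b']
# M_demo2 0.0 []
```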