タンパク質のドメインレベルのアノテーションを行う Domainator

2024/06/03 タイトル修正、誤字修正

　生物学的配列の大規模なデータベースが利用可能になったことで、遺伝子の多様性と機能を深く探求する機会がもたらされた。細菌防御系は、多様であるがアノテーションが困難な遺伝子の豊富な供給源であり、バイオテクノロジーへの応用が期待される。この研究では、ドメインベースの遺伝子近傍およびタンパク質の検索、抽出、クラスタリングのための柔軟でモジュール化されたソフトウェアスイートであるDomainatorを紹介する。Domainatorの有用性を、細菌防御システムに関する3つの例を通して実証する。第一に、CRISPR-associated Rossman fold (CARF)を含む、エフェクタードメインの注釈付けが困難なタンパク質をクラスタリングし、そのほとんどを転写制御因子として、サブセットをRNAとして分類した。第二に、P4様ファージサテライト防御ホットスポットを抽出しクラスター化し、ラマースファージ防御システムに関連する豊富なシステムを同定した。第三に、タンパク質言語モデルをDomainatorに統合し、既知の参照配列との相同性が低い制限酵素を同定するために使用し、その一例の活性をin-vitroで検証した。Domainatorは、詳細なドキュメントと使用例を含むオープンソースパッケージとして公開されている。

レポジトリより

Domainatorは、配列データとアノテーションデータの両方のキャリアとしてGenBankファイルフォーマットを使用する。HMMプロファイル、タンパク質配列、またはその両方と同時に比較することで、配列に機能アノテーションを追加できる。例えば、Domainateを1回呼び出すだけで、ゲノムやメタゲノムからなるコンティグのセットに、Pfam HMM-profileへのヒットとREBASE Gold Standardタンパク質配列へのヒットを同時にアノテーションすることができる。

Domainatorは、コマンドラインまたはPython スクリプトとして、幅広いワークフローから構成される、数十の個別の柔軟なプログラムを提供している。Domainatorを構成する個々のプログラムは、ワークフローにおける典型的な役割に対応する6つのカテゴリーに大別できる（レポジトリの図の色分けされた部分）。

インストール

・DomainatorはLinuxで開発され、Macでもテストされている。Singularity（Apptainer ）のイメージのビルドにも対応する。ubuntu22にインストールした。

Github

#環境を作る
git clone https://github.com/nebiolabs/domainator.git
cd domainator/
mamba env create -f conda_env.yml
conda activate domainator

> domainate.py -h

usage:

version: 0.7.0

Annotate sequence files with hmm profiles

For each CDS or protein in the input file, run the hmmer3 against the target domain hmm database to annotate all the different known/detectable domains in the CDS.

Hits with E value lower than the threshold (default 10) will be added to the annotation in genbank output file as a new "feature"

NOTES: genbank files that contain translation annotations in their CDS features will run much faster than those without translation annotations.

domainate.py stores the entire reference set (usually the hmm database) in memory, so it is not suitable for cases where the reference

set is larger than your system memory (this may change in future versions).

[-h] -i INPUT [INPUT ...] [--fasta_type {nucleotide,protein}] [-r REFERENCES [REFERENCES ...]] [--foldseek FOLDSEEK [FOLDSEEK ...]] [--esm2_3Di_weights ESM2_3DI_WEIGHTS] [--esm2_3Di_device ESM2_3DI_DEVICE] [-o OUTPUT] [-Z Z] [-e EVALUE] [--min_evalue MIN_EVALUE]

[--max_domains MAX_DOMAINS] [--max_mode] [--include_taxids INCLUDE_TAXIDS [INCLUDE_TAXIDS ...]] [--exclude_taxids EXCLUDE_TAXIDS [EXCLUDE_TAXIDS ...]] [--ncbi_taxonomy_path NCBI_TAXONOMY_PATH] [--taxonomy_update] [--cpu CPU] [--max_overlap MAX_OVERLAP] [--hits_only]

[--no_annotations] [--batch_size BATCH_SIZE] [--offset OFFSET] [--recs_to_read RECS_TO_READ] [--gene_call {unannotated,all}]

options:

-h, --help show this help message and exit

-i INPUT [INPUT ...], --input INPUT [INPUT ...]

the genbank or fasta files to annotate. Genbank files can be nucleotide (with CDS annotations) or peptide. Fasta files must be peptide. (default: None)

--fasta_type {nucleotide,protein}

Whether the sequences in fasta files are protein or nucleotide sequences. (default: protein)

-r REFERENCES [REFERENCES ...], --references REFERENCES [REFERENCES ...]

the names of the HMM files with profiles to search. Or reference protein fasta files. (default: None)

--foldseek FOLDSEEK [FOLDSEEK ...]

Foldseek database files. (default: None)

--esm2_3Di_weights ESM2_3DI_WEIGHTS

checkpoint file for esm2_3Di model. (default: None)

--esm2_3Di_device ESM2_3DI_DEVICE

device to use for esm2_3Di model. (default: cuda:0)

-o OUTPUT, --output OUTPUT

output genbank filename. If not supplied, writes to stdout. (default: None)

-Z Z Passed as the -Z parameter to hmmsearch: Assert that the total number of targets in your searches is <x>, for the purposes of per-sequence E-value calculations, rather than the actual number of targets seen. Default: 1000

Set to 0 to use the number actual target sequences (currently this does does not account for taxonomic filtering).Supplying -Z is recommended for most use cases of domainate, as it will speed up the search and make comparisons between searches more meaningful. (default: 1000)

-e EVALUE, --evalue EVALUE

threshold E value for the domain hit. Must be <=10. [default 0.001] (default: 0.001)

--min_evalue MIN_EVALUE

hits with E value LOWER (as in BETTER) than this will be filtered out. NOT FREQUENTLY USED. Use only if you want to eliminate close matches. Must be >=0. [default 0] (default: 0.0)

--max_domains MAX_DOMAINS

the maximum number of domains to be reported per CDS. If not specified, then no max_domains filter is applied. (default: 0)

--max_mode Run hmmsearch/phmmer in maximum sensitivity mode, which is much slower, but more sensitive. (default: False)

--include_taxids INCLUDE_TAXIDS [INCLUDE_TAXIDS ...]

Space separated list of taxids to include. Contigs with taxonomy not in this list will be skipped. (default: None)

--exclude_taxids EXCLUDE_TAXIDS [EXCLUDE_TAXIDS ...]

Space separated list of taxids to exclude. Contigs with taxonomy in this list will be skipped. (default: None)

--ncbi_taxonomy_path NCBI_TAXONOMY_PATH

Path to NCBI taxonomy database directory. Will be created and downloaded if it does not exist. (default: /tmp/ncbi_taxonomy)

--taxonomy_update If taxonomy database exists, check it against the version on the ncbi server and update if there is a newer version. (default: False)

--cpu CPU the number of cores of the cpu which are used at a time to run the search [default: use all available cores] (default: 0)

--max_overlap MAX_OVERLAP

the maximum fractional of overlap between domains to be included in the annotated genbank. If >= 1, then no overlap filtering will be done. (default: 1)

--hits_only when activated, the ouptut will only have contigs with at least one domain annotation. In many cases, domain_search.py will be faster and more appropriate. (default: False)

--no_annotations when activated, new annotations will not be added to the file. This is useful if you are just trying to extract a set of contigs for annotation in a later step. (default: False)

--batch_size BATCH_SIZE

How many protein sequences to search at one time in a batch. Increasing this number might improve speeds at the cost of memory. (default: 10000)

--offset OFFSET File offset to start reading records at. (Note that this usually only makes sense when there is just a single input file). Default: start at the top of the file (default: None)

--recs_to_read RECS_TO_READ

Stop after reading this many records. Default: read all the records (default: None)

--gene_call {unannotated,all}

When activated, new CDS annotations will be added with Prodigal in Metagenomic mode. If 'all', then any existing CDS annotations will be deleted and all contigs will be re-annotated. If 'unannotated', then only contigs without CDS annotations will be annotated. [default: None]. If you supply a nucleotide fasta file, be sure to also specify --fasta_type nucleotide. Note that if you do gene calling, it is STRONGLY recommended that you also supply -Z, because database size is pre-calcuated at the beginning of the execution, whereas gene-calling is done on the fly. Not supplying Z may become an error in the future. (default: None)

テストラン

domainate.pyのラン - HMMのデータベースのリソースが必要。レポジトリではPfamを使っている。GenBankファイルもレポジトリに含まれないので、適当なものを用意する。

wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz

# Domainatorのラン。genbankは-iで、アノテーションDBは-rで指定する。
domainate.py --cpu 12 --max_overlap 0.6 -i genbank.gb -r Pfam-A.hmm -o domainator_output.gb

domainator_output.gbが出力される。

（大規模なデータベースに対して少数の hmm を検索する場合は domain_search.py を使う）

結果を見てみる。

まずオリジナルのアノテーション

DomainatorでP-famのリソースからのドメインアノテーションを追加後

summary_report.pyを使うと、配列数やドメインの種類ごとのアノテーション結果などをグラフや統計にまとめられる。

summary_report.py -i domainator_output.gb --html domain_summary.html

domain_summary.html

（以下省略）

enum_report.pyを使うと、domainatorでアノテーションしたgenbankファイルから、各コンティグやcdsのドメイン内容など、入力に関連するプロパティやメタデータのタブ区切りレポートを生成できる。

enum_report.py -i domainator_output.gb --by contig --name --taxname superkingdom genus species self --length --domains --html enum_report.html -o enum_report.tsv

各行がゲノムコンティグ、タンパク質、ドメインに対応し、値が長さ、分類学ID、ドメインコンテンツなどのデータであるタブ区切りのファイルが作成される。レコード単位のレポートは、GenBankやhmmファイルを読めないExcelなどのプログラムにデータをエクスポートするのに便利で、Domainatorのいくつかのプログラム間の中間ファイルとしても利用できる。

plot_contigs.pyを使うと、コンティグとそのアノテーションを視覚化できる。

plot_contigs.py -i domainator_output.gb --html contigs_plot.html

contigs_plot.html

Domainatorには全部で以下のようなコマンドが用意されている（レポジトリより）。

・build_projection.py - 2次元または3次元のumap, tsne, pca投影を生成

・build_ssn.py - 配列類似性ネットワーク(SSN)を持つCytoscapeファイルを生成

・build_tree.py - 距離行列をUPGMAツリーに変換

・color_genbank.py - dominatorのアノテーションに基づき、CDSおよび/またはdominatorのフィーチャーに色の修飾子を作成

・color_table_to_legend.py - アノテーション名とカラーコードをタブで区切ったカラーテーブルファイルに基づいてSVG形式の凡例を生成

・compare_contigs.py - domain adjacency indexやdomain jaccard indexのような指標に基づいてコンティグ間の類似性を計算

・deduplicate_genbank.py -genbankファイルからcdhitやusearchのようなクラスタリングアルゴリズムを実行して冗長性を減らす

・domain_search.py - コンティグまたはタンパク質配列をタンパク質ドメインでアノテーションする（大規模なデータベースに対応）

・domainator_db_download.py - dominatorに適した形式でデータベースをダウンロードする

・enum_report.py - genbankファイルから、入力に関連するプロパティやメタデータのレポートを生成

・extract_domains.py - domainator-genbankファイルを受け取り、指定されたドメインを抽出

・extract_peptides.py - ヌクレオチドのgenbank配列から翻訳した配列を抽出、domainatorのアノテーションを取り出す（他のアノテーションは失われる）。

・extract_unannotated.py - domainatorでアノテーションを付けたタンパク質のgenbankファイルからドメインアノテーションでカバーされていない領域を抽出

・filter_domains.py - domainatorでアノテーションを付けたgenbankファイルに対して、より厳密なフィルタリングを行う

・genbank_to_fasta.py - genbankファイルをfastaファイルに変換

・hmmer_build.py - hmmbuildを実行。HMMER3パッケージのhmmbuildと違い、コマンドラインからaccessionとdescriptionを指定できる

・hmmer_compare.py - ローカルプロファイルアラインメントを実行し、ペアごとのhmmスコアと位置ごとのアラインメントを計算

・hmmer_report.py - .hmm ファイルから表形式のレポートを生成

・hmmer_search.py - hmmプロファイルの参照セットとの類似性スコアに基づいてhmmプロファイルを選択

・hmmer_select.py - .hmm ファイルからプロファイルのサブセットを抽出

・matrix_report.py - データ行列からヒストグラムと要約統計量を生成

・partition_seqfile.py - genbankやfastaファイルをスキャンして、ファイルをサブセットに分割するためにファイルのバイトオフセット（特定のバイト位置にアクセスする目的で、データの位置をバイト単位で表す値）を見つける

・plot_contigs.py - コンティグとそのアノテーションをグラフィカルに表示（テストラン）

・select_by_cds.py - 個々のCDS内のCDS名やドメイン内容に基づいてコンティグやコンティグ領域を抽出

・select_by_contig.py - コンティグ全体のドメイン内容に基づいてコンティグのサブセットを抽出

・seq_dist.py - タンパク質配列間の類似性スコアを計算

・summary_report.py - コンティグ数、平均コンティグ長、各ドメインの頻度などの統計情報を計算（テストラン）

・transform_matrix.py - スコア形式の行列を様々な正規化されたスコア形式の行列に変換

・trim_contigs.py - コンティグの両端をトリミング

（使用するファイルフォーマットはレポジトリで説明されている。）

その他（レポジトリより）

Domainatorのプログラムでは、他のソフトウェアとは異なり、編集するファイルは-i引数で与える。編集の基準は-r引数のような他の引数で与える（参照配列やhmmプロファイルなど）。例えば、UniProtデータベースのクエリーに対するヒットの検索は、”domain_search.py -i uniprot.fasta -r query.hmm -o hits.gb”で行える。同様に、Pfamのアノテーションを持つコンティグのセットは”domainate.py -i unannotated.gb -r pfam-A.hmm -o annotated.gb”で行える。
コンティグやHMMプロファイル間のペアワイズスコアや距離行列を生成する比較プログラムが用意されている。compare_contigs.pyはJaccardとadjacencyインデックスを使用して、ドメインコンテンツに基づいてタンパク質や遺伝子近傍を比較する。seq_dist.pyはphmmer、Diamond、hmmsearch、Viterbiプロファイル比較アルゴリズムなどのローカルアライメントスコアを使用する。

引用

Domainator, a flexible software suite for domain-based annotation and neighborhood analysis, identifies proteins involved in antiviral systems
Sean R. Johnson, Peter Weigele, Alexey Fomenkov, Andrew Ge, Anna Vincze, Richard J. Roberts, Zhiyi Sun

bioRxiv, Posted April 26, 2024.