UPIMAPI: functional annotation and ID mapping from the UniProt database

2022/07/12 Revised

2023/03/05 Updated

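Omics and meta-omics technologies are powerful approaches for exploring microbial functions, but the size and complexity of omics datasets often make their analysis a challenging task. Software developed for omics and meta-omics analyses, and knowledgebases covering gene, protein, taxonomic, and functional annotation, are valuable resources for analysing omics data. Several bioinformatics resources exist for meta-omics analysis, but many of them require considerable computational expertise. Web interfaces are more user-friendly, but they often struggle with large data files such as those produced by metagenomics, metatranscriptomics, or metaproteomics.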

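This study presents three new bioinformatics tools. They are available through user-friendly command-line interfaces, can be run sequentially or stand-alone, and combine popular resources for functional annotation. UPIMAPI performs sequence homology-based annotation and retrieves data from UniProtKB (e.g. protein names, EC numbers, Gene Ontology, taxonomy, and cross-references to external databases). reCOGnizer performs multithreaded domain homology-based annotation of protein sequences against several functional databases (e.g. CDD, NCBIfam, Pfam, Protein Clusters, SMART, TIGRFAM, COG, and KOG), and additionally retrieves domain names and descriptions and EC numbers. KEGGCharter represents omics results, including differential gene expression, in KEGG metabolic pathways. It also shows the taxonomic assignment of the enzymes represented, which is particularly useful in metagenomics studies where several microorganisms are present.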

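Used together, reCOGnizer, UPIMAPI and KEGGCharter enable a comprehensive and complete functional characterization of large datasets, facilitating the interpretation of microbial activities in nature and in biotechnological processes.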

From the GitHub repository

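UPIMAPI is a command-line interface to UniProt's API that gives programmatic access to UniProt's ID mapping. It can handle very large numbers of UniProt IDs (millions) and retrieve their information with a single command. UPIMAPI also couples the fast annotation of DIAMOND with the convenience of retrieving information directly from UniProt, so annotation with DIAMOND can be run from the same tool.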

This post introduces UPIMAPI.

Installation

GitHub

# Dependencies
apt-get install packagekit-gtk3-module libasound2 libdbus-glib-1-2 libx11-xcb1

mamba create -n upimapi -y
conda activate upimapi
mamba install -c conda-forge -c bioconda upimapi -y

> upimapi.py --help

usage: upimapi.py [-h] [-i INPUT] [-o OUTPUT] [-ot OUTPUT_TABLE] [-rd RESOURCES_DIRECTORY] [-cols COLUMNS] [--blast] [--full-id FULL_ID] [--fasta] [--step STEP] [--max-tries MAX_TRIES] [--sleep SLEEP] [--no-annotation] [--local-id-mapping] [--skip-id-mapping] [--skip-id-checking] [--skip-db-check] [-v] [-db DATABASE] [-t THREADS] [--evalue EVALUE] [--pident PIDENT] [--bitscore BITSCORE] [-mts MAX_TARGET_SEQS] [-b BLOCK_SIZE] [-c INDEX_CHUNKS] [--taxids TAXIDS]

UniProt Id Mapping through API

options:

-h, --help show this help message and exit

-i INPUT, --input INPUT

Input filename - can be: 1. a file containing a list of IDs (comma-separated values, no spaces) 2. a BLAST TSV result file (requires to be specified with the --blast parameter 3. a protein FASTA file to be annotated (requires the --use-diamond and -db parameters) 4. nothing! If so, will read input from command line, and parse as CSV (id1,id2,...)

-o OUTPUT, --output OUTPUT

Folder to store outputs

-ot OUTPUT_TABLE, --output-table OUTPUT_TABLE

Filename of table output, where UniProt info is stored. If set, will override 'output' parameter just for that specific file

-rd RESOURCES_DIRECTORY, --resources-directory RESOURCES_DIRECTORY

Directory to store resources of UPIMAPI [~/upimapi_resources]

-cols COLUMNS, --columns COLUMNS

List of UniProt columns to obtain information from (separated by &)

--blast If input file is in BLAST TSV format (will consider one ID per line if not set) [false]

--full-id FULL_ID If IDs in database are in 'full' format: tr|XXX|XXX [auto]

--fasta Output will be generated in FASTA format [false]

--step STEP How many IDs to submit per request to the API [1000]

--max-tries MAX_TRIES

How many times to try obtaining information from UniProt before giving up [3]

--sleep SLEEP Time between requests (in seconds) [3]

--no-annotation Do not perform annotation - input must be in one of BLAST result or TXT IDs file or STDIN [false]

--local-id-mapping Perform local ID mapping of SwissProt IDs. Advisable if many IDs of SwissProt are present [false]

--skip-id-mapping If true, UPIMAPI will not perform ID mapping [false]

--skip-id-checking If true, UPIMAPI will not check if IDs are valid before mapping [false]

--skip-db-check So UPIMAPI doesn't check for (FASTA) database existence [false]

-v, --version show program's version number and exit

DIAMOND arguments:

-db DATABASE, --database DATABASE

How the reference database is inputted to UPIMAPI. 1. uniprot - UPIMAPI will download the entire UniProt and use it as reference 2. swissprot - UPIMAPI will download SwissProt and use it as reference 3. taxids - Reference proteomes will be downloaded for the taxa specified with the --taxids, and those will be used as reference 4. a custom database - Input will be considered as the database, and will be used as reference

-t THREADS, --threads THREADS

Number of threads to use in annotation steps [total available - 2]

--evalue EVALUE Maximum e-value to report annotations for [1e-3]

--pident PIDENT Minimum pident to report annotations for.

--bitscore BITSCORE Minimum bit score to report annotations for (overrides e-value).

-mts MAX_TARGET_SEQS, --max-target-seqs MAX_TARGET_SEQS

Number of annotations to output per sequence inputed [1]

-b BLOCK_SIZE, --block-size BLOCK_SIZE

Billions of sequence letters to be processed at a time (default: auto determine best value)

-c INDEX_CHUNKS, --index-chunks INDEX_CHUNKS

Number of chunks for processing the seed index (default: auto determine best value)

--taxids TAXIDS Tax IDs to obtain protein sequences of for building a reference database.

A tool for retrieving information from UniProt.

Usage

Specify a protein FASTA file and a database. The specified database is downloaded and a DIAMOND BLASTP search is run, after which ID mapping is performed.

upimapi.py -i input_proteins.fasta -o out_dir -t 20 -db swissprot

-t Number of threads to use in annotation steps [total available - 2]
--evalue Maximum e-value to report annotations for [1e-3]
--pident Minimum pident to report annotations for.
--bitscore Minimum bit score to report annotations for (overrides e-value).
-mts <MAX_TARGET_SEQS>, --max-target-seqs <MAX_TARGET_SEQS> Number of annotations to output per sequence inputed [1]
-b Billions of sequence letters to be processed at a time (default: auto determine best value)
-db How the reference database is inputted to UPIMAPI. 1. uniprot - UPIMAPI will download the entire UniProt and use it as reference 2. swissprot - UPIMAPI will download SwissProt and use it as reference 3. taxids - Reference proteomes will be downloaded for the taxa specified with the --taxids, and those will be used as reference 4. a custom database - Input will be considered as the database, and will be used as reference
--taxids Tax IDs to obtain protein sequences of for building a reference database.

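By default the database is saved on the system disk under /home/user/upimapi_resources/. If a database already exists there, UPIMAPI asks at run time whether to download the latest version and overwrite it or to use the existing one.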
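If the system disk is short on space, the resources directory can be moved with the -rd option listed in the help output. The following is a minimal sketch under stated assumptions: the directory name is an arbitrary choice for illustration, and the command is only printed so that nothing is downloaded until you run it yourself.

```shell
# Hypothetical location for the UPIMAPI resources; pick any disk with enough room.
resources_dir="$PWD/upimapi_resources"
mkdir -p "$resources_dir"

# Show the free space on the filesystem that holds the directory.
df -Pk "$resources_dir"

# The swissprot run from this post, plus -rd to store the database in $resources_dir.
cmd="upimapi.py -i input_proteins.fasta -o out_dir -t 20 -db swissprot -rd $resources_dir"
echo "$cmd"   # printed only; run the printed command to start the download
```

Keeping the resources directory outside the home directory also makes it easier to share one downloaded database between several users or projects.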

Example output

out_dir/

aligned.blast: proteins that were annotated

unaligned.blast: proteins that were not annotated

uniprotinfo.tsv: contains the information from the specified database.

UPIMAPI_results.tsv
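As a quick sanity check on the alignment output, the sketch below counts how many query proteins received a hit and lists the hits that pass a stricter e-value cutoff. It assumes that aligned.blast is a headerless, tab-separated table in the usual 12-column BLAST/DIAMOND tabular layout (query ID in column 1, subject ID in column 2, e-value in column 11). The post does not confirm this layout, so inspect your own file first. The demo file, its protein names and its accessions are made up for illustration.

```shell
# Stand-in for out_dir/aligned.blast with three made-up, tab-separated rows.
mkdir -p demo_out
blast=demo_out/aligned.blast
{
  printf 'prot1\tsp|X00001|DEMO1_TEST\t98.5\t353\t5\t0\t1\t353\t1\t353\t1e-200\t700\n'
  printf 'prot2\tsp|X00002|DEMO2_TEST\t35.0\t120\t70\t3\t1\t118\t10\t130\t5e-04\t45\n'
  printf 'prot2\tsp|X00003|DEMO3_TEST\t33.1\t118\t72\t4\t2\t117\t12\t129\t8e-04\t43\n'
} > "$blast"

# Number of distinct query proteins (column 1) with at least one hit.
n_queries=$(awk -F'\t' '!seen[$1]++ { n++ } END { print n + 0 }' "$blast")
echo "annotated queries: $n_queries"

# Query, subject and e-value for hits at or below a stricter e-value cutoff.
strict_hits=$(awk -F'\t' '$11 + 0 <= 1e-10 { print $1 "\t" $2 "\t" $11 }' "$blast")
echo "$strict_hits"
```

To use it on real results, point the blast variable at out_dir/aligned.blast and drop the lines that create the demo file.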

When FASTA sequences are given as the database, UPIMAPI creates a new DIAMOND-format database for annotation. Because UPIMAPI performs ID mapping with UniProt IDs, the database must contain UniProt IDs.
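Before passing a FASTA file as the database, it can help to check that its headers actually carry UniProt IDs. The sketch below counts headers written in the db|ACCESSION|ENTRY_NAME style (the 'full' format tr|XXX|XXX mentioned in the help output) and lists the ones that are not. The demo file, its accessions and the pattern are illustrative assumptions. Headers holding only a bare accession may also be accepted, as the --full-id option suggests, so treat flagged headers as something to inspect rather than as a hard error.

```shell
# Made-up custom database: two UniProt-style headers and one that is not.
cat > demo_db.fasta <<'EOF'
>sp|X00001|DEMO1_TEST made-up protein 1
MKTAYIAKQR
>tr|X00002|X00002_TEST made-up protein 2
MSSHEGGKKK
>contig_17_orf3 header without a UniProt ID
MAAPLLTRSV
EOF

# Pattern for the db|ACCESSION|ENTRY_NAME header style.
pattern='^>(sp|tr)\|[^|]+\|'

total=$(grep -c '^>' demo_db.fasta)
uniprot_like=$(grep -cE "$pattern" demo_db.fasta)
echo "$uniprot_like of $total headers look like UniProt headers"

# List the headers that may need attention before using the file with -db.
suspect=$(grep '^>' demo_db.fasta | grep -vE "$pattern")
echo "$suspect"
```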
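When the taxonomic composition is known, as with pure cultures, UPIMAPI can build a database from the reference proteomes of the known taxa. To build a reference for specific taxa, specify the database as --database taxids and the tax IDs as --taxids taxid1,taxid2,taxid3,...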
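The taxonomy-based run can be sketched as follows. The flag is spelled --taxids, as in the help output. Passing the IDs as one comma-separated value is an assumption based on the help showing a single TAXIDS argument, so check it against your installed version. The two tax IDs are only examples, and the command is printed rather than executed.

```shell
# Example NCBI taxonomy IDs: 83333 (Escherichia coli K-12), 224308 (Bacillus subtilis 168).
taxids=$(printf '%s\n' 83333 224308 | paste -sd, -)

# Reference proteomes for these taxa would be downloaded and used as the database.
cmd="upimapi.py -i input_proteins.fasta -o out_dir -t 20 -db taxids --taxids $taxids"
echo "$cmd"   # printed only; run the printed command to build the database and annotate
```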
If you are only interested in annotating proteins of a specific family (e.g. hydrogenases), you can supply a custom database. Such a database must be built manually from UniProt. To supply a custom database, specify --database database.fasta.
With the --fasta argument, the protein sequences corresponding to the input IDs are written out as a FASTA file.

Citation

UPIMAPI, reCOGnizer and KEGGCharter: Bioinformatics tools for functional annotation and visualization of (meta)-omics datasets
João C Sequeira, Miguel Rocha, M Madalena Alves, Andreia F Salvador

Comput Struct Biotechnol J. 2022 Apr 9;20:1798-1810