reCOGnizer - macでインフォマティクス

From GitHub

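reCOGnizer performs domain-based annotation using RPS-BLAST with the CDD databases as reference. The reference databases currently implemented are CDD, NCBIfam, Pfam, TIGRFAM, Protein Clusters, SMART, COG, and KOG. reCOGnizer builds these databases in split form so that RPS-BLAST can run multithreaded, which greatly speeds up annotation. After assigning domains to proteins, reCOGnizer converts the CDD IDs to the IDs of each source database and retrieves the domain descriptions available in CDD. It then retrieves further information depending on the database in question. NCBIfam, Pfam, TIGRFAM, and Protein Clusters annotations are complemented with classifications and EC numbers. SMART annotations are complemented with SMART descriptions. COG and KOG annotations are complemented with COG categories, EC numbers, and KEGG Orthologs (the latter for COG only).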
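
To make the idea of splitting a database for parallel searches concrete, here is a minimal sketch. It is not reCOGnizer's actual code: it assumes a plain-text list of profile names (the hypothetical file profiles.txt) and deals its lines round-robin into N chunk files, one per thread. The function name split_list and the chunk file naming are illustrative assumptions.

```shell
# split_list LIST N PREFIX
# Deal the lines of LIST round-robin into N files named
# PREFIX_0.txt ... PREFIX_(N-1).txt.
# Illustrative only: how reCOGnizer partitions its databases may differ.
split_list() {
  awk -v n="$2" -v prefix="$3" '
    { out = prefix "_" ((NR - 1) % n) ".txt"; print > out }
  ' "$1"
}

# Toy demonstration with five made-up profile names and two chunks.
workdir=$(mktemp -d)
printf 'prof1\nprof2\nprof3\nprof4\nprof5\n' > "$workdir/profiles.txt"
split_list "$workdir/profiles.txt" 2 "$workdir/chunk"
wc -l "$workdir"/chunk_*.txt
rm -rf "$workdir"
```

Round-robin assignment keeps the chunks within one line of each other in size, so no single thread is left with a much larger share of the work.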

Installation

GitHub

mamba create -n recognizer -y
conda activate recognizer
mamba install -c conda-forge -c bioconda recognizer -y

$ recognizer.py -v

$ recognizer.py -h

usage: recognizer.py [-h] [-f FILE] [-t THREADS] [--evalue EVALUE] [--pident PIDENT] [-o OUTPUT] [-dr] [-rd RESOURCES_DIRECTORY] [-dbs DATABASES] [-db DATABASE]
                     [--custom-database] [-mts MAX_TARGET_SEQS] [--keep-spaces] [--no-output-sequences] [--no-blast-info] [--quiet] [-sd] [--keep-intermediates] [-v]
                     [--tax-file TAX_FILE] [--protein-id-col PROTEIN_ID_COL] [--tax-col TAX_COL] [--species-taxids]

reCOGnizer - a tool for domain based annotation with the CDD database

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Fasta file with protein sequences for annotation
  -t THREADS, --threads THREADS
                        Number of threads for reCOGnizer to use [max available - 2]
  --evalue EVALUE       Maximum e-value to report annotations for [1e-2]
  --pident PIDENT       [DEPRECATED] Minimum pident to report annotations for [0]
  -o OUTPUT, --output OUTPUT
                        Output directory [reCOGnizer_results]
  -dr, --download-resources
                        If resources for reCOGnizer are not available at "resources_directory" [false]
  -rd RESOURCES_DIRECTORY, --resources-directory RESOURCES_DIRECTORY
                        Output directory for storing databases and other resources [~/recognizer_resources]
  -dbs DATABASES, --databases DATABASES
                        Databases to include in functional annotation (comma-separated) [all available]
  -db DATABASE, --database DATABASE
                        Basename of database for annotation. If multiple databases, use comma separated list (db1,db2,db3)
  --custom-database     If database was NOT produced by reCOGnizer
  -mts MAX_TARGET_SEQS, --max-target-seqs MAX_TARGET_SEQS
                        Number of maximum identifications for each protein [1]
  --keep-spaces         BLAST ignores sequences IDs after the first space. This option changes all spaces to underscores to keep the full IDs.
  --no-output-sequences
                        Protein sequences from the FASTA input will be stored in their own column.
  --no-blast-info       Information from the alignment will be stored in their own columns.
  --quiet               Don't output download information, used mainly for CI.
  -sd, --skip-downloaded
                        Skip download of resources detected as already downloaded.
  --keep-intermediates  Keep intermediate annotation files generated in reCOGnizer's workflow, i.e., ASN, RPSBPROC and BLAST reports and split FASTA inputs.
  -v, --version         show program's version number and exit

Taxonomy Arguments:
  --tax-file TAX_FILE   File with taxonomic identification of proteins inputted (TSV). Must have one line per query, query name on first column, taxid on second.
  --protein-id-col PROTEIN_ID_COL
                        Name of column with protein headers as in supplied FASTA file [qseqid]
  --tax-col TAX_COL     Name of column with tax IDs of proteins [Taxonomic identifier (SPECIES)]
  --species-taxids      If tax col contains Tax IDs of species (required for running COG taxonomic)

Input file must be specified.
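
The help text requires the --tax-file TSV to have one line per query, with the query name in the first column and the taxid in the second. Below is a minimal sketch of a pre-flight check for such a file. It rests on assumptions not confirmed by the help text: that the first line is a header (suggested by the column-name defaults of --protein-id-col and --tax-col) and that taxids are plain integers. The function name check_tax_file is hypothetical and not part of reCOGnizer.

```shell
# check_tax_file FILE
# Return 0 if every data row has a non-empty query name in column 1 and
# an integer taxid in column 2, and no query name appears twice.
check_tax_file() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    NR == 1 { next }   # assumption: the first line is a header
    $1 == "" || $2 !~ /^[0-9]+$/ { printf "line %d: malformed row\n", NR; bad = 1 }
    seen[$1]++ { printf "line %d: duplicate query %s\n", NR, $1; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Toy file with made-up protein names; the header uses the defaults from the help text.
tmp=$(mktemp)
printf 'qseqid\tTaxonomic identifier (SPECIES)\nprotA\t562\nprotB\t1280\n' > "$tmp"
if check_tax_file "$tmp"; then echo "tax file looks consistent"; fi
rm -f "$tmp"
```

Running the check before a long annotation job catches duplicated queries or empty taxid cells early, rather than after the RPS-BLAST step has finished.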

Usage

Specify a protein FASTA file and an output directory. On the first run, the databases are downloaded into the directory given with "-rd".

recognizer.py -f input_file.fasta -o recognizer_output -rd resources_directory

-f Fasta file with protein sequences for annotation
-t Number of threads for reCOGnizer to use [max available - 2]
--evalue Maximum e-value to report annotations for [1e-2]
--pident [DEPRECATED] Minimum pident to report annotations for [0]
-o Output directory [reCOGnizer_results]
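
As a sketch of how several of the flags from the help text can be combined, the helper below only prints a command line for review and does not run reCOGnizer. The function name recognizer_cmd, the file and directory names, and the chosen values (8 threads, e-value 1e-5, 3 identifications per protein) are arbitrary illustrations, not recommendations.

```shell
# recognizer_cmd FASTA OUTDIR RESOURCES [THREADS]
# Print (do not run) a reCOGnizer call built from flags shown in the help text.
# Paths containing spaces would need extra quoting; this sketch ignores that.
recognizer_cmd() {
  printf 'recognizer.py -f %s -o %s -rd %s -t %s --evalue 1e-5 -mts 3 -sd\n' \
    "$1" "$2" "$3" "${4:-4}"
}

# Hypothetical paths; review the printed line before running it yourself.
recognizer_cmd proteins.faa recognizer_output recognizer_resources 8
```

Compared with the defaults, this tightens the e-value cutoff from 1e-2 to 1e-5, reports up to 3 identifications per protein instead of 1, and uses -sd to skip resources detected as already downloaded.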

reCOGnizer_results/

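Two main outputs are generated in the output directory:

reCOGnizer_results.tsv: a table containing the annotations for each protein.
cog_quantification and its Krona representation: a description of the functional landscape of the proteins in the input file.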
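
For a first look at reCOGnizer_results.tsv, a small sketch like the following may help. It makes no assumption about the real column names: list_columns shows whatever header the file has, and count_by_column tabulates a column chosen by position. Both function names are hypothetical, and the toy table uses made-up columns rather than actual reCOGnizer output.

```shell
# list_columns FILE: print the header fields of a TSV, numbered, one per line.
list_columns() {
  head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# count_by_column FILE N: count how often each value occurs in column N
# (header row skipped), sorted by value.
count_by_column() {
  awk -F '\t' -v col="$2" '
    NR > 1 { n[$col]++ }
    END { for (v in n) printf "%s\t%d\n", v, n[v] }
  ' "$1" | sort
}

# Toy table with made-up columns, standing in for a real results file.
tmp=$(mktemp)
printf 'query\tdatabase\np1\tCOG\np2\tPfam\np3\tCOG\n' > "$tmp"
list_columns "$tmp"
count_by_column "$tmp" 2
rm -f "$tmp"
```

On a real run, call list_columns on reCOGnizer_results.tsv inside the output directory first, then pass the number of the column of interest to count_by_column.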

Citation

UPIMAPI, reCOGnizer and KEGGCharter: Bioinformatics tools for functional annotation and visualization of (meta)-omics datasets
João C Sequeira, Miguel Rocha, M Madalena Alves, Andreia F Salvador

Comput Struct Biotechnol J. 2022 Apr 9;20:1798-1810