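From the GitHub repository:
reCOGnizer performs domain-based annotation using RPS-BLAST with the CDD databases as reference. The reference databases currently implemented are CDD, NCBIfam, Pfam, TIGRFAM, Protein Clusters, SMART, COG and KOG. reCOGnizer builds these databases in split form so that RPS-BLAST can run multithreaded, which significantly speeds up annotation. After assigning domains to proteins, reCOGnizer converts the CDD IDs to the IDs of each database and retrieves the domain descriptions available in CDD. It then retrieves further information depending on the database in question. NCBIfam, Pfam, TIGRFAM and Protein Clusters annotations are complemented with classification and EC numbers. SMART annotations are complemented with SMART descriptions. COG and KOG annotations are complemented with COG categories, EC numbers and, for COG, KEGG Orthologs.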
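The idea of splitting a database so that each thread searches its own part can be sketched as follows. This is only an illustration of even partitioning, not reCOGnizer's actual code; the function name split_evenly and the profile file names are hypothetical.

```python
from typing import List, Sequence


def split_evenly(items: Sequence[str], n_parts: int) -> List[List[str]]:
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(items), n_parts)
    chunks: List[List[str]] = []
    start = 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


if __name__ == "__main__":
    # Hypothetical profile file names, used purely for illustration.
    profiles = [f"profile_{i:03d}.smp" for i in range(1, 11)]
    for part_number, part in enumerate(split_evenly(profiles, 4), start=1):
        print(part_number, part)
```

Each chunk could then be built into its own database and searched by a separate RPS-BLAST process, which is the general pattern the description above refers to.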
Installation
mamba create -n recognizer -y
conda activate recognizer
mamba install -c conda-forge -c bioconda recognizer -y
> recognizer.py -v
$ recognizer.py -h
usage: recognizer.py [-h] [-f FILE] [-t THREADS] [--evalue EVALUE] [--pident PIDENT] [-o OUTPUT] [-dr] [-rd RESOURCES_DIRECTORY] [-dbs DATABASES] [-db DATABASE]
[--custom-database] [-mts MAX_TARGET_SEQS] [--keep-spaces] [--no-output-sequences] [--no-blast-info] [--quiet] [-sd] [--keep-intermediates] [-v]
[--tax-file TAX_FILE] [--protein-id-col PROTEIN_ID_COL] [--tax-col TAX_COL] [--species-taxids]
reCOGnizer - a tool for domain based annotation with the CDD database
options:
-h, --help show this help message and exit
-f FILE, --file FILE Fasta file with protein sequences for annotation
-t THREADS, --threads THREADS
Number of threads for reCOGnizer to use [max available - 2]
--evalue EVALUE Maximum e-value to report annotations for [1e-2]
--pident PIDENT [DEPRECATED] Minimum pident to report annotations for [0]
-o OUTPUT, --output OUTPUT
Output directory [reCOGnizer_results]
-dr, --download-resources
If resources for reCOGnizer are not available at "resources_directory" [false]
-rd RESOURCES_DIRECTORY, --resources-directory RESOURCES_DIRECTORY
Output directory for storing databases and other resources [~/recognizer_resources]
-dbs DATABASES, --databases DATABASES
Databases to include in functional annotation (comma-separated) [all available]
-db DATABASE, --database DATABASE
Basename of database for annotation. If multiple databases, use comma separated list (db1,db2,db3)
--custom-database If database was NOT produced by reCOGnizer
-mts MAX_TARGET_SEQS, --max-target-seqs MAX_TARGET_SEQS
Number of maximum identifications for each protein [1]
--keep-spaces BLAST ignores sequences IDs after the first space. This option changes all spaces to underscores to keep the full IDs.
--no-output-sequences
Protein sequences from the FASTA input will be stored in their own column.
--no-blast-info Information from the alignment will be stored in their own columns.
--quiet Don't output download information, used mainly for CI.
-sd, --skip-downloaded
Skip download of resources detected as already downloaded.
--keep-intermediates Keep intermediate annotation files generated in reCOGnizer's workflow, i.e., ASN, RPSBPROC and BLAST reports and split FASTA inputs.
-v, --version show program's version number and exit
Taxonomy Arguments:
--tax-file TAX_FILE File with taxonomic identification of proteins inputted (TSV). Must have one line per query, query name on first column, taxid on second.
--protein-id-col PROTEIN_ID_COL
Name of column with protein headers as in supplied FASTA file [qseqid]
--tax-col TAX_COL Name of column with tax IDs of proteins [Taxonomic identifier (SPECIES)]
--species-taxids If tax col contains Tax IDs of species (required for running COG taxonomic)
Input file must be specified.
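The help text above describes the --tax-file input as a TSV with the query name in the first column and the taxid in the second, and gives default column names for --protein-id-col and --tax-col. A minimal sketch for writing such a file is shown here. The function name write_tax_file and the protein IDs are hypothetical, and writing a header row with the default column names is an assumption based on those two options.

```python
import csv
from typing import Mapping


def write_tax_file(taxids: Mapping[str, int], path: str,
                   protein_id_col: str = "qseqid",
                   tax_col: str = "Taxonomic identifier (SPECIES)") -> None:
    """Write one line per query: protein ID in column 1, taxid in column 2."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        # Header names follow the defaults shown in the help text (assumed to be required).
        writer.writerow([protein_id_col, tax_col])
        for protein_id, taxid in taxids.items():
            writer.writerow([protein_id, taxid])


if __name__ == "__main__":
    # Hypothetical protein IDs; 562 and 1423 are the NCBI taxids of
    # Escherichia coli and Bacillus subtilis.
    write_tax_file({"protein_1": 562, "protein_2": 1423}, "tax_file.tsv")
```

The resulting file would then be passed with --tax-file tax_file.tsv.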
Usage
Specify the protein FASTA file and the output directory. On the first run, the databases are downloaded to the directory given with "-rd".
recognizer.py -f input_file.fasta -o recognizer_output -rd resources_directory
- -f Fasta file with protein sequences for annotation
- -t Number of threads for reCOGnizer to use [max available - 2]
- --evalue Maximum e-value to report annotations for [1e-2]
- --pident [DEPRECATED] Minimum pident to report annotations for [0]
- -o Output directory [reCOGnizer_results]
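When reCOGnizer is called from a script, the command line can be assembled from the options listed above. The following is a minimal sketch that uses only flags shown in the help text (-f, -o, -rd, -t, --evalue); the helper names build_recognizer_command and run_recognizer are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_recognizer_command(fasta: str, output: str, resources: str,
                             threads: Optional[int] = None,
                             evalue: Optional[float] = None) -> List[str]:
    """Assemble the argument list for recognizer.py from documented flags."""
    cmd = ["recognizer.py", "-f", fasta, "-o", output, "-rd", resources]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if evalue is not None:
        cmd += ["--evalue", str(evalue)]
    return cmd


def run_recognizer(cmd: List[str]) -> None:
    """Run the command; requires reCOGnizer to be installed and on PATH."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    command = build_recognizer_command(
        "input_file.fasta", "recognizer_output", "resources_directory", threads=8
    )
    # Print the command instead of running it, so the sketch works without the tool.
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.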
reCOGnizer_results/
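Two main outputs are generated in the output directory.
- reCOGnizer_results.tsv: a table containing the annotation of each protein.
- cog_quantification and its Krona representation: a description of the functional landscape of the proteins in the input file.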
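The annotation table can be post-processed with a few lines of Python. The sketch below keeps only rows at or below an e-value cutoff. The column names "evalue" and "qseqid" are assumptions about the table layout and should be checked against the header of the actual file; the function name load_annotations is hypothetical.

```python
import csv
from typing import Dict, List


def load_annotations(path: str, evalue_col: str = "evalue",
                     max_evalue: float = 1e-5) -> List[Dict[str, str]]:
    """Return rows of a TSV whose e-value is at or below max_evalue."""
    kept: List[Dict[str, str]] = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                if float(row.get(evalue_col, "")) <= max_evalue:
                    kept.append(row)
            except (TypeError, ValueError):
                # Skip rows where the e-value is missing or not numeric.
                continue
    return kept


if __name__ == "__main__":
    # Assumed location of the table inside the output directory.
    rows = load_annotations("recognizer_output/reCOGnizer_results.tsv")
    print(len(rows), "annotations pass the cutoff")
```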
Citation
UPIMAPI, reCOGnizer and KEGGCharter: Bioinformatics tools for functional annotation and visualization of (meta)-omics datasets
João C Sequeira, Miguel Rocha, M Madalena Alves, Andreia F Salvador
Comput Struct Biotechnol J. 2022 Apr 9;20:1798-1810
Related