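As the title says. The compute server has been unreachable more and more often, so I will go through the steps for running eggNOG-mapper locally. You have to provide your own machine, but a local run lets you fine-tune parameters, and limits such as the 100,000-sequence cap do not apply, so you can run it much more freely.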
Documentation
eggNOG mapper v2.1.5 to v2.1.12 · eggnogdb/eggnog-mapper Wiki · GitHub
Installation
mamba create -n eggnog-mapper python=3
conda activate eggnog-mapper
# install with conda (bioconda)
mamba install -c bioconda -c conda-forge eggnog-mapper -y
# or install with pip
pip install eggnog-mapper
$ emapper.py -v  # show version
emapper-2.1.12 / Expected eggNOG DB version: 5.0.2 / Installed eggNOG DB version: 5.0.2 / Diamond version found: diamond version 2.1.11 / MMseqs2 version found: 113e3212c137d026e297c7540e1fcd039f6812b1 / Compatible novel families DB version: 1.0.1
$ emapper.py -h
usage: emapper.py [-h] [-v] [--list_taxa] [--cpu NUM_CPU] [--mp_start_method {fork,spawn,forkserver}] [--resume] [--override] [-i FASTA_FILE] [--itype {CDS,proteins,genome,metagenome}] [--translate] [--annotate_hits_table SEED_ORTHOLOGS_FILE] [-c FILE] [--data_dir DIR] [--genepred {search,prodigal}]
[--trans_table TRANS_TABLE_CODE] [--training_genome FILE] [--training_file FILE] [--allow_overlaps {none,strand,diff_frame,all}] [--overlap_tol FLOAT] [-m {diamond,mmseqs,hmmer,no_search,cache,novel_fams}] [--pident PIDENT] [--query_cover QUERY_COVER] [--subject_cover SUBJECT_COVER]
[--evalue EVALUE] [--score SCORE] [--dmnd_algo {auto,0,1,ctg}] [--dmnd_db DMND_DB_FILE] [--sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}] [--dmnd_iterate {yes,no}] [--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]
[--dmnd_frameshift DMND_FRAMESHIFT] [--gapopen GAPOPEN] [--gapextend GAPEXTEND] [--block_size BLOCK_SIZE] [--index_chunks CHUNKS] [--outfmt_short] [--dmnd_ignore_warnings] [--mmseqs_db MMSEQS_DB_FILE] [--start_sens START_SENS] [--sens_steps SENS_STEPS] [--final_sens FINAL_SENS]
[--mmseqs_sub_mat SUBS_MATRIX] [-d HMMER_DB_PREFIX] [--servers_list FILE] [--qtype {hmm,seq}] [--dbtype {hmmdb,seqdb}] [--usemem] [-p PORT] [--end_port PORT] [--num_servers NUM_SERVERS] [--num_workers NUM_WORKERS] [--timeout_load_server TIMEOUT_LOAD_SERVER] [--hmm_maxhits MAXHITS]
[--report_no_hits] [--hmm_maxseqlen MAXSEQLEN] [--Z DB_SIZE] [--cut_ga] [--clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans] [--no_annot] [--dbmem] [--seed_ortholog_evalue MIN_E-VALUE] [--seed_ortholog_score MIN_SCORE] [--tax_scope TAX_SCOPE] [--tax_scope_mode TAX_SCOPE_MODE]
[--target_orthologs {one2one,many2one,one2many,many2many,all}] [--target_taxa LIST_OF_TAX_IDS] [--excluded_taxa LIST_OF_TAX_IDS] [--report_orthologs] [--go_evidence {experimental,non-electronic,all}] [--pfam_realign {none,realign,denovo}] [--md5] [--output FILE_PREFIX] [--output_dir DIR]
[--scratch_dir DIR] [--temp_dir DIR] [--no_file_comments] [--decorate_gff DECORATE_GFF] [--decorate_gff_ID_field DECORATE_GFF_ID_FIELD] [--excel]
optional arguments:
-h, --help show this help message and exit
-v, --version show version and exit. (default: False)
--list_taxa List taxa available for --tax_scope/--tax_scope_mode, and exit (default: False)
Execution Options:
--cpu NUM_CPU Number of CPUs to be used. --cpu 0 to run with all available CPUs. (default: 1)
--mp_start_method {fork,spawn,forkserver}
Sets the python multiprocessing start method. Check https://docs.python.org/3/library/multiprocessing.html. Only use if the default method is not working properly in your OS. (default: spawn)
--resume Resumes a previous emapper run, skipping results in existing output files. (default: False)
--override Overwrites output files if they exist. By default, execution is aborted if conflicting files are detected. (default: False)
Input Data Options:
-i FASTA_FILE Input FASTA file containing query sequences (proteins by default; see --itype and --translate). Required unless -m no_search. (default: None)
--itype {CDS,proteins,genome,metagenome}
Type of data in the input (-i) file. (default: proteins)
--translate When --itype CDS, translate CDS to proteins before search. When --itype genome/metagenome and --genepred search, translate predicted CDS from blastx hits to proteins. (default: False)
--annotate_hits_table SEED_ORTHOLOGS_FILE
Annotate TSV formatted table with 4 fields: query, hit, evalue, score. Usually, a .seed_orthologs file from a previous emapper.py run. Requires -m no_search. (default: None)
-c FILE, --cache FILE
File containing annotations and md5 hashes of queries, to be used as cache. Required if -m cache (default: None)
--data_dir DIR Path to eggnog-mapper databases. By default, "data/" or the path specified in the environment variable EGGNOG_DATA_DIR. (default: None)
Gene Prediction Options:
--genepred {search,prodigal}
This is applied when --itype genome or --itype metagenome. search: gene prediction is inferred from Diamond/MMseqs2 blastx hits. prodigal: gene prediction is performed using Prodigal. (default: search)
--trans_table TRANS_TABLE_CODE
This option will be used for Prodigal, Diamond or MMseqs2, depending on the mode. For Diamond searches, this option corresponds to the --query-gencode option. For MMseqs2 searches, this option corresponds to the --translation-table option. For Prodigal, this option corresponds to
the -g/--trans_table option. It is also used when --translate, check https://biopython.org/docs/1.75/api/Bio.Seq.html#Bio.Seq.Seq.translate. Default is the corresponding programs defaults. (default: None)
--training_genome FILE
A genome to run Prodigal in training mode. If this parameter is used, Prodigal will run in two steps: firstly in training mode, and secondly using the training to analize the emapper input data. See Prodigal documentation about Traning mode for more info. Only used if --genepred
prodigal. (default: None)
--training_file FILE A training file from Prodigal training mode. If this parameter is used, Prodigal will run using this training file to analyze the emapper input data. Only used if --genepred prodigal. (default: None)
--allow_overlaps {none,strand,diff_frame,all}
When using 'blastx'-based genepred (--genepred search --itype genome/metagenome) this option controls whether overlapping hits are reported or not, or if only those overlapping hits in a different strand or frame are reported. Also, the degree of accepted overlap can be controlled
with --overlap_tol. (default: none)
--overlap_tol FLOAT This value (0-1) is the proportion such that if (overlap size / hit length) > overlap_tol, hits are considered to overlap. e.g: if overlap_tol is 0.0, any overlap is considered as such. e.g: if overlap_tol is 1.0, one of the hits must overlap entirely to consider that hits do
overlap. (default: 0.0)
Search Options:
-m {diamond,mmseqs,hmmer,no_search,cache,novel_fams}
diamond: search seed orthologs using diamond (-i is required). mmseqs: search seed orthologs using MMseqs2 (-i is required). hmmer: search seed orthologs using HMMER. (-i is required). no_search: skip seed orthologs search (--annotate_hits_table is required, unless --no_annot).
cache: skip seed orthologs search and annotate based on cached results (-i and -c are required).novel_fams: search against the novel families database (-i is required). (default: diamond)
Search filtering common options:
--pident PIDENT Report only alignments above or equal to the given percentage of identity (0-100).No effect if -m hmmer. (default: None)
--query_cover QUERY_COVER
Report only alignments above or equal the given percentage of query cover (0-100).No effect if -m hmmer. (default: None)
--subject_cover SUBJECT_COVER
Report only alignments above or equal the given percentage of subject cover (0-100).No effect if -m hmmer. (default: None)
--evalue EVALUE Report only alignments below or equal the e-value threshold. (default: 0.001)
--score SCORE Report only alignments above or equal the score threshold. (default: None)
Diamond Search Options:
--dmnd_algo {auto,0,1,ctg}
Diamond's --algo option, which can be tuned to search small query sets. By default, it is adjusted automatically. However, the ctg option should be activated manually. If you plan to search a small input set of sequences, use --dmnd_algo ctg to make it faster. (default: auto)
--dmnd_db DMND_DB_FILE
Path to DIAMOND-compatible database (default: None)
--sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}
Diamond's sensitivity mode. Note that emapper's default is sensitive, which is different from diamond's default, which can be activated with --sensmode default. (default: sensitive)
--dmnd_iterate {yes,no}
--dmnd_iterate yes --> activates the --iterate option of diamond for iterative searches, from faster, less sensitive modes, up to the sensitivity specified with --sensmode. Available since diamond 2.0.11. --dmnd_iterate no --> disables the --iterate mode. (default: yes)
--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}
Scoring matrix (default: None)
--dmnd_frameshift DMND_FRAMESHIFT
Diamond --frameshift/-F option. Not used by default. Recommended by diamond: 15. (default: None)
--gapopen GAPOPEN Gap open penalty (default: None)
--gapextend GAPEXTEND
Gap extend penalty (default: None)
--block_size BLOCK_SIZE
Diamond -b/--block-size option. Default is the diamond's default. (default: None)
--index_chunks CHUNKS
Diamond -c/--index-chunks option. Default is the diamond's default. (default: None)
--outfmt_short Diamond output will include only qseqid sseqid evalue and score. This could help obtain better performance, if also no --pident, --query_cover or --subject_cover thresholds are used. This option is ignored when the diamond search is run in blastx mode for gene prediction (see
--genepred). (default: False)
--dmnd_ignore_warnings
Diamond --ignore-warnings option. It avoids Diamond stopping due to warnings (e.g. when a protein contains only ATGC symbols. (default: False)
MMseqs2 Search Options:
--mmseqs_db MMSEQS_DB_FILE
Path to MMseqs2-compatible database (default: None)
--start_sens START_SENS
Starting sensitivity. (default: 3)
--sens_steps SENS_STEPS
Number of sensitivity steps. (default: 3)
--final_sens FINAL_SENS
Final sensititivy step. (default: 7)
--mmseqs_sub_mat SUBS_MATRIX
Matrix to be used for --sub-mat MMseqs2 search option. Default=default used by MMseqs2 (default: None)
HMMER Search Options:
-d HMMER_DB_PREFIX, --database HMMER_DB_PREFIX
specify the target database for sequence searches. Choose among: euk,bact,arch, or a database loaded in a server, db.hmm:host:port (see hmm_server.py) (default: None)
--servers_list FILE A FILE with a list of remote hmmpgmd servers. Each row in the file represents a server, in the format 'host:port'. If --servers_list is specified, host and port from -d option will be ignored. (default: None)
--qtype {hmm,seq} Type of input data (-i). (default: seq)
--dbtype {hmmdb,seqdb}
Type of data in DB (-db). (default: hmmdb)
--usemem Use this option to allocate the whole database (-d) in memory using hmmpgmd. If --dbtype hmm, the database must be a hmmpress-ed database. If --dbtype seqdb, the database must be a HMMER-format database created with esl-reformat. Database will be unloaded after execution. Note that
this only works for HMMER based searches. To load the eggnog-mapper annotation DB into memory use --dbmem. (default: False)
-p PORT, --port PORT Port used to setup HMM server, when --usemem. Also used for --pfam_realign modes. (default: 51700)
--end_port PORT Last port to be used to setup HMM server, when --usemem. Also used for --pfam_realign modes. (default: 53200)
--num_servers NUM_SERVERS
When using --usemem, specify the number of servers to fire up.Note that cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign modes. (default: 1)
--num_workers NUM_WORKERS
When using --usemem, specify the number of workers per server (--num_servers) to fire up. By default, cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign modes. (default: 1)
--timeout_load_server TIMEOUT_LOAD_SERVER
Number of attempts to load a server on a specific port. If failed, the next numerical port will be tried. (default: 10)
--hmm_maxhits MAXHITS
Max number of hits to report (0 to report all). (default: 1)
--report_no_hits Whether queries without hits should be included in the output table. (default: False)
--hmm_maxseqlen MAXSEQLEN
Ignore query sequences larger than `maxseqlen`. (default: 5000)
--Z DB_SIZE Fixed database size used in phmmer/hmmscan (allows comparing e-values among databases). (default: 40000000)
--cut_ga Adds the --cut_ga to hmmer commands (useful for Pfam mappings, for example). See hmmer documentation. (default: False)
--clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans
Removes those hits which overlap, keeping only the one with best evalue. Use the "all" and "clans" options when performing a hmmscan type search (i.e. domains are in the database). Use the "hmmsearch_all" and "hmmsearch_clans" options when using a hmmsearch type search (i.e.
domains are the queries from -i file). The "clans" and "hmmsearch_clans" and options will only have effect for hits to/from Pfam. (default: None)
Annotation Options:
--no_annot Skip functional annotation, reporting only hits. (default: False)
--dbmem Use this option to allocate the whole eggnog.db DB in memory. Database will be unloaded after execution. (default: False)
--seed_ortholog_evalue MIN_E-VALUE
Min E-value expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. (default: 0.001)
--seed_ortholog_score MIN_SCORE
Min bit score expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. (default: None)
--tax_scope TAX_SCOPE
Fix the taxonomic scope used for annotation, so only speciation events from a particular clade are used for functional transfer. More specifically, the --tax_scope list is intersected with the seed orthologs clades, and the resulting clades are used for annotation based on
--tax_scope_mode. Note that those seed orthologs without clades intersecting with --tax_scope will be filtered out, and won't annotated. Possible arguments for --tax_scope are: 1) A path to a file defined by the user, which contains a list of tax IDs and/or tax names. 2) The name
of a pre-configured tax scope, whose source is a file stored within the 'eggnogmapper/annotation/tax_scopes/' directory By default, available ones are: 'auto' ('all'), 'auto_broad' ('all_broad'), 'all_narrow', 'archaea', 'bacteria', 'bacteria_broad', 'eukaryota', 'eukaryota_broad'
and 'prokaryota_broad'.3) A comma-separated list of taxonomic names and/or taxonomic IDs, sorted by preference. An example of list of tax IDs would be 2759,2157,2,1 for Eukaryota, Archaea, Bacteria and root, in that order of preference. 4) 'none': do not filter out annotations
based on taxonomic scope. (default: auto)
--tax_scope_mode TAX_SCOPE_MODE
For a seed ortholog which passed the filter imposed by --tax_scope, --tax_scope_mode controls which specific clade, to which the seed ortholog belongs, will be used for annotation. Options: 1) broadest: use the broadest clade. 2) inner_broadest: use the broadest clade from the
intersection with --tax_scope. 3) inner_narrowest: use the narrowest clade from the intersection with --tax_scope. 4) narrowest: use the narrowest clade. 5) A taxonomic scope as in --tax_scope: use this second list to intersect with seed ortholog clades and use the narrowest (as in
inner_narrowest) from the intersection to annotate. (default: inner_narrowest)
--target_orthologs {one2one,many2one,one2many,many2many,all}
defines what type of orthologs (in relation to the seed ortholog) should be used for functional transfer (default: all)
--target_taxa LIST_OF_TAX_IDS
Only orthologs from the specified comma-separated list of taxa and all its descendants will be used for annotation transference. By default, all taxa are used. (default: None)
--excluded_taxa LIST_OF_TAX_IDS
Orthologs from the specified comma-separated list of taxa and all its descendants will not be used for annotation transference. By default, no taxa is excluded. (default: None)
--report_orthologs Output the list of orthologs found for each query to a .orthologs file (default: False)
--go_evidence {experimental,non-electronic,all}
Defines what type of GO terms should be used for annotation. experimental = Use only terms inferred from experimental evidence. non-electronic = Use only non-electronically curated terms (default: non-electronic)
--pfam_realign {none,realign,denovo}
Realign the queries to the PFAM domains. none = no realignment is performed. PFAM annotation will be that transferred as specify in the --pfam_transfer option. realign = queries will be realigned to the PFAM domains found according to the --pfam_transfer option. denovo = queries
will be realigned to the whole PFAM database, ignoring the --pfam_transfer option. Check hmmer options (--num_servers, --num_workers, --port, --end_port) to change how the hmmpgmd server is run. (default: none)
--md5 Adds the md5 hash of each query as an additional field in annotations output files. (default: False)
Output options:
--output FILE_PREFIX, -o FILE_PREFIX
base name for output files (default: None)
--output_dir DIR Where output files should be written (default: /media/kazu/8TB7/eggNOG_DB)
--scratch_dir DIR Write output files in a temporary scratch dir, move them to the final output dir when finished. Speed up large computations using network file systems. (default: None)
--temp_dir DIR Where temporary files are created. Better if this is a local disk. (default: /media/kazu/8TB7/eggNOG_DB)
--no_file_comments No header lines nor stats are included in the output files (default: False)
--decorate_gff DECORATE_GFF
Add search hits and/or annotation results to GFF file from gene prediction of a user specified one. no = no GFF decoration at all. GFF file from blastx-based gene prediction will be created anyway. yes = add search hits and/or annotations to GFF file from Prodigal or blastx-based
gene prediction. FILE = decorate the specified pre-existing GFF FILE. e.g. --decorage_gff myfile.gff You change the field interpreted as ID of the feature with --decorate_gff_ID_field. (default: no)
--decorate_gff_ID_field DECORATE_GFF_ID_FIELD
Change the field used in GFF files as ID of the feature. (default: ID)
--excel Output annotations also in .xlsx format. (default: False)
$ download_eggnog_data.py -h
usage: download_eggnog_data.py [-h] [-D] [-F] [-P] [-M] [-H] [-d HMMER_DBS] [--dbname DBNAME] [-y] [-f] [-s] [-q] [--data_dir]
optional arguments:
-h, --help show this help message and exit
-D Do not install the diamond database (default: False)
-F Install the novel families diamond and annotation databases, required for "emapper.py -m novel_fams" (default: False)
-P Install the Pfam database, required for de novo annotation or realignment (default: False)
-M Install the MMseqs2 database, required for "emapper.py -m mmseqs" (default: False)
-H Install the HMMER database specified with "-d TAXID". Required for "emapper.py -m hmmer -d TAXID" (default: False)
-d HMMER_DBS Tax ID of eggNOG HMM database to download. e.g. "-H -d 2" for Bacteria. Required if "-H". Available tax IDs can be found at http://eggnog5.embl.de/#/app/downloads. (default: None)
--dbname DBNAME Tax ID of eggNOG HMM database to download. e.g. "-H -d 2 --dbname 'Bacteria'" to download Bacteria (taxid 2) to a directory named Bacteria (default: None)
-y assume "yes" to all questions (default: False)
-f forces download even if the files exist (default: False)
-s simulate and print commands. Nothing is downloaded (default: False)
-q quiet_mode (default: False)
--data_dir Directory to use for DATA_PATH. (default: None)
Databases
Set the directory where the databases will be stored, then run the download script.
export EGGNOG_DATA_DIR=<path>/<to>/DB
download_eggnog_data.py
- -D Do not install the diamond database (default: False)
- -F Install the novel families diamond and annotation databases, required for "emapper.py -m novel_fams" (default: False)
- -P Install the Pfam database, required for de novo annotation or realignment (default: False)
- -M Install the MMseqs2 database, required for "emapper.py -m mmseqs" (default: False)
- -H Install the HMMER database specified with "-d TAXID". Required for "emapper.py -m hmmer -d TAXID" (default: False)
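The download proceeds through interactive yes/no prompts (the -y option listed in the help above answers "yes" to all of them).
It took about two hours.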
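The database setup above can also be scripted without prompts. The following is only a sketch: the directory `eggnog_db` under the current directory is a hypothetical location, and the `-s` flag is kept so that the script merely prints what it would do; remove `-s` to actually download.

```shell
# Hypothetical target directory; point this at a disk with enough free space.
export EGGNOG_DATA_DIR="${EGGNOG_DATA_DIR:-$PWD/eggnog_db}"
mkdir -p "$EGGNOG_DATA_DIR"

if command -v download_eggnog_data.py >/dev/null 2>&1; then
    # -s: simulate and print commands only (nothing is downloaded)
    # -y: assume "yes" to all questions
    download_eggnog_data.py -s -y --data_dir "$EGGNOG_DATA_DIR"
else
    echo "download_eggnog_data.py not found; activate the eggnog-mapper environment first" >&2
fi
```

Exporting `EGGNOG_DATA_DIR` in the same shell (or in your shell startup file) lets `emapper.py` find the databases later without passing `--data_dir` each time.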
Databases obtained with the default settings (no options):
> ls -lh
Usage
Specify the FASTA file containing the sequences you want to annotate.
mkdir outdir
emapper.py -i input.fasta -o outprefix --output_dir outdir --cpu 0 --itype proteins -m diamond
- -i Input FASTA file containing query sequences (proteins by default; see --itype and --translate). Required unless -m no_search. (default: None)
- --itype {CDS,proteins,genome,metagenome} Type of data in the input (-i) file. (default: proteins)
- --translate When --itype CDS, translate CDS to proteins before search. When --itype genome/metagenome and --genepred search, translate predicted CDS from blastx hits to proteins. (default: False)
- --cpu Number of CPUs to be used. --cpu 0 to run with all available CPUs. (default: 1)
- --output base name for output files (default: None)
- --output_dir Where output files should be written (default: current working directory)
- -m {diamond,mmseqs,hmmer,no_search,cache,novel_fams} diamond: search seed orthologs using diamond (-i is required). mmseqs: search seed orthologs using MMseqs2 (-i is required). hmmer: search seed orthologs using HMMER. (-i is required). no_search: skip seed orthologs search (--annotate_hits_table is required, unless --no_annot). cache: skip seed orthologs search and annotate based on cached results (-i and -c are required).novel_fams: search against the novel families database (-i is required). (default: diamond)
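Because a local run has no sequence limit, the same command can be looped over several inputs. This is a minimal sketch that assumes a hypothetical layout with one protein FASTA file per sample under `fasta/`; the directory names and the `prefix_of` helper are made up for illustration.

```shell
# Hypothetical layout: fasta/<sample>.fasta, results collected in outdir/.
in_dir="fasta"
out_dir="outdir"
mkdir -p "$out_dir"

# Derive an output prefix from a FASTA path: fasta/sampleA.fasta -> sampleA
prefix_of() {
    basename "$1" .fasta
}

for f in "$in_dir"/*.fasta; do
    [ -e "$f" ] || continue    # skip when the glob matches nothing
    if ! command -v emapper.py >/dev/null 2>&1; then
        echo "emapper.py not found; activate the eggnog-mapper environment first" >&2
        break
    fi
    # Same options as the single-file command above, one run per input file.
    emapper.py -i "$f" -o "$(prefix_of "$f")" --output_dir "$out_dir" \
        --cpu 0 --itype proteins -m diamond
done
```

By default emapper.py aborts when output files with the same prefix already exist, so rerunning the loop needs either a fresh output directory or the `--override` option described in the help.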
Example output
outprefix.emapper.annotations.tsv
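As a quick sanity check on the result, the number of annotated queries can be counted from the annotations table. The sketch below assumes that header and statistics lines start with `#` (the lines that `--no_file_comments` suppresses) and that every other non-empty line is one annotated query; the demo file is made-up data, not real emapper output.

```shell
# Count annotated queries in an emapper annotations file.
# Assumption: lines starting with "#" are header/statistics lines.
count_annotated() {
    awk -F'\t' '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Made-up miniature file, used only to demonstrate the function.
demo="$(mktemp)"
printf '## made-up comment line\n#query\tseed_ortholog\nprotA\t1.XYZ\nprotB\t2.ABC\n## made-up footer\n' > "$demo"
count_annotated "$demo"    # prints 2
rm -f "$demo"
```

For a real run, call it on the file produced in the output directory, for example `count_annotated outdir/outprefix.emapper.annotations.tsv`, adjusting the name to whatever your run actually wrote.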
Citation
Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper
Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, Bork P
Mol Biol Evol. 2017 Aug 1;34(8):2115-2122
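* I ran into the error "Error running diamond: Error: Invalid option: iterate".
The cause was an outdated diamond: it could not read the downloaded diamond DB, and the --iterate option named in the error is only available since diamond 2.0.11 according to the emapper.py help. I downloaded the latest diamond binary (which uses the libraries on the user's system) as shown below and added it to PATH.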
# downloading the tool
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.11/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
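Adding the extracted binary to PATH and confirming that the new version is picked up could look like the following. This is a sketch under the assumption that the tarball was extracted in the current directory and left a `diamond` executable there; `DIAMOND_DIR` and `version_ge` are names introduced here for illustration, and `sort -V` requires GNU sort.

```shell
# Assumption: "diamond" was extracted into the current directory.
DIAMOND_DIR="$PWD"
export PATH="$DIAMOND_DIR:$PATH"

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v diamond >/dev/null 2>&1; then
    # "diamond version" prints a line such as "diamond version 2.1.11"
    found="$(diamond version | awk '{print $3}')"
    if version_ge "$found" 2.0.11; then
        echo "diamond $found supports --iterate"
    else
        echo "diamond $found is older than 2.0.11 and lacks --iterate" >&2
    fi
fi
```

To make the change permanent, put the `export PATH=...` line in your shell startup file. Running `emapper.py -v` afterwards should report the new version under "Diamond version found".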
Related