macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

eggNOG-Mapperをローカルで実行する

 

タイトルの通りです。計算機サーバーにアクセスできないことが増えてきたので、ローカルで実行する手順を確認します。計算機は用意する必要があるものの、ローカルで実行すれば、パラメータの細かい調整ができるほか、10万配列とかの制限がないのでより自由に実行することができます。

 

Documentation

eggNOG mapper v2.1.5 to v2.1.12 · eggnogdb/eggnog-mapper Wiki · GitHub

 

インストール

Github

mamba create -n eggnog-mapper python=3
conda activate eggnog-mapper
#conda
mamba install -c bioconda -c conda-forge eggnog-mapper -y

#pip
pip install eggnog-mapper

$ emapper.py -v #バージョン

$  emapper.py -v 

emapper-2.1.12 / Expected eggNOG DB version: 5.0.2 / Installed eggNOG DB version: 5.0.2 / Diamond version found: diamond version 2.1.11 / MMseqs2 version found: 113e3212c137d026e297c7540e1fcd039f6812b1 / Compatible novel families DB version: 1.0.1

> emapper.py -h

$ emapper.py -h

usage: emapper.py [-h] [-v] [--list_taxa] [--cpu NUM_CPU] [--mp_start_method {fork,spawn,forkserver}] [--resume] [--override] [-i FASTA_FILE] [--itype {CDS,proteins,genome,metagenome}] [--translate] [--annotate_hits_table SEED_ORTHOLOGS_FILE] [-c FILE] [--data_dir DIR] [--genepred {search,prodigal}]

                  [--trans_table TRANS_TABLE_CODE] [--training_genome FILE] [--training_file FILE] [--allow_overlaps {none,strand,diff_frame,all}] [--overlap_tol FLOAT] [-m {diamond,mmseqs,hmmer,no_search,cache,novel_fams}] [--pident PIDENT] [--query_cover QUERY_COVER] [--subject_cover SUBJECT_COVER]

                  [--evalue EVALUE] [--score SCORE] [--dmnd_algo {auto,0,1,ctg}] [--dmnd_db DMND_DB_FILE] [--sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}] [--dmnd_iterate {yes,no}] [--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]

                  [--dmnd_frameshift DMND_FRAMESHIFT] [--gapopen GAPOPEN] [--gapextend GAPEXTEND] [--block_size BLOCK_SIZE] [--index_chunks CHUNKS] [--outfmt_short] [--dmnd_ignore_warnings] [--mmseqs_db MMSEQS_DB_FILE] [--start_sens START_SENS] [--sens_steps SENS_STEPS] [--final_sens FINAL_SENS]

                  [--mmseqs_sub_mat SUBS_MATRIX] [-d HMMER_DB_PREFIX] [--servers_list FILE] [--qtype {hmm,seq}] [--dbtype {hmmdb,seqdb}] [--usemem] [-p PORT] [--end_port PORT] [--num_servers NUM_SERVERS] [--num_workers NUM_WORKERS] [--timeout_load_server TIMEOUT_LOAD_SERVER] [--hmm_maxhits MAXHITS]

                  [--report_no_hits] [--hmm_maxseqlen MAXSEQLEN] [--Z DB_SIZE] [--cut_ga] [--clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans] [--no_annot] [--dbmem] [--seed_ortholog_evalue MIN_E-VALUE] [--seed_ortholog_score MIN_SCORE] [--tax_scope TAX_SCOPE] [--tax_scope_mode TAX_SCOPE_MODE]

                  [--target_orthologs {one2one,many2one,one2many,many2many,all}] [--target_taxa LIST_OF_TAX_IDS] [--excluded_taxa LIST_OF_TAX_IDS] [--report_orthologs] [--go_evidence {experimental,non-electronic,all}] [--pfam_realign {none,realign,denovo}] [--md5] [--output FILE_PREFIX] [--output_dir DIR]

                  [--scratch_dir DIR] [--temp_dir DIR] [--no_file_comments] [--decorate_gff DECORATE_GFF] [--decorate_gff_ID_field DECORATE_GFF_ID_FIELD] [--excel]

 

optional arguments:

  -h, --help            show this help message and exit

  -v, --version         show version and exit. (default: False)

  --list_taxa           List taxa available for --tax_scope/--tax_scope_mode, and exit (default: False)

 

Execution Options:

  --cpu NUM_CPU         Number of CPUs to be used. --cpu 0 to run with all available CPUs. (default: 1)

  --mp_start_method {fork,spawn,forkserver}

                        Sets the python multiprocessing start method. Check https://docs.python.org/3/library/multiprocessing.html. Only use if the default method is not working properly in your OS. (default: spawn)

  --resume              Resumes a previous emapper run, skipping results in existing output files. (default: False)

  --override            Overwrites output files if they exist. By default, execution is aborted if conflicting files are detected. (default: False)

 

Input Data Options:

  -i FASTA_FILE         Input FASTA file containing query sequences (proteins by default; see --itype and --translate). Required unless -m no_search. (default: None)

  --itype {CDS,proteins,genome,metagenome}

                        Type of data in the input (-i) file. (default: proteins)

  --translate           When --itype CDS, translate CDS to proteins before search. When --itype genome/metagenome and --genepred search, translate predicted CDS from blastx hits to proteins. (default: False)

  --annotate_hits_table SEED_ORTHOLOGS_FILE

                        Annotate TSV formatted table with 4 fields: query, hit, evalue, score. Usually, a .seed_orthologs file from a previous emapper.py run. Requires -m no_search. (default: None)

  -c FILE, --cache FILE

                        File containing annotations and md5 hashes of queries, to be used as cache. Required if -m cache (default: None)

  --data_dir DIR        Path to eggnog-mapper databases. By default, "data/" or the path specified in the environment variable EGGNOG_DATA_DIR. (default: None)

 

Gene Prediction Options:

  --genepred {search,prodigal}

                        This is applied when --itype genome or --itype metagenome. search: gene prediction is inferred from Diamond/MMseqs2 blastx hits. prodigal: gene prediction is performed using Prodigal. (default: search)

  --trans_table TRANS_TABLE_CODE

                        This option will be used for Prodigal, Diamond or MMseqs2, depending on the mode. For Diamond searches, this option corresponds to the --query-gencode option. For MMseqs2 searches, this option corresponds to the --translation-table option. For Prodigal, this option corresponds to

                        the -g/--trans_table option. It is also used when --translate, check https://biopython.org/docs/1.75/api/Bio.Seq.html#Bio.Seq.Seq.translate. Default is the corresponding programs defaults. (default: None)

  --training_genome FILE

                        A genome to run Prodigal in training mode. If this parameter is used, Prodigal will run in two steps: firstly in training mode, and secondly using the training to analize the emapper input data. See Prodigal documentation about Traning mode for more info. Only used if --genepred

                        prodigal. (default: None)

  --training_file FILE  A training file from Prodigal training mode. If this parameter is used, Prodigal will run using this training file to analyze the emapper input data. Only used if --genepred prodigal. (default: None)

  --allow_overlaps {none,strand,diff_frame,all}

                        When using 'blastx'-based genepred (--genepred search --itype genome/metagenome) this option controls whether overlapping hits are reported or not, or if only those overlapping hits in a different strand or frame are reported. Also, the degree of accepted overlap can be controlled

                        with --overlap_tol. (default: none)

  --overlap_tol FLOAT   This value (0-1) is the proportion such that if (overlap size / hit length) > overlap_tol, hits are considered to overlap. e.g: if overlap_tol is 0.0, any overlap is considered as such. e.g: if overlap_tol is 1.0, one of the hits must overlap entirely to consider that hits do

                        overlap. (default: 0.0)

 

Search Options:

  -m {diamond,mmseqs,hmmer,no_search,cache,novel_fams}

                        diamond: search seed orthologs using diamond (-i is required). mmseqs: search seed orthologs using MMseqs2 (-i is required). hmmer: search seed orthologs using HMMER. (-i is required). no_search: skip seed orthologs search (--annotate_hits_table is required, unless --no_annot).

                        cache: skip seed orthologs search and annotate based on cached results (-i and -c are required).novel_fams: search against the novel families database (-i is required). (default: diamond)

 

Search filtering common options:

  --pident PIDENT       Report only alignments above or equal to the given percentage of identity (0-100).No effect if -m hmmer. (default: None)

  --query_cover QUERY_COVER

                        Report only alignments above or equal the given percentage of query cover (0-100).No effect if -m hmmer. (default: None)

  --subject_cover SUBJECT_COVER

                        Report only alignments above or equal the given percentage of subject cover (0-100).No effect if -m hmmer. (default: None)

  --evalue EVALUE       Report only alignments below or equal the e-value threshold. (default: 0.001)

  --score SCORE         Report only alignments above or equal the score threshold. (default: None)

 

Diamond Search Options:

  --dmnd_algo {auto,0,1,ctg}

                        Diamond's --algo option, which can be tuned to search small query sets. By default, it is adjusted automatically. However, the ctg option should be activated manually. If you plan to search a small input set of sequences, use --dmnd_algo ctg to make it faster. (default: auto)

  --dmnd_db DMND_DB_FILE

                        Path to DIAMOND-compatible database (default: None)

  --sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}

                        Diamond's sensitivity mode. Note that emapper's default is sensitive, which is different from diamond's default, which can be activated with --sensmode default. (default: sensitive)

  --dmnd_iterate {yes,no}

                        --dmnd_iterate yes --> activates the --iterate option of diamond for iterative searches, from faster, less sensitive modes, up to the sensitivity specified with --sensmode. Available since diamond 2.0.11. --dmnd_iterate no --> disables the --iterate mode. (default: yes)

  --matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}

                        Scoring matrix (default: None)

  --dmnd_frameshift DMND_FRAMESHIFT

                        Diamond --frameshift/-F option. Not used by default. Recommended by diamond: 15. (default: None)

  --gapopen GAPOPEN     Gap open penalty (default: None)

  --gapextend GAPEXTEND

                        Gap extend penalty (default: None)

  --block_size BLOCK_SIZE

                        Diamond -b/--block-size option. Default is the diamond's default. (default: None)

  --index_chunks CHUNKS

                        Diamond -c/--index-chunks option. Default is the diamond's default. (default: None)

  --outfmt_short        Diamond output will include only qseqid sseqid evalue and score. This could help obtain better performance, if also no --pident, --query_cover or --subject_cover thresholds are used. This option is ignored when the diamond search is run in blastx mode for gene prediction (see

                        --genepred). (default: False)

  --dmnd_ignore_warnings

                        Diamond --ignore-warnings option. It avoids Diamond stopping due to warnings (e.g. when a protein contains only ATGC symbols. (default: False)

 

MMseqs2 Search Options:

  --mmseqs_db MMSEQS_DB_FILE

                        Path to MMseqs2-compatible database (default: None)

  --start_sens START_SENS

                        Starting sensitivity. (default: 3)

  --sens_steps SENS_STEPS

                        Number of sensitivity steps. (default: 3)

  --final_sens FINAL_SENS

                        Final sensititivy step. (default: 7)

  --mmseqs_sub_mat SUBS_MATRIX

                        Matrix to be used for --sub-mat MMseqs2 search option. Default=default used by MMseqs2 (default: None)

 

HMMER Search Options:

  -d HMMER_DB_PREFIX, --database HMMER_DB_PREFIX

                        specify the target database for sequence searches. Choose among: euk,bact,arch, or a database loaded in a server, db.hmm:host:port (see hmm_server.py) (default: None)

  --servers_list FILE   A FILE with a list of remote hmmpgmd servers. Each row in the file represents a server, in the format 'host:port'. If --servers_list is specified, host and port from -d option will be ignored. (default: None)

  --qtype {hmm,seq}     Type of input data (-i). (default: seq)

  --dbtype {hmmdb,seqdb}

                        Type of data in DB (-db). (default: hmmdb)

  --usemem              Use this option to allocate the whole database (-d) in memory using hmmpgmd. If --dbtype hmm, the database must be a hmmpress-ed database. If --dbtype seqdb, the database must be a HMMER-format database created with esl-reformat. Database will be unloaded after execution. Note that

                        this only works for HMMER based searches. To load the eggnog-mapper annotation DB into memory use --dbmem. (default: False)

  -p PORT, --port PORT  Port used to setup HMM server, when --usemem. Also used for --pfam_realign modes. (default: 51700)

  --end_port PORT       Last port to be used to setup HMM server, when --usemem. Also used for --pfam_realign modes. (default: 53200)

  --num_servers NUM_SERVERS

                        When using --usemem, specify the number of servers to fire up.Note that cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign modes. (default: 1)

  --num_workers NUM_WORKERS

                        When using --usemem, specify the number of workers per server (--num_servers) to fire up. By default, cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign modes. (default: 1)

  --timeout_load_server TIMEOUT_LOAD_SERVER

                        Number of attempts to load a server on a specific port. If failed, the next numerical port will be tried. (default: 10)

  --hmm_maxhits MAXHITS

                        Max number of hits to report (0 to report all). (default: 1)

  --report_no_hits      Whether queries without hits should be included in the output table. (default: False)

  --hmm_maxseqlen MAXSEQLEN

                        Ignore query sequences larger than `maxseqlen`. (default: 5000)

  --Z DB_SIZE           Fixed database size used in phmmer/hmmscan (allows comparing e-values among databases). (default: 40000000)

  --cut_ga              Adds the --cut_ga to hmmer commands (useful for Pfam mappings, for example). See hmmer documentation. (default: False)

  --clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans

                        Removes those hits which overlap, keeping only the one with best evalue. Use the "all" and "clans" options when performing a hmmscan type search (i.e. domains are in the database). Use the "hmmsearch_all" and "hmmsearch_clans" options when using a hmmsearch type search (i.e.

                        domains are the queries from -i file). The "clans" and "hmmsearch_clans" and options will only have effect for hits to/from Pfam. (default: None)

 

Annotation Options:

  --no_annot            Skip functional annotation, reporting only hits. (default: False)

  --dbmem               Use this option to allocate the whole eggnog.db DB in memory. Database will be unloaded after execution. (default: False)

  --seed_ortholog_evalue MIN_E-VALUE

                        Min E-value expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. (default: 0.001)

  --seed_ortholog_score MIN_SCORE

                        Min bit score expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. (default: None)

  --tax_scope TAX_SCOPE

                        Fix the taxonomic scope used for annotation, so only speciation events from a particular clade are used for functional transfer. More specifically, the --tax_scope list is intersected with the seed orthologs clades, and the resulting clades are used for annotation based on

                        --tax_scope_mode. Note that those seed orthologs without clades intersecting with --tax_scope will be filtered out, and won't annotated. Possible arguments for --tax_scope are: 1) A path to a file defined by the user, which contains a list of tax IDs and/or tax names. 2) The name

                        of a pre-configured tax scope, whose source is a file stored within the 'eggnogmapper/annotation/tax_scopes/' directory By default, available ones are: 'auto' ('all'), 'auto_broad' ('all_broad'), 'all_narrow', 'archaea', 'bacteria', 'bacteria_broad', 'eukaryota', 'eukaryota_broad'

                        and 'prokaryota_broad'.3) A comma-separated list of taxonomic names and/or taxonomic IDs, sorted by preference. An example of list of tax IDs would be 2759,2157,2,1 for Eukaryota, Archaea, Bacteria and root, in that order of preference. 4) 'none': do not filter out annotations

                        based on taxonomic scope. (default: auto)

  --tax_scope_mode TAX_SCOPE_MODE

                        For a seed ortholog which passed the filter imposed by --tax_scope, --tax_scope_mode controls which specific clade, to which the seed ortholog belongs, will be used for annotation. Options: 1) broadest: use the broadest clade. 2) inner_broadest: use the broadest clade from the

                        intersection with --tax_scope. 3) inner_narrowest: use the narrowest clade from the intersection with --tax_scope. 4) narrowest: use the narrowest clade. 5) A taxonomic scope as in --tax_scope: use this second list to intersect with seed ortholog clades and use the narrowest (as in

                        inner_narrowest) from the intersection to annotate. (default: inner_narrowest)

  --target_orthologs {one2one,many2one,one2many,many2many,all}

                        defines what type of orthologs (in relation to the seed ortholog) should be used for functional transfer (default: all)

  --target_taxa LIST_OF_TAX_IDS

                        Only orthologs from the specified comma-separated list of taxa and all its descendants will be used for annotation transference. By default, all taxa are used. (default: None)

  --excluded_taxa LIST_OF_TAX_IDS

                        Orthologs from the specified comma-separated list of taxa and all its descendants will not be used for annotation transference. By default, no taxa is excluded. (default: None)

  --report_orthologs    Output the list of orthologs found for each query to a .orthologs file (default: False)

  --go_evidence {experimental,non-electronic,all}

                        Defines what type of GO terms should be used for annotation. experimental = Use only terms inferred from experimental evidence. non-electronic = Use only non-electronically curated terms (default: non-electronic)

  --pfam_realign {none,realign,denovo}

                        Realign the queries to the PFAM domains. none = no realignment is performed. PFAM annotation will be that transferred as specify in the --pfam_transfer option. realign = queries will be realigned to the PFAM domains found according to the --pfam_transfer option. denovo = queries

                        will be realigned to the whole PFAM database, ignoring the --pfam_transfer option. Check hmmer options (--num_servers, --num_workers, --port, --end_port) to change how the hmmpgmd server is run. (default: none)

  --md5                 Adds the md5 hash of each query as an additional field in annotations output files. (default: False)

 

Output options:

  --output FILE_PREFIX, -o FILE_PREFIX

                        base name for output files (default: None)

  --output_dir DIR      Where output files should be written (default: /media/kazu/8TB7/eggNOG_DB)

  --scratch_dir DIR     Write output files in a temporary scratch dir, move them to the final output dir when finished. Speed up large computations using network file systems. (default: None)

  --temp_dir DIR        Where temporary files are created. Better if this is a local disk. (default: /media/kazu/8TB7/eggNOG_DB)

  --no_file_comments    No header lines nor stats are included in the output files (default: False)

  --decorate_gff DECORATE_GFF

                        Add search hits and/or annotation results to GFF file from gene prediction of a user specified one. no = no GFF decoration at all. GFF file from blastx-based gene prediction will be created anyway. yes = add search hits and/or annotations to GFF file from Prodigal or blastx-based

                        gene prediction. FILE = decorate the specified pre-existing GFF FILE. e.g. --decorage_gff myfile.gff You change the field interpreted as ID of the feature with --decorate_gff_ID_field. (default: no)

  --decorate_gff_ID_field DECORATE_GFF_ID_FIELD

                        Change the field used in GFF files as ID of the feature. (default: ID)

  --excel               Output annotations also in .xlsx format. (default: False)

> download_eggnog_data.py -h

$ download_eggnog_data.py -h

usage: download_eggnog_data.py [-h] [-D] [-F] [-P] [-M] [-H] [-d HMMER_DBS] [--dbname DBNAME] [-y] [-f] [-s] [-q] [--data_dir]

 

optional arguments:

  -h, --help       show this help message and exit

  -D               Do not install the diamond database (default: False)

  -F               Install the novel families diamond and annotation databases, required for "emapper.py -m novel_fams" (default: False)

  -P               Install the Pfam database, required for de novo annotation or realignment (default: False)

  -M               Install the MMseqs2 database, required for "emapper.py -m mmseqs" (default: False)

  -H               Install the HMMER database specified with "-d TAXID". Required for "emapper.py -m hmmer -d TAXID" (default: False)

  -d HMMER_DBS     Tax ID of eggNOG HMM database to download. e.g. "-H -d 2" for Bacteria. Required if "-H". Available tax IDs can be found at http://eggnog5.embl.de/#/app/downloads. (default: None)

  --dbname DBNAME  Tax ID of eggNOG HMM database to download. e.g. "-H -d 2 --dbname 'Bacteria'" to download Bacteria (taxid 2) to a directory named Bacteria (default: None)

  -y               assume "yes" to all questions (default: False)

  -f               forces download even if the files exist (default: False)

  -s               simulate and print commands. Nothing is downloaded (default: False)

  -q               quiet_mode (default: False)

  --data_dir       Directory to use for DATA_PATH. (default: None)

 

 

データベース

保存するディレクトリを指定してダウンロードスクリプトを実行する。

export EGGNOG_DATA_DIR=<path>/<to>/DB
download_eggnog_data.py
  • -D      Do not install the diamond database (default: False)
  • -F       Install the novel families diamond and annotation databases, required for "emapper.py -m novel_fams" (default: False)
  • -P       Install the Pfam database, required for de novo annotation or realignment (default: False)
  • -M      Install the MMseqs2 database, required for "emapper.py -m mmseqs" (default: False)
  • -H       Install the HMMER database specified with "-d TAXID". Required for "emapper.py -m hmmer -d TAXID" (default: False)

 

ダウンロードはyes|noの対話式で進める

2時間ほどかかった。

 

オプションなし、デフォルト設定で取得したDB

> ls -lh

 

実行方法

アノテーションをつけたい配列のfastaファイルを指定する。

mkdir outdir
emapper.py -i input.fasta -o outprefix --output_dir outdir --cpu 0 --itype proteins -m diamond
  • -i    Input FASTA file containing query sequences (proteins by default; see --itype and --translate). Required unless -m no_search. (default: None)
  • --itype {CDS,proteins,genome,metagenome}    Type of data in the input (-i) file. (default: proteins)
  • --translate     When --itype CDS, translate CDS to proteins before search. When --itype genome/metagenome and --genepred search, translate predicted CDS from blastx hits to proteins. (default: False)
  • --cpu    Number of CPUs to be used. --cpu 0 to run with all available CPUs. (default: 1)
  • --output    base name for output files (default: None)
  • --output_dir   Where output files should be written (default: current PATH)
  •  -m {diamond,mmseqs,hmmer,no_search,cache,novel_fams}    diamond: search seed orthologs using diamond (-i is required). mmseqs: search seed orthologs using MMseqs2 (-i is required). hmmer: search seed orthologs using HMMER. (-i is required). no_search: skip seed orthologs search (--annotate_hits_table is required, unless --no_annot).  cache: skip seed orthologs search and annotate based on cached results (-i and -c are required).novel_fams: search against the novel families database (-i is required). (default: diamond)

 

出力例

outprefix.emapper.annotations.tsv



 

引用
Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper

Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, Bork P

Mol Biol Evol. 2017 Aug 1;34(8):2115-2122

 

* Error running diamond: Error: Invalid option: iterateが出た。

diamondが古くダウンロードしたdiamond DBが読み取れないのが原因なので、以下の通り最新のdiamond binary(ユーザーのシステム上のライブラリを使用)をダウンロードしてパスを通した。

# downloading the tool
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.11/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz

 

関連