EnrichMは、大規模なメタゲノム・アセンブリゲノム(MAG)を対象とした比較ゲノミクスツール群である。現在の機能は以下の通りである。
- MAGの基本的なアノテーションパイプライン。
- KEGGモジュールを参考にして、MAGがコード化している代謝パスウェイを決定するパイプライン(ただし、カスタムパスウェイの指定も可能
- ユーザーが定義したゲノムグループ(機能的に関連するゲノム、系統的に関連するゲノム、異なる環境から回収されたゲノムなど)内およびグループ間で濃縮されている遺伝子や代謝パスウェイを同定するパイプライン。
- アノテーションされた集団ゲノムから代謝ネットワークを構築する。
- MAG、メタゲノム、トランスクリプトームのいずれかの機能組成から、ランダムフォレスト機械学習モデルを構築する。
- これらのランダムフォレスト機械学習モデルを、新しいMAGsメタゲノムの分類に適用する。
EnrichMは現在開発中のため、マスターが安定しているという保証はない。EnrichMはpypiまたはconda(下記参照)を使ってダウンロードすることを勧める。
help(現在は情報なし)
https://github.com/geronimp/enrichM/wiki/
インストール
condaでpython3.7の環境を作って、pipも使って導入した。
依存
EnrichM is written in python 3, and required >v3.6 to run.
- hmmer >= 3.1b
- diamond == 0.9.22
- prodigal >= 2.6.3
- parallel >= 20180222
- mmseqs >= 2-23394
- R >= 3.0.1
- mcl >= 14-137
#pip (pypi)
pip3 install enrichm
#bioconda (link)
mamba install -c bioconda -y enrichm
#ここではcondaで環境を作ってpipとcondaを使い分けて入れる
mamba create -n enrichm python=3.7 -y
conda activate enrichm
pip3 install enrichm
mamba install -c conda-forge parallel -y
mamba install -c bioconda -y hmmer
mamba install -c conda-forge -c bioconda mmseqs2 -y
mamba install -c bioconda prodigal -y
mamba install -c bioconda orthofinder -y
#fuzzywuzzyとIPythonがなかったので追加で導入
pip install fuzzywuzzy IPython
> enrichm
_____ _ _ __ __
| ____|_ __ _ __(_) ___| |__ | \/ |
| _| | '_ \| '__| |/ __| '_ \| |\/| |
| |___| | | | | | | (__| | | | | | |
|_____|_| |_|_| |_|\___|_| |_|_| |_|
------------------------------------------------------------------------------------
Annotation
annotate -> Genome annotation pipeline.
Enrichment analysis
classify -> Determine what pathways a genome encodes.
enrichment -> Calculate enrichment of functional genes between groups.
Network analysis
pathway -> Generate a metabolic network from specific KEGG module or
compounds.
explore -> Take steps into metabolism using a KEGG compound ID as a
starting point. Useful to see which pathways use a compound
of interest.
Machine learning
generate -> Generate a random forest model.
predict -> Run random forest model on new data.
Authors: Joel Boyd, Ben Woodcroft, Alex Baker
Version: 0.6.3
> enrichm annotate
# enrichm annotate
/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
usage: enrichm annotate [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] [--genome_files GENOME_FILES [GENOME_FILES ...]] [--genome_directory GENOME_DIRECTORY] [--protein_files PROTEIN_FILES [PROTEIN_FILES ...]] [--protein_directory PROTEIN_DIRECTORY] [--ko]
[--ko_hmm] [--pfam] [--tigrfam] [--clusters] [--orthologs] [--orthogroup] [--cazy] [--ec] [--cut_ga_pfam] [--cut_nc_pfam] [--cut_tc_pfam] [--cut_ga_tigrfam] [--cut_nc_tigrfam] [--cut_tc_tigrfam] [--cut_ko] [--evalue EVALUE] [--bit BIT] [--id ID]
[--aln_query ALN_QUERY] [--aln_reference ALN_REFERENCE] [--c C] [--threads THREADS] [--parallel PARALLEL] [--inflation INFLATION] [--suffix SUFFIX] [--light] [--count_domains] [--chunk_number CHUNK_NUMBER] [--chunk_max CHUNK_MAX]
optional arguments:
-h, --help show this help message and exit
Logging options:
--log LOG Output logging information to this file.
--verbosity VERBOSITY
Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
Output options:
--output OUTPUT Output directory
--force Overwrite previous run
Input options (Use one):
--genome_files GENOME_FILES [GENOME_FILES ...]
Space separated list of genomes to annotate
--genome_directory GENOME_DIRECTORY
Directory containing genomes to annotate
--protein_files PROTEIN_FILES [PROTEIN_FILES ...]
Space separated list of .faa files of genomes to annotate. Protein files need to be generated by prodigal.
--protein_directory PROTEIN_DIRECTORY
Directory containing .faa files of genomes to annotate. Protein files need to be generated by prodigal.
Annotations options:
--ko Annotate with KO ids
--ko_hmm Annotate with KO ids
--pfam Annotate with Pfam HMMs
--tigrfam Annotate with TIGRFAM HMMs
--clusters Annotate with cluster ids
--orthologs Annotate with ortholog ids
--orthogroup Annotate with orthogroup ids with OrthoFinder
--cazy Annotate with dbCAN HMMs
--ec Annotate with EC ids
Cutoff options:
--cut_ga_pfam For PFAM searches: use profiles GA gathering cutoffs to set all thresholding
--cut_nc_pfam For PFAM searches: use profiles NC noise cutoffs to set all thresholding
--cut_tc_pfam For PFAM searches: use profiles TC trusted cutoffs to set all thresholding
--cut_ga_tigrfam For TIGRfam searches: use profiles GA gathering cutoffs to set all thresholding
--cut_nc_tigrfam For TIGRfam searches: use profiles NC noise cutoffs to set all thresholding
--cut_tc_tigrfam For TIGRfam searches: use profiles TC trusted cutoffs to set all thresholding
--cut_ko For KO HMM annotation searches: use cutoffs defined by KEGG to maximise F-score.
--evalue EVALUE Use this evalue cutoff to filter false positives (default: 1e-05)
--bit BIT Use this bit score cutoff to filter false positives (default: 0)
--id ID Use this percent identity cutoff to filter false positives (default: 0.3)
--aln_query ALN_QUERY
This fraction of the query must align to the reference (default: 0.7)
--aln_reference ALN_REFERENCE
This fraction of the reference must align to the query (default: 0.7)
--c C When clustering, use matches above this fraction of aligned (covered) query and target residues (default: 0.8)
Runtime options:
--threads THREADS Use this number of threads when annotating with BLAST and HMMsearch (default: 1)
--parallel PARALLEL Run this number of jobs in parallel when annotating with HMMsearch (default: 5)
--inflation INFLATION
Inflation factor to use when calling clusters in ortholog (default = 1.5)
--suffix SUFFIX Treat files ending with this suffix within the --genome_directory as genomes (default: .fna for --genome_directory and .faa for )
--light Don't store metadata for genome files (can't use enrichM compare downstream, default=False)
--count_domains Fill the frequency matrix with the total number of times an annotation was detected (for example, when one domain more than once within a protein), rather than the count of proteins with with that annotation
--chunk_number CHUNK_NUMBER
Split loading of genomes into this many chunks (default = 4)
--chunk_max CHUNK_MAX
Maximum number of genomes to load per chunk (default = 2500)
None
> enrichm classify
# enrichm classify
/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
usage: enrichm classify [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] [--genome_and_annotation_matrix GENOME_AND_ANNOTATION_MATRIX] [--custom_modules CUSTOM_MODULES] [--module_rules_json MODULE_RULES_JSON] [--gff_files GFF_FILES] [--cutoff CUTOFF] [--aggregate]
optional arguments:
-h, --help show this help message and exit
Logging options:
--log LOG Output logging information to this file.
--verbosity VERBOSITY
Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
Output options:
--output OUTPUT Output directory
--force Overwrite previous run
Input options:
--genome_and_annotation_matrix GENOME_AND_ANNOTATION_MATRIX
Path to file containing a genome annotation matrix
--custom_modules CUSTOM_MODULES
Tab separated file containing module name, definition as the columns
--module_rules_json MODULE_RULES_JSON
json file specifying rules to interpret the annotation and guide module annotation
--gff_files GFF_FILES
GFF files for the genomes being classified.
Cutoff options:
--cutoff CUTOFF Output only modules with greater than this percent of the requied KO groups (default = print all modules)
Runtime options:
--aggregate Calculate the abundance of each pathway within each genome/sample (column)
None
> enrichm enrichment
# enrichm enrichment
/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
usage: enrichm enrichment [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] [--annotate_output ANNOTATE_OUTPUT] [--metadata METADATA] [--annotation_matrix ANNOTATION_MATRIX] [--gff_files GFF_FILES [GFF_FILES ...]] [--abundance ABUNDANCE]
[--abundance_metadata ABUNDANCE_METADATA] [--transcriptome TRANSCRIPTOME] [--transcriptome_metadata TRANSCRIPTOME_METADATA] [--batchfile BATCHFILE] [--pval_cutoff PVAL_CUTOFF] [--proportions_cutoff PROPORTIONS_CUTOFF] [--threshold THRESHOLD]
[--multi_test_correction MULTI_TEST_CORRECTION] [--processes PROCESSES] [--allow_negative_values] [--ko] [--ko_hmm] [--pfam] [--tigrfam] [--cluster] [--ortholog] [--cazy] [--ec] [--range RANGE] [--subblock_size SUBBLOCK_SIZE]
[--operon_mismatch_cutoff OPERON_MISMATCH_CUTOFF] [--operon_match_score_cutoff OPERON_MATCH_SCORE_CUTOFF]
optional arguments:
-h, --help show this help message and exit
Logging options:
--log LOG Output logging information to this file.
--verbosity VERBOSITY
Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
Output options:
--output OUTPUT Output directory
--force Overwrite previous run
Input options:
--annotate_output ANNOTATE_OUTPUT
Output directory provided by enrichm annotate
--metadata METADATA Metadata file with two columns, the first with the genome name, the second with the groupings to compare.
--annotation_matrix ANNOTATION_MATRIX
Annotation matrix to compare.
--gff_files GFF_FILES [GFF_FILES ...]
Gff files for genomes to compare.
--abundance ABUNDANCE
Genome abundance matrix.
--abundance_metadata ABUNDANCE_METADATA
Metadata grouping abundance samples.
--transcriptome TRANSCRIPTOME
Genome abundance matrix.
--transcriptome_metadata TRANSCRIPTOME_METADATA
Metadata grouping abundance samples.
Genome Taxonomy DataBase (GTDB) options:
--batchfile BATCHFILE
metadata file to compare with.
Runtime options:
--pval_cutoff PVAL_CUTOFF
Only output results with a p-value below a this cutoff (default=0.05).
--proportions_cutoff PROPORTIONS_CUTOFF
Proportion enrichment cutoff.
--threshold THRESHOLD
The threshold to control for in false discovery rate of familywise error rate.
--multi_test_correction MULTI_TEST_CORRECTION
The form of mutiple test correction to use. Uses the statsmodel module and consequently has all of its options.
Default: Benjamini-Hochberg FDR (fdr_bh)
Options: Bonferroni (b)
Sidak (s)
Holm (h)
Holm-Sidak (hs)
Simes-Hochberg (sh)
Hommel (ho)
FDR Benjamini-Yekutieli (fdr_by)
FDR 2-stage Benjamini-Hochberg (fdr_tsbh)
FDR 2-stage Benjamini-Krieger-Yekutieli (fdr_tsbky)
FDR adaptive Gavrilov-Benjamini-Sarkar (fdr_gbs))
--processes PROCESSES
Number of processes to use for enrichment.
--allow_negative_values
Allow negative values in input matrix.
--ko Compare KO ids (annotated with DIAMOND)
--ko_hmm Compare KO ids (annotated with HMMs)
--pfam Compare Pfam ids
--tigrfam Compare TIGRFAM ids
--cluster Compare cluster ids
--ortholog Compare ortholog ids
--cazy Compare dbCAN ids
--ec Compare EC ids
--range RANGE Base pair range to search for synteny within. Default = 2500.
--subblock_size SUBBLOCK_SIZE
Number of genes clustered in a row to be reported. Default = 2.
--operon_mismatch_cutoff OPERON_MISMATCH_CUTOFF
Number of allowed mismatches when searching for operons across genomes. Default = 2.
--operon_match_score_cutoff OPERON_MATCH_SCORE_CUTOFF
Score cutoff for operon matches
None
> enrichm pathway -h
# enrichm pathway -h
/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
usage: enrichm pathway [-h] --matrix MATRIX [--genome_metadata GENOME_METADATA] [--abundance ABUNDANCE] [--abundance_metadata ABUNDANCE_METADATA] [--tpm_values TPM_VALUES] [--tpm_metadata TPM_METADATA] [--metabolome METABOLOME] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT]
[--force] [--limit LIMIT [LIMIT ...]] [--filter FILTER [FILTER ...]] [--enrichment_output ENRICHMENT_OUTPUT]
optional arguments:
-h, --help show this help message and exit
Input options:
--matrix MATRIX KO matrix. REQUIRED.
--genome_metadata GENOME_METADATA
Metadata file with two columns, the first with the genome name, the second with the groupings to compare.
--abundance ABUNDANCE
Abundance matrix.
--abundance_metadata ABUNDANCE_METADATA
Metadata file with two columns, the first with the genome name, the second with the groupings to compare.
TPM values produced by DetectM.
Metadata file with two columns, the first with the genome name, the second with the groupings to compare.
--metabolome METABOLOME
Metabolome CID matrix.
Logging options:
--log LOG Output logging information to this file.
--verbosity VERBOSITY
Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
Output options:
--output OUTPUT Output directory
--force Overwrite previous run
Pathway options:
--limit LIMIT [LIMIT ...]
USE ONLY these reactions, or reactions within this pathway or module (space separated list).
--filter FILTER [FILTER ...]
Do not use these reactions, or reactions within this pathway or module (space separated list).
--enrichment_output ENRICHMENT_OUTPUT
Supply an enrichment output to integrate the results into the output network.
> enrichm explore -h
# enrichm explore -h
/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
usage: enrichm explore [-h] --matrix MATRIX [--genome_metadata GENOME_METADATA] [--abundance ABUNDANCE] [--abundance_metadata ABUNDANCE_METADATA] [--tpm_values TPM_VALUES] [--tpm_metadata TPM_METADATA] [--metabolome METABOLOME] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT]
[--force] --queries QUERIES [--depth DEPTH]
optional arguments:
-h, --help show this help message and exit
Input options:
--matrix MATRIX KO matrix. REQUIRED.
--genome_metadata GENOME_METADATA
Metadata file with two columns, the first with the genome name, the second with the groupings to compare.
--abundance ABUNDANCE
Abundance matrix.
--abundance_metadata ABUNDANCE_METADATA
Metadata file with two columns, the first with the genome name, the second with the groupings to compare.
TPM values produced by DetectM.
Metadata file with two columns, the first with the genome name, the second with the groupings to compare.
--metabolome METABOLOME
Metabolome CID matrix.
Logging options:
--log LOG Output logging information to this file.
--verbosity VERBOSITY
Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
Output options:
--output OUTPUT Output directory
--force Overwrite previous run
Query options:
--queries QUERIES A file containing the KEGG ids of the compounds from which to start in the metabolic network
--depth DEPTH Number of steps to take into the metabolic network
> enrichm generate -h
# enrichm generate -h
/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
usage: enrichm generate [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] --input_matrix INPUT_MATRIX --groups GROUPS --model_type {regressor,classifier} [--testing_portion TESTING_PORTION] [--grid_search] [--threads THREADS]
optional arguments:
-h, --help show this help message and exit
Logging options:
--log LOG Output logging information to this file.
--verbosity VERBOSITY
Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
Output options:
--output OUTPUT Output directory
--force Overwrite previous run
Generate options:
--input_matrix INPUT_MATRIX
input matrix of results
--groups GROUPS defined outcomes to train the data to
--model_type {regressor,classifier}
regressor or classifier
--testing_portion TESTING_PORTION
portion of the input data to use for testing (default = 0.2)
Tuning options:
--grid_search grid search
--threads THREADS number of threads to use for hyperparameterization (default = all available)
> enrichm predict -h
# enrichm predict -h
/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
usage: enrichm predict [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] --forester_model_directory FORESTER_MODEL_DIRECTORY --input_matrix INPUT_MATRIX
optional arguments:
-h, --help show this help message and exit
Logging options:
--log LOG Output logging information to this file.
--verbosity VERBOSITY
Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
Output options:
--output OUTPUT Output directory
--force Overwrite previous run
Predict options:
--forester_model_directory FORESTER_MODEL_DIRECTORY
Pickled model to use
--input_matrix INPUT_MATRIX
matrix of data to predict
データベース
EnrichM を実行する前にデータベース(5.7G)をダウンロードする。
mkdir ~/databases
enrichm data --create
EnrichMがゲノムのアノテーションや比較を行うために必要な全ての参照データベース(
Pfam-A HMM、TIGRfam HMM、ECおよびKOアノテーション付きのuniref100の配列のDIAMONDデータベース、およびKoFamKOALA HMM)が含まれている。デフォルトではホームディレクトリのdatabasesにダウンロードされる(--outputオプションで指定可とあるが、バージョンによってはこのオプションが存在しない)。
ダウンロード後にホーム以外に変更した時は、bashの環境変数ENRICHM_DBとしてエクスポートする。
export ENRICHM_DB=/path/to/database/
実行方法
annotate
dbCANを使用して、集団のゲノムにKO、PFAM、TIGRFAM、CAZYのアノテーションをつける。fastaファイルの拡張子は.fnaである必要がある。
enrichm annotate --genome_directory MAGs_dir/ --output outdir --threads 12
出力例
サブディレクトリ
追加のアノテーションを行う。
annotate --output /data/outdir --genome_directory /data/fasta/ --threads 30 --ko --ko_hmm --pfam --tigrfam --clusters --orthologs --orthogroup --cazy --ec
出力
他にもいくつかコマンドがあります。
classify
annotateコマンドの出力のKOアノテーションをマトリックス形式(KO IDを行、ゲノムを列)で読み込んで、どのKEGGモジュールが完全であるかを判断する。
enrichment
KOやPFAMのアノテーションをマトリックス(IDが行、ゲノムが列)と、ゲノムを比較するグループに分けたメタデータファイルの形で読み込んで、基本的な統計を実行する。さらに、グループ間やグループ内でのモジュールやpfamクランのエンリッチメントを判定する。
pathway
KOマトリックスを読み込んで、Cytoscapeで読める代謝ネットワークとメタデータファイルを生成する。
explore
pathwayと似ているが、指定されたパスウェイを生成するのではなく、指定されたクエリ化合物IDから出発し、入力されたKOマトリックスに存在する酵素を考慮して、その化合物を使用する可能性のある反応を探索する。
引用
https://github.com/geronimp/enrichM
Comparative genomics using EnrichM. Joel A Boyd Ben J Woodcroft Gene W Tyson. 2019. In preparation.
関連