macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

大規模なメタゲノムアセンブリ(MAG)を対象とした比較ゲノミクスツール群 EnrichM

 

EnrichMは、大規模なメタゲノム・アセンブリゲノム(MAG)を対象とした比較ゲノミクスツール群である。現在の機能は以下の通りである。

  1. MAGの基本的なアノテーションパイプライン。
  2. KEGGモジュールを参考にして、MAGがコード化している代謝パスウェイを決定するパイプライン(ただし、カスタムパスウェイの指定も可能
  3. ユーザーが定義したゲノムグループ(機能的に関連するゲノム、系統的に関連するゲノム、異なる環境から回収されたゲノムなど)内およびグループ間で濃縮されている遺伝子や代謝パスウェイを同定するパイプライン。
  4. アノテーションされた集団ゲノムから代謝ネットワークを構築する。
  5. MAG、メタゲノム、トランスクリプトームのいずれかの機能組成から、ランダムフォレスト機械学習モデルを構築する。
  6. これらのランダムフォレスト機械学習モデルを、新しいMAGsメタゲノムの分類に適用する。

EnrichMは現在開発中のため、マスターが安定しているという保証はない。EnrichMはpypiまたはconda(下記参照)を使ってダウンロードすることを勧める。

 

help(現在は情報なし)

https://github.com/geronimp/enrichM/wiki/

 

インストール

condaでpython3.7の環境を作って、pipも使って導入した。

依存

EnrichM is written in python 3, and required >v3.6 to run. 

  • hmmer >= 3.1b
  • diamond == 0.9.22
  • prodigal >= 2.6.3
  • parallel >= 20180222
  • mmseqs >= 2-23394
  • R >= 3.0.1
  • mcl >= 14-137

Github

#pip (pypi)
pip3 install enrichm

#bioconda (link)
mamba install -c bioconda -y enrichm

#ここではcondaで環境を作ってpipとcondaを使い分けて入れる
mamba create -n enrichm python=3.7 -y
conda activate enrichm
pip3 install enrichm
mamba install -c conda-forge parallel -y
mamba install -c bioconda -y hmmer
mamba install -c conda-forge -c bioconda mmseqs2 -y
mamba install -c bioconda prodigal -y
mamba install -c bioconda orthofinder -y

#fuzzywuzzyとIPythonがなかったので追加で導入
pip install fuzzywuzzy IPython

> enrichm

 

_____ _ _ __ __ 

| ____|_ __ _ __(_) ___| |__ | \/ |

| _| | '_ \| '__| |/ __| '_ \| |\/| |

| |___| | | | | | | (__| | | | | | |

|_____|_| |_|_| |_|\___|_| |_|_| |_|

------------------------------------------------------------------------------------

 

Annotation

annotate -> Genome annotation pipeline.

 

Enrichment analysis

classify -> Determine what pathways a genome encodes.

enrichment -> Calculate enrichment of functional genes between groups.

 

Network analysis

pathway -> Generate a metabolic network from specific KEGG module or

compounds.

explore -> Take steps into metabolism using a KEGG compound ID as a 

starting point. Useful to see which pathways use a compound

of interest.

 

Machine learning

generate -> Generate a random forest model.

predict -> Run random forest model on new data.

 

Authors: Joel Boyd, Ben Woodcroft, Alex Baker

Version: 0.6.3

 

> enrichm annotate

# enrichm annotate

/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning

  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

usage: enrichm annotate [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] [--genome_files GENOME_FILES [GENOME_FILES ...]] [--genome_directory GENOME_DIRECTORY] [--protein_files PROTEIN_FILES [PROTEIN_FILES ...]] [--protein_directory PROTEIN_DIRECTORY] [--ko]

                        [--ko_hmm] [--pfam] [--tigrfam] [--clusters] [--orthologs] [--orthogroup] [--cazy] [--ec] [--cut_ga_pfam] [--cut_nc_pfam] [--cut_tc_pfam] [--cut_ga_tigrfam] [--cut_nc_tigrfam] [--cut_tc_tigrfam] [--cut_ko] [--evalue EVALUE] [--bit BIT] [--id ID]

                        [--aln_query ALN_QUERY] [--aln_reference ALN_REFERENCE] [--c C] [--threads THREADS] [--parallel PARALLEL] [--inflation INFLATION] [--suffix SUFFIX] [--light] [--count_domains] [--chunk_number CHUNK_NUMBER] [--chunk_max CHUNK_MAX]

 

optional arguments:

  -h, --help            show this help message and exit

 

Logging options:

  --log LOG             Output logging information to this file.

  --verbosity VERBOSITY

                        Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent

 

Output options:

  --output OUTPUT       Output directory

  --force               Overwrite previous run

 

Input options (Use one):

  --genome_files GENOME_FILES [GENOME_FILES ...]

                        Space separated list of genomes to annotate

  --genome_directory GENOME_DIRECTORY

                        Directory containing genomes to annotate

  --protein_files PROTEIN_FILES [PROTEIN_FILES ...]

                        Space separated list of .faa files of genomes to annotate. Protein files need to be generated by prodigal.

  --protein_directory PROTEIN_DIRECTORY

                        Directory containing .faa files of genomes to annotate. Protein files need to be generated by prodigal.

 

Annotations options:

  --ko                  Annotate with KO ids

  --ko_hmm              Annotate with KO ids

  --pfam                Annotate with Pfam HMMs

  --tigrfam             Annotate with TIGRFAM HMMs

  --clusters            Annotate with cluster ids

  --orthologs           Annotate with ortholog ids

  --orthogroup          Annotate with orthogroup ids with OrthoFinder

  --cazy                Annotate with dbCAN HMMs

  --ec                  Annotate with EC ids

 

Cutoff options:

  --cut_ga_pfam         For PFAM searches: use profiles GA gathering cutoffs to set all thresholding

  --cut_nc_pfam         For PFAM searches: use profiles NC noise cutoffs to set all thresholding

  --cut_tc_pfam         For PFAM searches: use profiles TC trusted cutoffs to set all thresholding

  --cut_ga_tigrfam      For TIGRfam searches: use profiles GA gathering cutoffs to set all thresholding

  --cut_nc_tigrfam      For TIGRfam searches: use profiles NC noise cutoffs to set all thresholding

  --cut_tc_tigrfam      For TIGRfam searches: use profiles TC trusted cutoffs to set all thresholding

  --cut_ko              For KO HMM annotation searches: use cutoffs defined by KEGG to maximise F-score.

  --evalue EVALUE       Use this evalue cutoff to filter false positives (default: 1e-05)

  --bit BIT             Use this bit score cutoff to filter false positives (default: 0)

  --id ID               Use this percent identity cutoff to filter false positives (default: 0.3)

  --aln_query ALN_QUERY

                        This fraction of the query must align to the reference (default: 0.7)

  --aln_reference ALN_REFERENCE

                        This fraction of the reference must align to the query (default: 0.7)

  --c C                 When clustering, use matches above this fraction of aligned (covered) query and target residues (default: 0.8)

 

Runtime options:

  --threads THREADS     Use this number of threads when annotating with BLAST and HMMsearch (default: 1)

  --parallel PARALLEL   Run this number of jobs in parallel when annotating with HMMsearch (default: 5)

  --inflation INFLATION

                        Inflation factor to use when calling clusters in ortholog (default = 1.5)

  --suffix SUFFIX       Treat files ending with this suffix within the --genome_directory as genomes (default: .fna for --genome_directory and .faa for )

  --light               Don't store metadata for genome files (can't use enrichM compare downstream, default=False)

  --count_domains       Fill the frequency matrix with the total number of times an annotation was detected (for example, when one domain more than once within a protein), rather than the count of proteins with with that annotation

  --chunk_number CHUNK_NUMBER

                        Split loading of genomes into this many chunks (default = 4)

  --chunk_max CHUNK_MAX

                        Maximum number of genomes to load per chunk (default = 2500)

None

enrichm classify

# enrichm classify

/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning

  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

usage: enrichm classify [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] [--genome_and_annotation_matrix GENOME_AND_ANNOTATION_MATRIX] [--custom_modules CUSTOM_MODULES] [--module_rules_json MODULE_RULES_JSON] [--gff_files GFF_FILES] [--cutoff CUTOFF] [--aggregate]

 

optional arguments:

  -h, --help            show this help message and exit

 

Logging options:

  --log LOG             Output logging information to this file.

  --verbosity VERBOSITY

                        Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent

 

Output options:

  --output OUTPUT       Output directory

  --force               Overwrite previous run

 

Input options:

  --genome_and_annotation_matrix GENOME_AND_ANNOTATION_MATRIX

                        Path to file containing a genome annotation matrix

  --custom_modules CUSTOM_MODULES

                        Tab separated file containing module name, definition as the columns

  --module_rules_json MODULE_RULES_JSON

                        json file specifying rules to interpret the annotation and guide module annotation

  --gff_files GFF_FILES

                        GFF files for the genomes being classified.

 

Cutoff options:

  --cutoff CUTOFF       Output only modules with greater than this percent of the requied KO groups (default = print all modules)

 

Runtime options:

  --aggregate           Calculate the abundance of each pathway within each genome/sample (column)

None

> enrichm enrichment

# enrichm enrichment

/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning

  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

usage: enrichm enrichment [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] [--annotate_output ANNOTATE_OUTPUT] [--metadata METADATA] [--annotation_matrix ANNOTATION_MATRIX] [--gff_files GFF_FILES [GFF_FILES ...]] [--abundance ABUNDANCE]

                          [--abundance_metadata ABUNDANCE_METADATA] [--transcriptome TRANSCRIPTOME] [--transcriptome_metadata TRANSCRIPTOME_METADATA] [--batchfile BATCHFILE] [--pval_cutoff PVAL_CUTOFF] [--proportions_cutoff PROPORTIONS_CUTOFF] [--threshold THRESHOLD]

                          [--multi_test_correction MULTI_TEST_CORRECTION] [--processes PROCESSES] [--allow_negative_values] [--ko] [--ko_hmm] [--pfam] [--tigrfam] [--cluster] [--ortholog] [--cazy] [--ec] [--range RANGE] [--subblock_size SUBBLOCK_SIZE]

                          [--operon_mismatch_cutoff OPERON_MISMATCH_CUTOFF] [--operon_match_score_cutoff OPERON_MATCH_SCORE_CUTOFF]

 

optional arguments:

  -h, --help            show this help message and exit

 

Logging options:

  --log LOG             Output logging information to this file.

  --verbosity VERBOSITY

                        Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent

 

Output options:

  --output OUTPUT       Output directory

  --force               Overwrite previous run

 

Input options:

  --annotate_output ANNOTATE_OUTPUT

                        Output directory provided by enrichm annotate

  --metadata METADATA   Metadata file with two columns, the first with the genome name, the second with the groupings to compare.

  --annotation_matrix ANNOTATION_MATRIX

                        Annotation matrix to compare.

  --gff_files GFF_FILES [GFF_FILES ...]

                        Gff files for genomes to compare.

  --abundance ABUNDANCE

                        Genome abundance matrix.

  --abundance_metadata ABUNDANCE_METADATA

                        Metadata grouping abundance samples.

  --transcriptome TRANSCRIPTOME

                        Genome abundance matrix.

  --transcriptome_metadata TRANSCRIPTOME_METADATA

                        Metadata grouping abundance samples.

 

Genome Taxonomy DataBase (GTDB) options:

  --batchfile BATCHFILE

                        metadata file to compare with.

 

Runtime options:

  --pval_cutoff PVAL_CUTOFF

                        Only output results with a p-value below a this cutoff (default=0.05).

  --proportions_cutoff PROPORTIONS_CUTOFF

                        Proportion enrichment cutoff.

  --threshold THRESHOLD

                        The threshold to control for in false discovery rate of familywise error rate.

  --multi_test_correction MULTI_TEST_CORRECTION

                        The form of mutiple test correction to use. Uses the statsmodel module and consequently has all of its options.

                        Default: Benjamini-Hochberg FDR (fdr_bh) 

                        Options: Bonferroni (b) 

                         Sidak (s) 

                         Holm (h) 

                         Holm-Sidak (hs) 

                         Simes-Hochberg (sh) 

                         Hommel (ho) 

                         FDR Benjamini-Yekutieli (fdr_by) 

                         FDR 2-stage Benjamini-Hochberg (fdr_tsbh) 

                         FDR 2-stage Benjamini-Krieger-Yekutieli (fdr_tsbky) 

                         FDR adaptive Gavrilov-Benjamini-Sarkar (fdr_gbs))

  --processes PROCESSES

                        Number of processes to use for enrichment.

  --allow_negative_values

                        Allow negative values in input matrix.

  --ko                  Compare KO ids (annotated with DIAMOND)

  --ko_hmm              Compare KO ids (annotated with HMMs)

  --pfam                Compare Pfam ids

  --tigrfam             Compare TIGRFAM ids

  --cluster             Compare cluster ids

  --ortholog            Compare ortholog ids

  --cazy                Compare dbCAN ids

  --ec                  Compare EC ids

  --range RANGE         Base pair range to search for synteny within. Default = 2500.

  --subblock_size SUBBLOCK_SIZE

                        Number of genes clustered in a row to be reported. Default = 2.

  --operon_mismatch_cutoff OPERON_MISMATCH_CUTOFF

                        Number of allowed mismatches when searching for operons across genomes. Default = 2.

  --operon_match_score_cutoff OPERON_MATCH_SCORE_CUTOFF

                        Score cutoff for operon matches

None

> enrichm pathway -h

# enrichm pathway -h

/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning

  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

usage: enrichm pathway [-h] --matrix MATRIX [--genome_metadata GENOME_METADATA] [--abundance ABUNDANCE] [--abundance_metadata ABUNDANCE_METADATA] [--tpm_values TPM_VALUES] [--tpm_metadata TPM_METADATA] [--metabolome METABOLOME] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT]

                       [--force] [--limit LIMIT [LIMIT ...]] [--filter FILTER [FILTER ...]] [--enrichment_output ENRICHMENT_OUTPUT]

 

optional arguments:

  -h, --help            show this help message and exit

 

Input options:

  --matrix MATRIX       KO matrix. REQUIRED.

  --genome_metadata GENOME_METADATA

                        Metadata file with two columns, the first with the genome name, the second with the groupings to compare.

  --abundance ABUNDANCE

                        Abundance matrix.

  --abundance_metadata ABUNDANCE_METADATA

                        Metadata file with two columns, the first with the genome name, the second with the groupings to compare.

  --tpm_values TPM_VALUES

                        TPM values produced by DetectM.

  --tpm_metadata TPM_METADATA

                        Metadata file with two columns, the first with the genome name, the second with the groupings to compare.

  --metabolome METABOLOME

                        Metabolome CID matrix.

 

Logging options:

  --log LOG             Output logging information to this file.

  --verbosity VERBOSITY

                        Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent

 

Output options:

  --output OUTPUT       Output directory

  --force               Overwrite previous run

 

Pathway options:

  --limit LIMIT [LIMIT ...]

                        USE ONLY these reactions, or reactions within this pathway or module (space separated list).

  --filter FILTER [FILTER ...]

                        Do not use these reactions, or reactions within this pathway or module (space separated list).

  --enrichment_output ENRICHMENT_OUTPUT

                        Supply an enrichment output to integrate the results into the output network.

> enrichm  explore -h

# enrichm  explore -h

/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning

  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

usage: enrichm explore [-h] --matrix MATRIX [--genome_metadata GENOME_METADATA] [--abundance ABUNDANCE] [--abundance_metadata ABUNDANCE_METADATA] [--tpm_values TPM_VALUES] [--tpm_metadata TPM_METADATA] [--metabolome METABOLOME] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT]

                       [--force] --queries QUERIES [--depth DEPTH]

 

optional arguments:

  -h, --help            show this help message and exit

 

Input options:

  --matrix MATRIX       KO matrix. REQUIRED.

  --genome_metadata GENOME_METADATA

                        Metadata file with two columns, the first with the genome name, the second with the groupings to compare.

  --abundance ABUNDANCE

                        Abundance matrix.

  --abundance_metadata ABUNDANCE_METADATA

                        Metadata file with two columns, the first with the genome name, the second with the groupings to compare.

  --tpm_values TPM_VALUES

                        TPM values produced by DetectM.

  --tpm_metadata TPM_METADATA

                        Metadata file with two columns, the first with the genome name, the second with the groupings to compare.

  --metabolome METABOLOME

                        Metabolome CID matrix.

 

Logging options:

  --log LOG             Output logging information to this file.

  --verbosity VERBOSITY

                        Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent

 

Output options:

  --output OUTPUT       Output directory

  --force               Overwrite previous run

 

Query options:

  --queries QUERIES     A file containing the KEGG ids of the compounds from which to start in the metabolic network

  --depth DEPTH         Number of steps to take into the metabolic network

enrichm generate -h

# enrichm generate -h

/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning

  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

usage: enrichm generate [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] --input_matrix INPUT_MATRIX --groups GROUPS --model_type {regressor,classifier} [--testing_portion TESTING_PORTION] [--grid_search] [--threads THREADS]

 

optional arguments:

  -h, --help            show this help message and exit

 

Logging options:

  --log LOG             Output logging information to this file.

  --verbosity VERBOSITY

                        Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent

 

Output options:

  --output OUTPUT       Output directory

  --force               Overwrite previous run

 

Generate options:

  --input_matrix INPUT_MATRIX

                        input matrix of results

  --groups GROUPS       defined outcomes to train the data to

  --model_type {regressor,classifier}

                        regressor or classifier

  --testing_portion TESTING_PORTION

                        portion of the input data to use for testing (default = 0.2)

 

Tuning options:

  --grid_search         grid search

  --threads THREADS     number of threads to use for hyperparameterization (default = all available)

> enrichm predict -h    

# enrichm predict -h    

/root/miniconda3/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning

  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

usage: enrichm predict [-h] [--log LOG] [--verbosity VERBOSITY] [--output OUTPUT] [--force] --forester_model_directory FORESTER_MODEL_DIRECTORY --input_matrix INPUT_MATRIX

 

optional arguments:

  -h, --help            show this help message and exit

 

Logging options:

  --log LOG             Output logging information to this file.

  --verbosity VERBOSITY

                        Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent

 

Output options:

  --output OUTPUT       Output directory

  --force               Overwrite previous run

 

Predict options:

  --forester_model_directory FORESTER_MODEL_DIRECTORY

                        Pickled model to use

  --input_matrix INPUT_MATRIX

                        matrix of data to predict

 

 

データベース

EnrichM を実行する前にデータベース(5.7G)をダウンロードする。

mkdir ~/databases
enrichm data --create

EnrichMがゲノムのアノテーションや比較を行うために必要な全ての参照データベース(

Pfam-A HMM、TIGRfam HMM、ECおよびKOアノテーション付きのuniref100の配列のDIAMONDデータベース、およびKoFamKOALA HMM)が含まれている。デフォルトではホームディレクトリのdatabasesにダウンロードされる(--outputオプションで指定可とあるが、バージョンによってはこのオプションが存在しない)。

 

ダウンロード後にホーム以外に変更した時は、bash環境変数ENRICHM_DBとしてエクスポートする。

export ENRICHM_DB=/path/to/database/

 

 

実行方法

annotate  

dbCANを使用して、集団のゲノムにKO、PFAM、TIGRFAM、CAZYのアノテーションをつける。fastaファイルの拡張子は.fnaである必要がある。

enrichm annotate --genome_directory MAGs_dir/ --output outdir --threads 12

出力例

f:id:kazumaxneo:20211013122941p:plain

サブディレクト

f:id:kazumaxneo:20211013123218p:plain

 

追加のアノテーションを行う。

annotate --output /data/outdir --genome_directory /data/fasta/ --threads 30 --ko --ko_hmm --pfam --tigrfam --clusters --orthologs --orthogroup --cazy --ec

出力

f:id:kazumaxneo:20211013140231p:plain



 

他にもいくつかコマンドがあります。

classify

annotateコマンドの出力のKOアノテーションマトリックス形式(KO IDを行、ゲノムを列)で読み込んで、どのKEGGモジュールが完全であるかを判断する。

enrichment

KOやPFAMのアノテーションマトリックス(IDが行、ゲノムが列)と、ゲノムを比較するグループに分けたメタデータファイルの形で読み込んで、基本的な統計を実行する。さらに、グループ間やグループ内でのモジュールやpfamクランのエンリッチメントを判定する。

pathway

KOマトリックスを読み込んで、Cytoscapeで読める代謝ネットワークとメタデータファイルを生成する。

explore

pathwayと似ているが、指定されたパスウェイを生成するのではなく、指定されたクエリ化合物IDから出発し、入力されたKOマトリックスに存在する酵素を考慮して、その化合物を使用する可能性のある反応を探索する。

 

引用

https://github.com/geronimp/enrichM

Comparative genomics using EnrichM. Joel A Boyd Ben J Woodcroft Gene W Tyson. 2019. In preparation.

 

関連