macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

MMseqs2 コマンド其の4 分類群をアサインする mmseqs taxonomyコマンド

 

今年出た論文(*1)より

MMseqs2 taxonomyは、メタゲノムのコンティグに分類学上のラベルを付与する新しいツールである。各コンティグから可能性のある全てのタンパク質断片を抽出し、分類学的なアノテーションに貢献できるものを素早く取り出し、それらにロバストなラベルを付与し、weighted votingによってコンティグの分類学的な同一性を決定する。そのフラグメント抽出ステップは、生命の全てのドメインの分析に適している。MMseqs2 taxonomyは、最先端のツールと比較して2~18倍の速度で動作し、分類学的参照データベースの作成と操作、分類学的割り当ての報告と視覚化のための新しいモジュールも含まれている。
MMseqs2 taxonomyは、フリーのMMseqs2オープンソースソフトウェアパッケージの一部であり、LinuxmacOSWindowsで利用できる。

 

user guide

https://mmseqs.com/latest/userguide.pdf

tutorial

https://github.com/soedinglab/MMseqs2/wiki/Tutorials

 

 

f:id:kazumaxneo:20210916002624p:plain

MMseqs2 taxnomyのLCA (wiki) 推定アプローチ (2bLCA)。マニュアルより転載。

 

 

 インストール

以前の記事を参照

> mmseqs taxonomy -h

# mmseqs taxonomy -h

usage: mmseqs taxonomy <i:queryDB> <i:targetDB> <o:taxaDB> <tmpDir> [options]

 By Milot Mirdita <milot@mirdita.de> & Martin Steinegger <martin.steinegger@snu.ac.kr> & Eli Levy Karin <eli.levy.karin@gmail.com>

options: prefilter:                      

 --comp-bias-corr INT             Correct for locally biased amino acid composition (range 0-1) [1]

 --add-self-matches BOOL          Artificially add entries of queries with themselves (for clustering) [0]

 --seed-sub-mat TWIN              Substitution matrix file for k-mer generation [nucl:nucleotide.out,aa:VTML80.out]

 -s FLOAT                         Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [2.000]

 -k INT                           k-mer length (0: automatically set to optimum) [0]

 --k-score INT                    k-mer threshold for generating similar k-mer lists [2147483647]

 --alph-size TWIN                 Alphabet size (range 2-21) [nucl:5,aa:21]

 --max-seqs INT                   Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [300]

 --split INT                      Split input into N equally distributed chunks. 0: set the best split automatically [0]

 --split-mode INT                 0: split target db; 1: split query db; 2: auto, depending on main memory [2]

 --split-memory-limit BYTE        Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

 --diag-score BOOL                Use ungapped diagonal scoring during prefilter [1]

 --exact-kmer-matching INT        Extract only exact k-mers for matching (range 0-1) [0]

 --mask INT                       Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [1]

 --mask-lower-case INT            Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

 --min-ungapped-score INT         Accept only matches with ungapped alignment score above threshold [15]

 --spaced-kmer-mode INT           0: use consecutive positions in k-mers; 1: use spaced k-mers [1]

 --spaced-kmer-pattern STR        User-specified spaced k-mer pattern

 --local-tmp STR                  Path where some of the temporary files will be created

 --disk-space-limit BYTE          Set max disk space to use for reverse profile searches. E.g. 800B, 5K, 10M, 1G. Default (0) to all available disk space in the temp folder [0]

align:                          

 -a BOOL                          Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

 --alignment-mode INT             How to compute the alignment:

                                  0: automatic

                                  1: only score and end_pos

                                  2: also start_pos and cov

                                  3: also seq.id

                                  4: only ungapped alignment [1]

 --alignment-output-mode INT      How to compute the alignment:

                                  0: automatic

                                  1: only score and end_pos

                                  2: also start_pos and cov

                                  3: also seq.id

                                  4: only ungapped alignment

                                  5: score only (output) cluster format [0]

 --wrapped-scoring BOOL           Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]

 -e DOUBLE                        List matches below this E-value (range 0.0-inf) [1.000E+00]

 --min-seq-id FLOAT               List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]

 --min-aln-len INT                Minimum alignment length (range 0-INT_MAX) [0]

 --seq-id-mode INT                0: alignment length 1: shorter, 2: longer sequence [0]

 --alt-ali INT                    Show up to this many alternative alignments [0]

 -c FLOAT                         List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]

 --cov-mode INT                   0: coverage of query and target

                                  1: coverage of target

                                  2: coverage of query

                                  3: target seq. length has to be at least x% of query length

                                  4: query seq. length has to be at least x% of target length

                                  5: short seq. needs to be at least x% of the other seq. length [0]

 --max-rejected INT               Maximum rejected alignments before alignment calculation for a query is stopped [5]

 --max-accept INT                 Maximum accepted alignments before alignment calculation for a query is stopped [30]

 --score-bias FLOAT               Score bias when computing SW alignment (in bits) [0.000]

 --realign BOOL                   Compute more conservative, shorter alignments (scores and E-values not changed) [0]

 --realign-score-bias FLOAT       Additional bias when computing realignment [-0.200]

 --realign-max-seqs INT           Maximum number of results to return in realignment [2147483647]

 --gap-open TWIN                  Gap open cost [nucl:5,aa:11]

 --gap-extend TWIN                Gap extension cost [nucl:2,aa:1]

 --zdrop INT                      Maximal allowed difference between score values before alignment is truncated  (nucleotide alignment only) [40]

 --exhaustive-search-filter INT   Filter result during search: 0: do not filter, 1: filter [0]

profile:                        

 --pca FLOAT                      Pseudo count admixture strength [1.000]

 --pcb FLOAT                      Pseudo counts: Neff at half of maximum admixture (range 0.0-inf) [1.500]

 --mask-profile INT               Mask query sequence of profile using tantan [0,1] [1]

 --e-profile DOUBLE               Include sequences matches with < E-value thr. into the profile (>=0.0) [1.000E-03]

 --wg BOOL                        Use global sequence weighting for profile calculation [0]

 --filter-msa INT                 Filter msa: 0: do not filter, 1: filter [1]

 --max-seq-id FLOAT               Reduce redundancy of output MSA using max. pairwise sequence identity [0.0,1.0] [0.900]

 --qid FLOAT                      Reduce diversity of output MSAs using min.seq. identity with query sequences [0.0,1.0] [0.000]

 --qsc FLOAT                      Reduce diversity of output MSAs using min. score per aligned residue with query sequences [-50.0,100.0] [-20.000]

 --cov FLOAT                      Filter output MSAs using min. fraction of query residues covered by matched sequences [0.0,1.0] [0.000]

 --diff INT                       Filter MSAs by selecting most diverse set of sequences, keeping at least this many seqs in each MSA block of length 50 [1000]

 --num-iterations INT             Number of iterative profile search iterations [1]

 --exhaustive-search BOOL         For bigger profile DB, run iteratively the search by greedily swapping the search results [0]

 --lca-search BOOL                Efficient search for LCA candidates [0]

misc:                           

 --orf-filter INT                 Prefilter query ORFs with non-selective search

                                  Only used during nucleotide-vs-protein classification

                                  NOTE: Consider disabling when classifying short reads [1]

 --orf-filter-e DOUBLE            E-value threshold used for query ORF prefiltering [1.000E+02]

 --orf-filter-s FLOAT             Sensitivity used for query ORF prefiltering [2.000]

 --lca-mode INT                   LCA Mode 1: single search LCA , 2/3: approximate 2bLCA, 4: top hit [3]

 --tax-output-mode INT            0: output LCA, 1: output alignment 2: output both [0]

 --majority FLOAT                 minimal fraction of agreement among taxonomically assigned sequences of a set [0.500]

 --vote-mode INT                  Mode of assigning weights to compute majority. 0: uniform, 1: minus log E-value, 2: score [1]

 --lca-ranks STR                  Add column with specified ranks (',' separated)

 --tax-lineage INT                0: don't show, 1: add all lineage names, 2: add all lineage taxids [0]

 --blacklist STR                  Comma separated list of ignored taxa in LCA computation [12908:unclassified sequences,28384:other sequences]

 --rescore-mode INT               Rescore diagonals with:

                                  0: Hamming distance

                                  1: local alignment (score only)

                                  2: local alignment

                                  3: global alignment

                                  4: longest alignment fulfilling window quality criterion [0]

 --allow-deletion BOOL            Allow deletions in a MSA [0]

 --min-length INT                 Minimum codon number in open reading frames [30]

 --max-length INT                 Maximum codon number in open reading frames [32734]

 --max-gaps INT                   Maximum number of codons with gaps or unknown residues before an open reading frame is rejected [2147483647]

 --contig-start-mode INT          Contig start can be 0: incomplete, 1: complete, 2: both [2]

 --contig-end-mode INT            Contig end can be 0: incomplete, 1: complete, 2: both [2]

 --orf-start-mode INT             Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle) [1]

 --forward-frames STR             Comma-separated list of frames on the forward strand to be extracted [1,2,3]

 --reverse-frames STR             Comma-separated list of frames on the reverse strand to be extracted [1,2,3]

 --translation-table INT          1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE

                                  9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL

                                  15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL

                                  23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA

                                   29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA [1]

 --translate INT                  Translate ORF to amino acid [0]

 --use-all-table-starts BOOL      Use all alternatives for a start codon in the genetic table, if false - only ATG (AUG) [0]

 --id-offset INT                  Numeric ids in index file are offset by this value [0]

 --add-orf-stop BOOL              Add stop codon '*' at complete start and end [0]

 --sequence-overlap INT           Overlap between sequences [0]

 --sequence-split-mode INT        Sequence split mode 0: copy data, 1: soft link data and write new index, [1]

 --headers-split-mode INT         Header split mode: 0: split position, 1: original header [0]

 --search-type INT                Search type 0: auto 1: amino acid, 2: translated, 3: nucleotide, 4: translated nucleotide alignment [0]

 --start-sens FLOAT               Start sensitivity [4.000]

 --sens-steps INT                 Number of search steps performed from --start-sens to -s [1]

common:                         

 --compressed INT                 Write compressed output [0]

 --threads INT                    Number of CPU-cores used (all by default) [12]

 -v INT                           Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

 --sub-mat TWIN                   Substitution matrix file [nucl:nucleotide.out,aa:blosum62.out]

 --max-seq-len INT                Maximum sequence length [65535]

 --db-load-mode INT               Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

 --mpi-runner STR                 Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")

 --force-reuse BOOL               Reuse tmp filse in tmp/latest folder ignoring parameters and version changes [0]

 --remove-tmp-files BOOL          Delete temporary files [0]

expert:                         

 --filter-hits BOOL               Filter hits by seq.id. and coverage [0]

 --sort-results INT               Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]

 --create-lookup INT              Create database lookup file (can be very large) [0]

 --chain-alignments INT           Chain overlapping alignments [0]

 --merge-query INT                Combine ORFs/split sequences to a single entry [1]

 --strand INT                     Strand selection only works for DNA/DNA search 0: reverse, 1: forward, 2: both [1]

 

examples:

 # Download a sequence database with taxonomy information

 mmseqs databases UniProtKB/Swiss-Prot swissprotDB tmp

 

 # Assign taxonomy based on 2bLCA

 mmseqs taxonomy queryDB swissprotDB result tmp

 

 # Assign taxonomy based on top hit

 mmseqs taxonomy queryDB swissprotDB result tmp --lca-mode 4

 

 # Assign taxonomy without ORF prefilter

 # Classifies higher percentage for short nucleotide input (e.g. short reads) at the cost of speed

 mmseqs taxonomy queryNuclDB swissprotDB result tmp --orf-filter 0

 

 # Create a Krona report

 mmseqs taxonomyreport swissprotDB result report.html --report-mode 1

 

references:

 - Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

 - Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, 2020.11.27.401018 (2020)

 

 

 

データベースの準備

前回の記事に書きましたが、mmseqs databaseコマンドを使うことでビルド済みのデータベースをダウンロードすることができます。ここではメタゲノムアセンブリのコンティグにラベルを当てることを想定して、Unirefデータベースをダウンロードします(紹介した時の記事)。

mkdir tmp
mmseqs databases UniRef90 UniRef90_DB tmp

(UnirefはseqTaxDBに対応している)

 

mmseqs taxonomyのラン

1、クエリの配列(ここではメタゲノムのアセンブリ)をデータベースに変換する。

mmseqs createdb metagenome.fasta queryDB

 

2、一時ディレクトリtmpの作成(高いI/Oが必要なので大規模な検索ではSSDなどを使う。ただし、tmpには十分な空き容量がないといけない)。

 

3、 mmseqs taxonomyの実行

1で作成したクエリ配列のデータベース、準備したデータベース、出力prefix、作業ディレクトリの順番に指定する。

mmseqs taxonomy queryDB UniRef90_DB output_prefix tmp --majority 0.5 --lca-mode 3 --vote-mode 1 --tax-lineage 2 --orf-filter 1
  • --tax-lineage INT   0: don't show, 1: add all lineage names, 2: add all lineage taxids [0]
  •  --majority FLOAT   minimal fraction of agreement among taxonomically assigned sequences of a set [0.500]
  • --vote-mode INT   Mode of assigning weights to compute majority. 0: uniform, 1: minus log E-value, 2: score [1]
  • --lca-mode INT   LCA Mode 1: single search LCA , 2/3: approximate 2bLCA, 4: top hit [3]
  • --orf-filter INT   Prefilter query ORFs with non-selective search. Only used during nucleotide-vs-protein classification.  NOTE: Consider disabling when classifying short reads [1]

6フレームで含まれるタンパク質断片を抽出し、2bLCA法で検索する(--lca-mode 3)。次に、割り当てられたすべてのフラグメントの中から投票を行い、-log(E-value)の重みが50%以上の支持を得た(--majority 0.5)最も具体的な分類ラベルを選択する (--vote-mode 1)。パラメータ--tax-lineage 2は、NCBIのtaxidsとしての完全な系統情報を出力することを示している。

ラン中はデフォルトで利用可能なすべてのCPUが使用される。その間、他の計算は出来ないほど効率的にコアが使用されるので注意する。計算が終わると、output_prefixのファイルが複数生成される。

 

4、出力をTSVに変換する。

1で作成したクエリ配列のデータベース、3の出力prefix、出力タブ区切りファイルの順番に指定する。

mmseqs createtsv queryDB output_prefix result.tsv

f:id:kazumaxneo:20210915214301p:plain

 

easy-taxonomyコマンドを使うとより簡単に実行できます。

mmseqs easy-taxonomy QUERY.fasta targetDB output_prefix  tmp

 

引用

Fast and sensitive taxonomic assignment to metagenomic contigs 
M Mirdita, M Steinegger, F Breitwieser, J Söding, E Levy Karin
Bioinformatics, Published: 18 March 2021 

 

関連