A collection of notes on informatics for HTS (NGS).

MMseqs2 commands, part 3: downloading existing databases with the mmseqs databases command


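MMseqs2 has a very large number of features. This time I try the mmseqs databases command. With mmseqs databases you can download databases such as UniProt, GTDB, and NCBI nr/nt as ready-to-use MMseqs2 databases, and then use them for homology searches and taxonomic assignment (*1) with MMseqs2.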




mmseqs -h

# mmseqs -h

MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast, 

parallelized protein sequence searches and clustering of huge protein sequence data sets.


Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).


MMseqs2 Version: 13.45111

© Martin Steinegger (


usage: mmseqs <command> [<args>]


Easy workflows for plain text input/output

  easy-search       Sensitive homology search

  easy-linsearch    Fast, less sensitive homology search

  easy-cluster      Slower, sensitive clustering

  easy-linclust     Fast linear time cluster, less sensitive clustering

  easy-taxonomy     Taxonomic classification

  easy-rbh          Find reciprocal best hit


Main workflows for database input/output

  search            Sensitive homology search

  linsearch         Fast, less sensitive homology search

  map               Map nearly identical sequences

  rbh               Reciprocal best hit search

  linclust          Fast, less sensitive clustering

  cluster           Slower, sensitive clustering

  clusterupdate     Update previous clustering with new sequences

  taxonomy          Taxonomic classification


Input database creation

  databases         List and download databases

  createdb          Convert FASTA/Q file(s) to a sequence DB

  createindex       Store precomputed index on disk to reduce search overhead

  createlinindex    Create linsearch index

  convertmsa        Convert Stockholm/PFAM MSA file to a MSA DB

  tsv2db            Convert a TSV file to any DB

  tar2db            Convert content of tar archives to any DB

  msa2profile       Convert a MSA DB to a profile DB


Handle databases on storage and memory

  compress          Compress DB entries

  decompress        Decompress DB entries

  rmdb              Remove a DB

  mvdb              Move a DB

  cpdb              Copy a DB

  lndb              Symlink a DB

  unpackdb          Unpack a DB into separate files

  touchdb           Preload DB into memory (page cache)


Unite and intersect databases

  createsubdb       Create a subset of a DB from list of DB keys

  concatdbs         Concatenate two DBs, giving new IDs to entries from 2nd DB

  splitdb           Split DB into subsets

  mergedbs          Merge entries from multiple DBs

  subtractdbs       Remove all entries from first DB occurring in second DB by key


Format conversion for downstream processing

  convertalis       Convert alignment DB to BLAST-tab, SAM or custom format

  createtsv         Convert result DB to tab-separated flat file

  convert2fasta     Convert sequence DB to FASTA format

  result2flat       Create flat file by adding FASTA headers to DB entries

  createseqfiledb   Create a DB of unaligned FASTA entries

  taxonomyreport    Create a taxonomy report in Kraken or Krona format


Sequence manipulation/transformation

  extractorfs       Six-frame extraction of open reading frames

  extractframes     Extract frames from a nucleotide sequence DB

  orftocontig       Write ORF locations in alignment format

  reverseseq        Reverse (without complement) sequences

  translatenucs     Translate nucleotides to proteins

  translateaa       Translate proteins to lexicographically lowest codons

  splitsequence     Split sequences by length

  masksequence      Soft mask sequence DB using tantan

  extractalignedregion Extract aligned sequence region from query


Result manipulation 

  swapresults       Transpose prefilter/alignment DB

  result2rbh        Filter a merged result DB to retain only reciprocal best hits

  result2msa        Compute MSA DB from a result DB

  result2dnamsa     Compute MSA DB with out insertions in the query for DNA sequences

  result2stats      Compute statistics for each entry in a DB

  filterresult      Pairwise alignment result filter

  offsetalignment   Offset alignment by ORF start position

  proteinaln2nucl   Transform protein alignments to nucleotide alignments

  result2repseq     Get representative sequences from result DB

  sortresult        Sort a result DB in the same order as the prefilter or align module

  summarizealis     Summarize alignment result to one row (uniq. cov., cov., avg. seq. id.)

  summarizeresult   Extract annotations from alignment DB


Taxonomy assignment 

  createtaxdb       Add taxonomic labels to sequence DB

  createbintaxonomy Create binary taxonomy from NCBI input

  addtaxonomy       Add taxonomic labels to result DB

  taxonomyreport    Create a taxonomy report in Kraken or Krona format

  filtertaxdb       Filter taxonomy result database

  filtertaxseqdb    Filter taxonomy sequence database

  aggregatetax      Aggregate multiple taxon labels to a single label

  aggregatetaxweights Aggregate multiple taxon labels to a single label

  lcaalign          Efficient gapped alignment for lca computation

  lca               Compute the lowest common ancestor

  majoritylca       Compute the lowest common ancestor using majority voting


Multi-hit search    

  multihitdb        Create sequence DB for multi hit searches

  multihitsearch    Search with a grouped set of sequences against another grouped set

  besthitperset     For each set of sequences compute the best element and update p-value

  combinepvalperset For each set compute the combined p-value

  mergeresultsbyset Merge results from multiple ORFs back to their respective contig



  prefilter         Double consecutive diagonal k-mer search

  ungappedprefilter Optimal diagonal score search

  kmermatcher       Find bottom-m-hashed k-mer matches within sequence DB

  kmersearch        Find bottom-m-hashed k-mer matches between target and query DB



  align             Optimal gapped local alignment

  alignall          Within-result all-vs-all gapped local alignment

  transitivealign   Transfer alignments via transitivity

  rescorediagonal   Compute sequence identity for diagonal

  alignbykmer       Heuristic gapped local k-mer based alignment



  clust             Cluster result by Set-Cover/Connected-Component/Greedy-Incremental

  clusthash         Hash-based clustering of equal length sequences

  mergeclusters     Merge multiple cascaded clustering steps


Profile databases   

  result2profile    Compute profile DB from a result DB

  msa2result        Convert a MSA DB to a profile DB

  msa2profile       Convert a MSA DB to a profile DB

  profile2pssm      Convert a profile DB to a tab-separated PSSM file

  profile2consensus Extract consensus sequence DB from a profile DB

  profile2repseq    Extract representative sequence DB from a profile DB

  convertprofiledb  Convert a HH-suite HHM DB to a profile DB


Profile-profile databases

  enrich            Boost diversity of search result

  result2pp         Merge two profile DBs by shared hits

  profile2cs        Convert a profile DB into a column state sequence DB

  convertca3m       Convert a cA3M DB to a result DB

  expandaln         Expand an alignment result based on another

  expand2profile    Expand an alignment result based on another and create a profile


Utility modules to manipulate DBs

  view              Print DB entries given in --id-list to stdout

  apply             Execute given program on each DB entry

  filterdb          DB filtering by given conditions

  swapdb            Transpose DB with integer values in first column

  prefixid          For each entry in a DB prepend the entry key to the entry itself

  suffixid          For each entry in a DB append the entry key to the entry itself

  renamedbkeys      Create a new DB with original keys renamed


Special-purpose utilities

  diffseqdbs        Compute diff of two sequence DBs

  summarizetabs     Extract annotations from HHblits BLAST-tab-formatted results

  gff2db            Extract regions from a sequence database based on a GFF3 file

  maskbygff         Mask out sequence regions in a sequence DB by features selected from a GFF3 file

  convertkb         Convert UniProtKB data to a DB

  summarizeheaders  Summarize FASTA headers of result DB

  nrtotaxmapping    Create taxonomy mapping for NR database

  extractdomains    Extract highest scoring alignment regions for each sequence from BLAST-tab file

  countkmer         Count k-mers


Bash completion for modules and parameters can be installed by adding "source MMSEQS_HOME/util/" to your "$HOME/.bash_profile".

Include the location of the MMseqs2 binary in your "$PATH" environment variable.

#Version: 13.45111

>mmseqs databases

# mmseqs databases 

usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]


  Name                Type      Taxonomy Url                                                           

- UniRef100           Aminoacid      yes

- UniRef90            Aminoacid      yes

- UniRef50            Aminoacid      yes

- UniProtKB           Aminoacid      yes

- UniProtKB/TrEMBL    Aminoacid      yes

- UniProtKB/Swiss-Prot Aminoacid      yes

- NR                  Aminoacid      yes

- NT                  Nucleotide       -

- GTDB                Aminoacid      yes

- PDB                 Aminoacid        -

- PDB70               Profile          -

- Pfam-A.full         Profile          -

- Pfam-A.seed         Profile          -

- Pfam-B              Profile          -

- CDD                 Profile          -

- eggNOG              Profile          -

- dbCAN2              Profile          -

- SILVA               Nucleotide     yes

- Resfinder           Nucleotide       -

- Kalamari            Nucleotide     yes


 --compressed INT   Write compressed output [0]

 --threads INT      Number of CPU-cores used (all by default) [12]

 -v INT             Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]



 - Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

 - Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, 2020.11.27.401018 (2020)


Show an extended list of options by calling 'mmseqs databases -h'.

# Run mmseqs databases -h for detailed help


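Databases marked yes in the Taxonomy column are fully compatible with seqTaxDB, the MMseqs2 database type that carries taxonomy, and they include a mapping file from each database key to its taxon ID. In other words, once you download one of these databases, you can use it as is to assign taxonomic labels to query sequences. The result database produced by searching sequences against an MMseqs2 seqTaxDB is called a taxonomyResult (from the manual).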
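As a rough sketch of how a taxonomy-enabled database might be used: the taxonomy and createtsv modules both appear in the mmseqs -h listing above, but the function name assign_taxonomy, the result name taxResult, and passing --search-type 3 for a nucleotide database such as SILVA are assumptions made for illustration. queryDB is assumed to have been created from your own FASTA file with mmseqs createdb.

```shell
# Sketch: assign taxonomy to a query DB using a taxonomy-enabled target DB,
# then export the result as a TSV file.
# assign_taxonomy is a hypothetical helper name, not part of MMseqs2.
assign_taxonomy() {
  local query_db="$1" tax_db="$2" result="$3" tmpdir="$4"
  shift 4                      # any remaining arguments are passed to mmseqs taxonomy
  mkdir -p "$tmpdir"
  mmseqs taxonomy "$query_db" "$tax_db" "$result" "$tmpdir" "$@" &&
    mmseqs createtsv "$query_db" "$result" "${result}.tsv"
}

# Example (assumed names):
#   assign_taxonomy queryDB SILVA_database taxResult tmp --search-type 3
```

Passing extra options through unchanged keeps the helper neutral about which search settings suit a given database.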






mkdir tmp
mmseqs databases SILVA SILVA_database tmp
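Because these downloads can be large, it may be worth skipping the step when the database is already on disk. The following is a minimal sketch under two assumptions: the helper name fetch_db is made up for this example, and a finished MMseqs2 database named X is taken to be accompanied by a file X.dbtype.

```shell
# Sketch: download a database with mmseqs databases only if it is not present yet.
# fetch_db is a hypothetical helper name; the .dbtype check is an assumption
# about how a completed MMseqs2 database looks on disk.
fetch_db() {
  local name="$1" out="$2" tmpdir="$3"
  if [ -f "${out}.dbtype" ]; then
    echo "skip: ${out} already exists"
    return 0
  fi
  mkdir -p "$tmpdir"
  mmseqs databases "$name" "$out" "$tmpdir"
}

# Example, equivalent to the two commands above:
#   fetch_db SILVA SILVA_database tmp
```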






mmseqs createdb input.fasta queryDB

To speed up reading, you can additionally precompute an index for the target DB (mmseqs createindex SILVA_database tmp --search-type 3).




mkdir tmp



mmseqs search queryDB SILVA_database resultDB tmp --search-type 3
  • -s    Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive (default 5.7)



mmseqs convertalis queryDB SILVA_database resultDB result.m8
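The resulting result.m8 is a tab-separated BLAST-tab file, so it can be post-processed with standard tools. Here is a small sketch that keeps only the best hit per query. It assumes the default convertalis column layout (column 1 is the query, column 2 the target, column 12 the bit score), and best_hits is a name invented for this example.

```shell
# Sketch: keep the highest-bit-score hit for each query in a BLAST-tab (m8) file.
# Assumption: default column order, i.e. column 1 = query ID, column 12 = bit score.
best_hits() {
  awk -F'\t' '
    # Remember a line if the query is new or its bit score beats the stored one.
    !($1 in best) || $12 + 0 > best[$1] { best[$1] = $12 + 0; line[$1] = $0 }
    END { for (q in line) print line[q] }
  ' "$1" | sort -k1,1
}

# Example:
#   best_hits result.m8 > best_hits.m8
```

When several hits of one query share the top bit score, this keeps the first one encountered in the file.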




Fast and sensitive taxonomic assignment to metagenomic contigs 
M Mirdita, M Steinegger, F Breitwieser, J Söding, E Levy Karin
Bioinformatics, Published: 18 March 2021 


MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Steinegger M, Söding J

Nat Biotechnol. 2017 Nov;35(11):1026-1028


MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.
Hauser M, Steinegger M, Söding J

Bioinformatics. 2016 May 1;32(9):1323-30.