macでインフォマティクス (Informatics on a Mac)

A collection of informatics notes related to HTS (NGS).

MMseqs2 commands, part 3: downloading existing databases with the mmseqs databases command

 

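MMseqs2 has a very large number of features. This post tries the mmseqs databases command. With mmseqs databases you can download databases from UniProt, GTDB, NCBI nr/nt and other sources as ready-to-use MMseqs2 databases, then use them for MMseqs2 homology searches and taxonomic assignment (*1).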

 

Installation

See the previous post.

mmseqs -h

# mmseqs -h

MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast, 

parallelized protein sequence searches and clustering of huge protein sequence data sets.

 

Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).

 

MMseqs2 Version: 13.45111

© Martin Steinegger (martin.steinegger@snu.ac.kr)

 

usage: mmseqs <command> [<args>]

 

Easy workflows for plain text input/output

  easy-search       Sensitive homology search

  easy-linsearch    Fast, less sensitive homology search

  easy-cluster      Slower, sensitive clustering

  easy-linclust     Fast linear time cluster, less sensitive clustering

  easy-taxonomy     Taxonomic classification

  easy-rbh          Find reciprocal best hit

 

Main workflows for database input/output

  search            Sensitive homology search

  linsearch         Fast, less sensitive homology search

  map               Map nearly identical sequences

  rbh               Reciprocal best hit search

  linclust          Fast, less sensitive clustering

  cluster           Slower, sensitive clustering

  clusterupdate     Update previous clustering with new sequences

  taxonomy          Taxonomic classification

 

Input database creation

  databases         List and download databases

  createdb          Convert FASTA/Q file(s) to a sequence DB

  createindex       Store precomputed index on disk to reduce search overhead

  createlinindex    Create linsearch index

  convertmsa        Convert Stockholm/PFAM MSA file to a MSA DB

  tsv2db            Convert a TSV file to any DB

  tar2db            Convert content of tar archives to any DB

  msa2profile       Convert a MSA DB to a profile DB

 

Handle databases on storage and memory

  compress          Compress DB entries

  decompress        Decompress DB entries

  rmdb              Remove a DB

  mvdb              Move a DB

  cpdb              Copy a DB

  lndb              Symlink a DB

  unpackdb          Unpack a DB into separate files

  touchdb           Preload DB into memory (page cache)

 

Unite and intersect databases

  createsubdb       Create a subset of a DB from list of DB keys

  concatdbs         Concatenate two DBs, giving new IDs to entries from 2nd DB

  splitdb           Split DB into subsets

  mergedbs          Merge entries from multiple DBs

  subtractdbs       Remove all entries from first DB occurring in second DB by key

 

Format conversion for downstream processing

  convertalis       Convert alignment DB to BLAST-tab, SAM or custom format

  createtsv         Convert result DB to tab-separated flat file

  convert2fasta     Convert sequence DB to FASTA format

  result2flat       Create flat file by adding FASTA headers to DB entries

  createseqfiledb   Create a DB of unaligned FASTA entries

  taxonomyreport    Create a taxonomy report in Kraken or Krona format

 

Sequence manipulation/transformation

  extractorfs       Six-frame extraction of open reading frames

  extractframes     Extract frames from a nucleotide sequence DB

  orftocontig       Write ORF locations in alignment format

  reverseseq        Reverse (without complement) sequences

  translatenucs     Translate nucleotides to proteins

  translateaa       Translate proteins to lexicographically lowest codons

  splitsequence     Split sequences by length

  masksequence      Soft mask sequence DB using tantan

  extractalignedregion Extract aligned sequence region from query

 

Result manipulation 

  swapresults       Transpose prefilter/alignment DB

  result2rbh        Filter a merged result DB to retain only reciprocal best hits

  result2msa        Compute MSA DB from a result DB

  result2dnamsa     Compute MSA DB with out insertions in the query for DNA sequences

  result2stats      Compute statistics for each entry in a DB

  filterresult      Pairwise alignment result filter

  offsetalignment   Offset alignment by ORF start position

  proteinaln2nucl   Transform protein alignments to nucleotide alignments

  result2repseq     Get representative sequences from result DB

  sortresult        Sort a result DB in the same order as the prefilter or align module

  summarizealis     Summarize alignment result to one row (uniq. cov., cov., avg. seq. id.)

  summarizeresult   Extract annotations from alignment DB

 

Taxonomy assignment 

  createtaxdb       Add taxonomic labels to sequence DB

  createbintaxonomy Create binary taxonomy from NCBI input

  addtaxonomy       Add taxonomic labels to result DB

  taxonomyreport    Create a taxonomy report in Kraken or Krona format

  filtertaxdb       Filter taxonomy result database

  filtertaxseqdb    Filter taxonomy sequence database

  aggregatetax      Aggregate multiple taxon labels to a single label

  aggregatetaxweights Aggregate multiple taxon labels to a single label

  lcaalign          Efficient gapped alignment for lca computation

  lca               Compute the lowest common ancestor

  majoritylca       Compute the lowest common ancestor using majority voting

 

Multi-hit search    

  multihitdb        Create sequence DB for multi hit searches

  multihitsearch    Search with a grouped set of sequences against another grouped set

  besthitperset     For each set of sequences compute the best element and update p-value

  combinepvalperset For each set compute the combined p-value

  mergeresultsbyset Merge results from multiple ORFs back to their respective contig

 

Prefiltering        

  prefilter         Double consecutive diagonal k-mer search

  ungappedprefilter Optimal diagonal score search

  kmermatcher       Find bottom-m-hashed k-mer matches within sequence DB

  kmersearch        Find bottom-m-hashed k-mer matches between target and query DB

 

Alignment           

  align             Optimal gapped local alignment

  alignall          Within-result all-vs-all gapped local alignment

  transitivealign   Transfer alignments via transitivity

  rescorediagonal   Compute sequence identity for diagonal

  alignbykmer       Heuristic gapped local k-mer based alignment

 

Clustering          

  clust             Cluster result by Set-Cover/Connected-Component/Greedy-Incremental

  clusthash         Hash-based clustering of equal length sequences

  mergeclusters     Merge multiple cascaded clustering steps

 

Profile databases   

  result2profile    Compute profile DB from a result DB

  msa2result        Convert a MSA DB to a profile DB

  msa2profile       Convert a MSA DB to a profile DB

  profile2pssm      Convert a profile DB to a tab-separated PSSM file

  profile2consensus Extract consensus sequence DB from a profile DB

  profile2repseq    Extract representative sequence DB from a profile DB

  convertprofiledb  Convert a HH-suite HHM DB to a profile DB

 

Profile-profile databases

  enrich            Boost diversity of search result

  result2pp         Merge two profile DBs by shared hits

  profile2cs        Convert a profile DB into a column state sequence DB

  convertca3m       Convert a cA3M DB to a result DB

  expandaln         Expand an alignment result based on another

  expand2profile    Expand an alignment result based on another and create a profile

 

Utility modules to manipulate DBs

  view              Print DB entries given in --id-list to stdout

  apply             Execute given program on each DB entry

  filterdb          DB filtering by given conditions

  swapdb            Transpose DB with integer values in first column

  prefixid          For each entry in a DB prepend the entry key to the entry itself

  suffixid          For each entry in a DB append the entry key to the entry itself

  renamedbkeys      Create a new DB with original keys renamed

 

Special-purpose utilities

  diffseqdbs        Compute diff of two sequence DBs

  summarizetabs     Extract annotations from HHblits BLAST-tab-formatted results

  gff2db            Extract regions from a sequence database based on a GFF3 file

  maskbygff         Mask out sequence regions in a sequence DB by features selected from a GFF3 file

  convertkb         Convert UniProtKB data to a DB

  summarizeheaders  Summarize FASTA headers of result DB

  nrtotaxmapping    Create taxonomy mapping for NR database

  extractdomains    Extract highest scoring alignment regions for each sequence from BLAST-tab file

  countkmer         Count k-mers

 

Bash completion for modules and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".

Include the location of the MMseqs2 binary in your "$PATH" environment variable.

#Version: 13.45111

>mmseqs databases

# mmseqs databases 

usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]

 

  Name                Type      Taxonomy Url                                                           

- UniRef100           Aminoacid      yes https://www.uniprot.org/help/uniref

- UniRef90            Aminoacid      yes https://www.uniprot.org/help/uniref

- UniRef50            Aminoacid      yes https://www.uniprot.org/help/uniref

- UniProtKB           Aminoacid      yes https://www.uniprot.org/help/uniprotkb

- UniProtKB/TrEMBL    Aminoacid      yes https://www.uniprot.org/help/uniprotkb

- UniProtKB/Swiss-Prot Aminoacid      yes https://uniprot.org

- NR                  Aminoacid      yes https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA

- NT                  Nucleotide       - https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA

- GTDB                Aminoacid      yes https://gtdb.ecogenomic.org

- PDB                 Aminoacid        - https://www.rcsb.org

- PDB70               Profile          - https://github.com/soedinglab/hh-suite

- Pfam-A.full         Profile          - https://pfam.xfam.org

- Pfam-A.seed         Profile          - https://pfam.xfam.org

- Pfam-B              Profile          - https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released

- CDD                 Profile          - https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

- eggNOG              Profile          - http://eggnog5.embl.de

- dbCAN2              Profile          - http://bcb.unl.edu/dbCAN2

- SILVA               Nucleotide     yes https://www.arb-silva.de

- Resfinder           Nucleotide       - https://cge.cbs.dtu.dk/services/ResFinder

- Kalamari            Nucleotide     yes https://github.com/lskatz/Kalamari

options:                   

 --compressed INT   Write compressed output [0]

 --threads INT      Number of CPU-cores used (all by default) [12]

 -v INT             Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

 

references:

 - Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

 - Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, 2020.11.27.401018 (2020)

 

Show an extended list of options by calling 'mmseqs databases -h'.

# Run mmseqs databases -h for the detailed help

 

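Databases marked yes in the Taxonomy column are fully compatible with seqTaxDB, the MMseqs2 database type that carries taxonomy, and they include a file mapping each database key to its taxon ID. In other words, once you download one of these databases, you can use it as-is to assign taxonomic labels to query sequences. The result database produced by searching sequences against an MMseqs2 seqTaxDB is called a taxonomyResult (from the manual).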
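As a small illustration of reading the Taxonomy column, the sketch below filters a few rows copied from the listing above and keeps only the taxonomy-ready entries. The function name taxonomy_ready is made up for this example, and the parsing assumes the whitespace-separated column layout shown in this post.

```shell
# A few rows copied from the `mmseqs databases` listing shown above
listing='- UniRef50            Aminoacid      yes https://www.uniprot.org/help/uniref
- NT                  Nucleotide       - https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB                Aminoacid      yes https://gtdb.ecogenomic.org
- PDB                 Aminoacid        - https://www.rcsb.org
- SILVA               Nucleotide     yes https://www.arb-silva.de
- Kalamari            Nucleotide     yes https://github.com/lskatz/Kalamari'

# taxonomy_ready (hypothetical helper): read listing rows on stdin and print
# "name<TAB>type" for rows whose Taxonomy column is "yes".
# Assumed row layout: "- <name> <type> <yes|-> <url>"
taxonomy_ready() {
  awk '$1 == "-" && $4 == "yes" { print $2 "\t" $3 }'
}

tax_dbs=$(printf '%s\n' "$listing" | taxonomy_ready)
printf '%s\n' "$tax_dbs"
```

Piping the real listing into taxonomy_ready should behave the same way, assuming the listing is written to standard output and keeps this layout.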

 

 

Usage

1. Download the database

Here I use SILVA, which supports seqTaxDB and is small.

mkdir tmp
mmseqs databases SILVA SILVA_database tmp
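Because a download can take a while, it may be convenient to skip it when the database is already there. The following is a minimal sketch, not part of MMseqs2: db_ready and fetch_db are hypothetical helper names, and the check assumes that a finished MMseqs2 database has both a .index and a .dbtype file next to it.

```shell
# db_ready NAME (hypothetical helper): succeed if NAME.index and NAME.dbtype exist.
# Assumption: these two files are present for a finished MMseqs2 database.
db_ready() {
  [ -f "$1.index" ] && [ -f "$1.dbtype" ]
}

# fetch_db NAME OUT TMP (hypothetical helper): download only when OUT is missing
fetch_db() {
  if db_ready "$2"; then
    echo "skip: $2 already exists"
  else
    mkdir -p "$3" && mmseqs databases "$1" "$2" "$3"
  fi
}
```

With these defined, fetch_db SILVA SILVA_database tmp runs the same download as above the first time and prints a skip message on later runs.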


 

 

2. Convert the query sequences (nucleotide sequences here) into a database.

mmseqs createdb input.fasta queryDB
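Before converting, a quick record count of the FASTA file can catch an empty or truncated input. This is only a sketch: count_records is a hypothetical helper, and the two-record file is made-up demonstration data.

```shell
# count_records FASTA (hypothetical helper): number of header lines (">") in the file
count_records() {
  grep -c '^>' "$1"
}

# Made-up two-record nucleotide file, used only for this demonstration
demo=$(mktemp)
printf '>seq1\nACGTACGT\n>seq2\nGGCCTTAA\n' > "$demo"

n_records=$(count_records "$demo")
echo "records: $n_records"
rm -f "$demo"
```

After createdb, the number of lines in queryDB.index would be expected to match this count, assuming the index holds one line per entry.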

To speed up loading, you can also precompute an index for the target database (mmseqs createindex SILVA_database tmp --search-type 3).

 

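3. Create a temporary directory. Searches are I/O-heavy, so use fast storage such as an SSD for large searches, and make sure tmp has enough free space.

Here it is created in the current directory. If tmp already exists from step 1, -p keeps mkdir from failing.

mkdir -p tmp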
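One way to check the free-space requirement before starting is to read it from df. The sketch below uses the POSIX df -Pk output format. The helper names free_kb_at and has_space and the 1 GiB threshold are placeholders of mine, not values recommended by MMseqs2.

```shell
# free_kb_at DIR (hypothetical helper): free space on DIR's filesystem, in KiB.
# With df -P, the second output line holds the values and column 4 is "Available".
free_kb_at() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_space DIR MIN_KB (hypothetical helper): succeed if at least MIN_KB KiB are free
has_space() {
  [ "$(free_kb_at "$1")" -ge "$2" ]
}

mkdir -p tmp
free_kb=$(free_kb_at tmp)
if has_space tmp 1048576; then   # 1 GiB, an arbitrary example threshold
  echo "tmp has at least 1 GiB free ($free_kb KiB)"
else
  echo "tmp is low on space ($free_kb KiB)" >&2
fi
```

The right threshold depends on the database and query size, so treat the number here as something to adjust.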

 

4. Search the nucleotide queries against the nucleotide database.

mmseqs search queryDB SILVA_database resultDB tmp --search-type 3
  • -s    Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive (default 5.7)
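Steps 1 to 4 can be collected into a single script. The sketch below is one possible arrangement: run, silva_search, and the DRY_RUN variable are names invented for this example. With DRY_RUN set, the commands are only printed, which makes it easy to review the plan before running anything.

```shell
# run CMD... (hypothetical helper): execute the command, or only print it when DRY_RUN is set
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# silva_search QUERY_FASTA (hypothetical helper): steps 1-4 of this post, in order
silva_search() {
  run mkdir -p tmp &&
  run mmseqs databases SILVA SILVA_database tmp &&
  run mmseqs createdb "$1" queryDB &&
  run mmseqs search queryDB SILVA_database resultDB tmp --search-type 3
}

# Print the plan without executing anything
plan=$(DRY_RUN=1; silva_search input.fasta)
printf '%s\n' "$plan"
```

Calling silva_search input.fasta without DRY_RUN would execute the four commands for real, stopping at the first one that fails.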

 

5. Convert the result database to a BLAST tab-format file.

mmseqs convertalis queryDB SILVA_database resultDB result.m8
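The BLAST tab-format file is easy to post-process with standard tools. As an example, the sketch below keeps the highest-scoring hit per query. The three rows are made-up sample data in the usual 12-column layout, with the bit score assumed to be in column 12, and best_hits is a hypothetical helper name.

```shell
# Made-up rows in the 12-column BLAST-tab layout:
# query, target, identity, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits
m8=$(mktemp)
printf 'q1\ttA\t0.98\t100\t2\t0\t1\t100\t1\t100\t1e-40\t180\n' > "$m8"
printf 'q1\ttB\t0.90\t100\t10\t0\t1\t100\t5\t104\t1e-30\t150\n' >> "$m8"
printf 'q2\ttC\t0.95\t80\t4\t0\t1\t80\t1\t80\t1e-25\t120\n' >> "$m8"

# best_hits FILE (hypothetical helper): per query, keep the row with the highest
# bit score (column 12), then sort the rows for stable output
best_hits() {
  awk -F '\t' '!($1 in best) || $12 + 0 > best[$1] { best[$1] = $12 + 0; row[$1] = $0 }
               END { for (q in row) print row[q] }' "$1" | sort
}

top=$(best_hits "$m8")
printf '%s\n' "$top"
rm -f "$m8"
```

For the real output, best_hits result.m8 would be the equivalent call.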

 

References

*1

Fast and sensitive taxonomic assignment to metagenomic contigs 
M Mirdita, M Steinegger, F Breitwieser, J Söding, E Levy Karin
Bioinformatics, Published: 18 March 2021 

 

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Steinegger M, Söding J

Nat Biotechnol. 2017 Nov;35(11):1026-1028

 

MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.
Hauser M, Steinegger M, Söding J

Bioinformatics. 2016 May 1;32(9):1323-30.