MMseqs2 commands, part 3: downloading existing databases with the mmseqs databases command

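MMseqs2 has a very large number of features. This time I try the mmseqs databases command. With mmseqs databases you can download public databases such as UniProt, GTDB, and NCBI nr/nt in ready-to-use MMseqs2 database format, and then use them for homology searches and taxonomic assignment (*1) with MMseqs2.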

Installation

See the previous article.

> mmseqs -h

# mmseqs -h

MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast,

parallelized protein sequence searches and clustering of huge protein sequence data sets.

Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).

MMseqs2 Version: 13.45111

usage: mmseqs <command> [<args>]

Easy workflows for plain text input/output

easy-search Sensitive homology search

easy-linsearch Fast, less sensitive homology search

easy-cluster Slower, sensitive clustering

easy-linclust Fast linear time cluster, less sensitive clustering

easy-taxonomy Taxonomic classification

easy-rbh Find reciprocal best hit

Main workflows for database input/output

search Sensitive homology search

linsearch Fast, less sensitive homology search

map Map nearly identical sequences

rbh Reciprocal best hit search

linclust Fast, less sensitive clustering

cluster Slower, sensitive clustering

clusterupdate Update previous clustering with new sequences

taxonomy Taxonomic classification

Input database creation

databases List and download databases

createdb Convert FASTA/Q file(s) to a sequence DB

createindex Store precomputed index on disk to reduce search overhead

createlinindex Create linsearch index

convertmsa Convert Stockholm/PFAM MSA file to a MSA DB

tsv2db Convert a TSV file to any DB

tar2db Convert content of tar archives to any DB

msa2profile Convert a MSA DB to a profile DB

Handle databases on storage and memory

compress Compress DB entries

decompress Decompress DB entries

rmdb Remove a DB

mvdb Move a DB

cpdb Copy a DB

lndb Symlink a DB

unpackdb Unpack a DB into separate files

touchdb Preload DB into memory (page cache)

Unite and intersect databases

createsubdb Create a subset of a DB from list of DB keys

concatdbs Concatenate two DBs, giving new IDs to entries from 2nd DB

splitdb Split DB into subsets

mergedbs Merge entries from multiple DBs

subtractdbs Remove all entries from first DB occurring in second DB by key

Format conversion for downstream processing

convertalis Convert alignment DB to BLAST-tab, SAM or custom format

createtsv Convert result DB to tab-separated flat file

convert2fasta Convert sequence DB to FASTA format

result2flat Create flat file by adding FASTA headers to DB entries

createseqfiledb Create a DB of unaligned FASTA entries

taxonomyreport Create a taxonomy report in Kraken or Krona format

Sequence manipulation/transformation

extractorfs Six-frame extraction of open reading frames

extractframes Extract frames from a nucleotide sequence DB

orftocontig Write ORF locations in alignment format

reverseseq Reverse (without complement) sequences

translatenucs Translate nucleotides to proteins

translateaa Translate proteins to lexicographically lowest codons

splitsequence Split sequences by length

masksequence Soft mask sequence DB using tantan

extractalignedregion Extract aligned sequence region from query

Result manipulation

swapresults Transpose prefilter/alignment DB

result2rbh Filter a merged result DB to retain only reciprocal best hits

result2msa Compute MSA DB from a result DB

result2dnamsa Compute MSA DB with out insertions in the query for DNA sequences

result2stats Compute statistics for each entry in a DB

filterresult Pairwise alignment result filter

offsetalignment Offset alignment by ORF start position

proteinaln2nucl Transform protein alignments to nucleotide alignments

result2repseq Get representative sequences from result DB

sortresult Sort a result DB in the same order as the prefilter or align module

summarizealis Summarize alignment result to one row (uniq. cov., cov., avg. seq. id.)

summarizeresult Extract annotations from alignment DB

Taxonomy assignment

createtaxdb Add taxonomic labels to sequence DB

createbintaxonomy Create binary taxonomy from NCBI input

addtaxonomy Add taxonomic labels to result DB

taxonomyreport Create a taxonomy report in Kraken or Krona format

filtertaxdb Filter taxonomy result database

filtertaxseqdb Filter taxonomy sequence database

aggregatetax Aggregate multiple taxon labels to a single label

aggregatetaxweights Aggregate multiple taxon labels to a single label

lcaalign Efficient gapped alignment for lca computation

lca Compute the lowest common ancestor

majoritylca Compute the lowest common ancestor using majority voting

Multi-hit search

multihitdb Create sequence DB for multi hit searches

multihitsearch Search with a grouped set of sequences against another grouped set

besthitperset For each set of sequences compute the best element and update p-value

combinepvalperset For each set compute the combined p-value

mergeresultsbyset Merge results from multiple ORFs back to their respective contig

Prefiltering

prefilter Double consecutive diagonal k-mer search

ungappedprefilter Optimal diagonal score search

kmermatcher Find bottom-m-hashed k-mer matches within sequence DB

kmersearch Find bottom-m-hashed k-mer matches between target and query DB

Alignment

align Optimal gapped local alignment

alignall Within-result all-vs-all gapped local alignment

transitivealign Transfer alignments via transitivity

rescorediagonal Compute sequence identity for diagonal

alignbykmer Heuristic gapped local k-mer based alignment

Clustering

clust Cluster result by Set-Cover/Connected-Component/Greedy-Incremental

clusthash Hash-based clustering of equal length sequences

mergeclusters Merge multiple cascaded clustering steps

Profile databases

result2profile Compute profile DB from a result DB

msa2result Convert a MSA DB to a profile DB

msa2profile Convert a MSA DB to a profile DB

profile2pssm Convert a profile DB to a tab-separated PSSM file

profile2consensus Extract consensus sequence DB from a profile DB

profile2repseq Extract representative sequence DB from a profile DB

convertprofiledb Convert a HH-suite HHM DB to a profile DB

Profile-profile databases

enrich Boost diversity of search result

result2pp Merge two profile DBs by shared hits

profile2cs Convert a profile DB into a column state sequence DB

convertca3m Convert a cA3M DB to a result DB

expandaln Expand an alignment result based on another

expand2profile Expand an alignment result based on another and create a profile

Utility modules to manipulate DBs

view Print DB entries given in --id-list to stdout

apply Execute given program on each DB entry

filterdb DB filtering by given conditions

swapdb Transpose DB with integer values in first column

prefixid For each entry in a DB prepend the entry key to the entry itself

suffixid For each entry in a DB append the entry key to the entry itself

renamedbkeys Create a new DB with original keys renamed

Special-purpose utilities

diffseqdbs Compute diff of two sequence DBs

summarizetabs Extract annotations from HHblits BLAST-tab-formatted results

gff2db Extract regions from a sequence database based on a GFF3 file

maskbygff Mask out sequence regions in a sequence DB by features selected from a GFF3 file

convertkb Convert UniProtKB data to a DB

summarizeheaders Summarize FASTA headers of result DB

nrtotaxmapping Create taxonomy mapping for NR database

extractdomains Extract highest scoring alignment regions for each sequence from BLAST-tab file

countkmer Count k-mers

Bash completion for modules and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".

Include the location of the MMseqs2 binary in your "$PATH" environment variable.

#Version: 13.45111

>mmseqs databases

# mmseqs databases

usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]

  Name                    Type        Taxonomy  Url
- UniRef100               Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniRef90                Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniRef50                Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniProtKB               Aminoacid   yes       https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL        Aminoacid   yes       https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot    Aminoacid   yes       https://uniprot.org
- NR                      Aminoacid   yes       https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                      Nucleotide  -         https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB                    Aminoacid   yes       https://gtdb.ecogenomic.org
- PDB                     Aminoacid   -         https://www.rcsb.org
- PDB70                   Profile     -         https://github.com/soedinglab/hh-suite
- Pfam-A.full             Profile     -         https://pfam.xfam.org
- Pfam-A.seed             Profile     -         https://pfam.xfam.org
- Pfam-B                  Profile     -         https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released
- CDD                     Profile     -         https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- eggNOG                  Profile     -         http://eggnog5.embl.de
- dbCAN2                  Profile     -         http://bcb.unl.edu/dbCAN2
- SILVA                   Nucleotide  yes       https://www.arb-silva.de
- Resfinder               Nucleotide  -         https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari                Nucleotide  yes       https://github.com/lskatz/Kalamari

options:

--compressed INT Write compressed output [0]

--threads INT Number of CPU-cores used (all by default) [12]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

- Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, 2020.11.27.401018 (2020)

Show an extended list of options by calling 'mmseqs databases -h'.

# Run mmseqs databases -h for the detailed help

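Databases marked yes in the Taxonomy column are fully compatible with seqTaxDB, the MMseqs2 database type that carries taxonomy, and they include a mapping file from each database key to its taxon ID. In other words, once such a database is downloaded it can be used as-is to assign taxonomic labels to query sequences. The result database produced by searching sequences against an MMseqs2 seqTaxDB is called a taxonomyResult (from the manual).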
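To pick out the taxonomy-ready databases from the listing, a small awk filter is enough. This is a minimal sketch that assumes the listing keeps the column layout shown above (dash, name, type, taxonomy flag, URL). The helper name taxonomy_ready is hypothetical, and the sample rows are copied from the listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an "mmseqs databases" listing on stdin and print
# the names of the databases whose Taxonomy column is "yes".
# Assumes whitespace-separated columns: "-", Name, Type, Taxonomy, Url.
taxonomy_ready() {
  awk '$1 == "-" && $4 == "yes" { print $2 }'
}

# Sample rows copied from the listing above.
sample_listing='- NT Nucleotide - https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB Aminoacid yes https://gtdb.ecogenomic.org
- PDB Aminoacid - https://www.rcsb.org
- SILVA Nucleotide yes https://www.arb-silva.de'

printf '%s\n' "$sample_listing" | taxonomy_ready
```

With a real installation, something like `mmseqs databases 2>&1 | taxonomy_ready` should give the same kind of list. The `2>&1` is there because I have not checked which stream the listing is written to.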

Usage

1. Download the database

Here I use SILVA, which supports seqTaxDB and is small.

mkdir tmp
mmseqs databases SILVA SILVA_database tmp
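Since downloads can be slow, it may help to skip the step when the database is already there. The sketch below assumes that an MMseqs2 database with prefix SILVA_database consists of at least the data file plus `.index` and `.dbtype` companions. The helper names db_present and fetch_silva are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a database with the given prefix looks
# complete (non-empty data file, .index file and .dbtype file).
db_present() {
  local prefix="$1"
  [ -s "$prefix" ] && [ -s "$prefix.index" ] && [ -s "$prefix.dbtype" ]
}

# Download SILVA only when it is missing and mmseqs is on PATH.
fetch_silva() {
  local prefix="${1:-SILVA_database}" tmp="${2:-tmp}"
  if db_present "$prefix"; then
    echo "skip: $prefix already exists"
  elif command -v mmseqs >/dev/null 2>&1; then
    mkdir -p "$tmp"
    mmseqs databases SILVA "$prefix" "$tmp"
  else
    echo "mmseqs not found on PATH" >&2
    return 1
  fi
}
```

Calling `fetch_silva` with no arguments uses the same names as the commands above (SILVA_database and tmp).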

Output

[Figure: output of the download command]

Supported databases

$ mmseqs databases
usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]

Downloads are supported for many databases, starting with UniProt. UniRef, NR/NT, PDB, eggNOG, Pfam, and CDD are likely to be especially useful.

2. Convert the query sequences (nucleotide sequences here) into a database.

mmseqs createdb input.fasta queryDB
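Before converting, a quick sanity check on the FASTA file can catch an empty or wrong input early. This is a minimal sketch: count_fasta_records is a hypothetical helper, and the two-record FASTA is made up purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records by counting header lines (">").
count_fasta_records() {
  grep -c '^>' "$1" || true   # grep exits non-zero when the count is 0
}

# Made-up two-record FASTA, for illustration only.
demo_fasta="$(mktemp)"
printf '>seq1\nACGTACGT\n>seq2\nGGCCTTAA\n' > "$demo_fasta"

n="$(count_fasta_records "$demo_fasta")"
if [ "$n" -gt 0 ]; then
  echo "records: $n (ok to run: mmseqs createdb $demo_fasta queryDB)"
else
  echo "no FASTA records found" >&2
fi
rm -f "$demo_fasta"
```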

To speed up read-in, you can additionally precompute an index for the target database (mmseqs createindex SILVA_database tmp --search-type 3).

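3. Create a temporary directory. Searches are I/O-heavy, so for large searches put it on fast storage such as an SSD, and make sure it has enough free space.

Here it is created in the current directory.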

mkdir -p tmp
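The free-space requirement can be checked before starting a long search. The sketch below relies on POSIX `df -Pk`. The helper names avail_kib and has_space are hypothetical, and the 1 GiB threshold is an arbitrary placeholder, not a figure from the MMseqs2 documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: free space (KiB) on the filesystem that holds a directory.
# Uses POSIX "df -Pk"; column 4 of the second line is the available space.
avail_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed when the directory has at least the requested number of KiB free.
has_space() {
  [ "$(avail_kib "$1")" -ge "$2" ]
}

# 1048576 KiB (1 GiB) is an arbitrary placeholder, not an MMseqs2 requirement.
if has_space . 1048576; then
  echo "enough space for tmp"
else
  echo "low disk space: $(avail_kib .) KiB free" >&2
fi
```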

4. Search the nucleotide queries against the nucleotide database.

mmseqs search queryDB SILVA_database resultDB tmp --search-type 3

-s Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive (default 5.7)
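To compare sensitivity settings, the same search can be repeated with different -s values. The sketch below only prints the commands rather than running them. search_cmd and the resultDB_s... output names are hypothetical, and for nucleotide searches the effect of -s may be limited, so treat this as an illustration of where the flag goes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the search command for one sensitivity.
# Each setting writes to its own result database so runs do not overwrite
# each other.
search_cmd() {
  local sens="${1:-5.7}"
  printf 'mmseqs search queryDB SILVA_database resultDB_s%s tmp --search-type 3 -s %s\n' \
    "$sens" "$sens"
}

# Print the commands for the three settings named in the help text.
for s in 1.0 4.0 7.5; do
  search_cmd "$s"
done
```

Piping the printed lines to `sh` would execute them once the databases from the earlier steps exist.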

5. Convert the result database into a BLAST-tab formatted file.

mmseqs convertalis queryDB SILVA_database resultDB result.m8
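The resulting result.m8 can be post-processed with standard tools. The sketch below keeps the top-scoring hit per query, assuming the default 12-column BLAST-tab layout with the query in column 1 and the bit score in column 12. best_hits is a hypothetical helper, and the three rows are made up for illustration, not real search output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the highest bit-score hit per query from a
# 12-column BLAST-tab file (column 1 = query, column 12 = bit score).
best_hits() {
  awk -F '\t' '
    !($1 in best) || $12 + 0 > best[$1] + 0 { best[$1] = $12; line[$1] = $0 }
    END { for (q in line) print line[q] }
  ' "$1" | sort
}

# Made-up rows in BLAST-tab layout, for illustration only.
demo_m8="$(mktemp)"
printf 'q1\ttB\t0.90\t100\t10\t0\t1\t100\t1\t100\t1e-40\t150\n'  > "$demo_m8"
printf 'q1\ttA\t0.98\t100\t2\t0\t1\t100\t5\t104\t1e-50\t180\n'  >> "$demo_m8"
printf 'q2\ttC\t0.95\t80\t4\t0\t1\t80\t1\t80\t1e-30\t120\n'     >> "$demo_m8"

best_hits "$demo_m8" | cut -f1,2,12
rm -f "$demo_m8"
```

On a real run, `best_hits result.m8 > best.m8` would write one line per query.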

References

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics. Published 18 March 2021.

Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017 Nov;35(11):1026-1028.

Hauser M, Steinegger M, Söding J. MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics. 2016 May 1;32(9):1323-30.