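MMseqs2 has a very large number of features. This time I try the mmseqs databases command. With mmseqs databases you can download prebuilt MMseqs2 databases from sources such as UniProt, GTDB, and NCBI nr/nt, and use them with MMseqs2 for homology searches or taxonomic assignment (*1).
Installation
See the earlier post.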
> mmseqs -h
# mmseqs -h
MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast,
parallelized protein sequence searches and clustering of huge protein sequence data sets.
Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).
MMseqs2 Version: 13.45111
© Martin Steinegger (martin.steinegger@snu.ac.kr)
usage: mmseqs <command> [<args>]
Easy workflows for plain text input/output
easy-search Sensitive homology search
easy-linsearch Fast, less sensitive homology search
easy-cluster Slower, sensitive clustering
easy-linclust Fast linear time cluster, less sensitive clustering
easy-taxonomy Taxonomic classification
easy-rbh Find reciprocal best hit
Main workflows for database input/output
search Sensitive homology search
linsearch Fast, less sensitive homology search
map Map nearly identical sequences
rbh Reciprocal best hit search
linclust Fast, less sensitive clustering
cluster Slower, sensitive clustering
clusterupdate Update previous clustering with new sequences
taxonomy Taxonomic classification
Input database creation
databases List and download databases
createdb Convert FASTA/Q file(s) to a sequence DB
createindex Store precomputed index on disk to reduce search overhead
createlinindex Create linsearch index
convertmsa Convert Stockholm/PFAM MSA file to a MSA DB
tsv2db Convert a TSV file to any DB
tar2db Convert content of tar archives to any DB
msa2profile Convert a MSA DB to a profile DB
Handle databases on storage and memory
compress Compress DB entries
decompress Decompress DB entries
rmdb Remove a DB
mvdb Move a DB
cpdb Copy a DB
lndb Symlink a DB
unpackdb Unpack a DB into separate files
touchdb Preload DB into memory (page cache)
Unite and intersect databases
createsubdb Create a subset of a DB from list of DB keys
concatdbs Concatenate two DBs, giving new IDs to entries from 2nd DB
splitdb Split DB into subsets
mergedbs Merge entries from multiple DBs
subtractdbs Remove all entries from first DB occurring in second DB by key
Format conversion for downstream processing
convertalis Convert alignment DB to BLAST-tab, SAM or custom format
createtsv Convert result DB to tab-separated flat file
convert2fasta Convert sequence DB to FASTA format
result2flat Create flat file by adding FASTA headers to DB entries
createseqfiledb Create a DB of unaligned FASTA entries
taxonomyreport Create a taxonomy report in Kraken or Krona format
Sequence manipulation/transformation
extractorfs Six-frame extraction of open reading frames
extractframes Extract frames from a nucleotide sequence DB
orftocontig Write ORF locations in alignment format
reverseseq Reverse (without complement) sequences
translatenucs Translate nucleotides to proteins
translateaa Translate proteins to lexicographically lowest codons
splitsequence Split sequences by length
masksequence Soft mask sequence DB using tantan
extractalignedregion Extract aligned sequence region from query
Result manipulation
swapresults Transpose prefilter/alignment DB
result2rbh Filter a merged result DB to retain only reciprocal best hits
result2msa Compute MSA DB from a result DB
result2dnamsa Compute MSA DB with out insertions in the query for DNA sequences
result2stats Compute statistics for each entry in a DB
filterresult Pairwise alignment result filter
offsetalignment Offset alignment by ORF start position
proteinaln2nucl Transform protein alignments to nucleotide alignments
result2repseq Get representative sequences from result DB
sortresult Sort a result DB in the same order as the prefilter or align module
summarizealis Summarize alignment result to one row (uniq. cov., cov., avg. seq. id.)
summarizeresult Extract annotations from alignment DB
Taxonomy assignment
createtaxdb Add taxonomic labels to sequence DB
createbintaxonomy Create binary taxonomy from NCBI input
addtaxonomy Add taxonomic labels to result DB
taxonomyreport Create a taxonomy report in Kraken or Krona format
filtertaxdb Filter taxonomy result database
filtertaxseqdb Filter taxonomy sequence database
aggregatetax Aggregate multiple taxon labels to a single label
aggregatetaxweights Aggregate multiple taxon labels to a single label
lcaalign Efficient gapped alignment for lca computation
lca Compute the lowest common ancestor
majoritylca Compute the lowest common ancestor using majority voting
Multi-hit search
multihitdb Create sequence DB for multi hit searches
multihitsearch Search with a grouped set of sequences against another grouped set
besthitperset For each set of sequences compute the best element and update p-value
combinepvalperset For each set compute the combined p-value
mergeresultsbyset Merge results from multiple ORFs back to their respective contig
Prefiltering
prefilter Double consecutive diagonal k-mer search
ungappedprefilter Optimal diagonal score search
kmermatcher Find bottom-m-hashed k-mer matches within sequence DB
kmersearch Find bottom-m-hashed k-mer matches between target and query DB
Alignment
align Optimal gapped local alignment
alignall Within-result all-vs-all gapped local alignment
transitivealign Transfer alignments via transitivity
rescorediagonal Compute sequence identity for diagonal
alignbykmer Heuristic gapped local k-mer based alignment
Clustering
clust Cluster result by Set-Cover/Connected-Component/Greedy-Incremental
clusthash Hash-based clustering of equal length sequences
mergeclusters Merge multiple cascaded clustering steps
Profile databases
result2profile Compute profile DB from a result DB
msa2result Convert a MSA DB to a profile DB
msa2profile Convert a MSA DB to a profile DB
profile2pssm Convert a profile DB to a tab-separated PSSM file
profile2consensus Extract consensus sequence DB from a profile DB
profile2repseq Extract representative sequence DB from a profile DB
convertprofiledb Convert a HH-suite HHM DB to a profile DB
Profile-profile databases
enrich Boost diversity of search result
result2pp Merge two profile DBs by shared hits
profile2cs Convert a profile DB into a column state sequence DB
convertca3m Convert a cA3M DB to a result DB
expandaln Expand an alignment result based on another
expand2profile Expand an alignment result based on another and create a profile
Utility modules to manipulate DBs
view Print DB entries given in --id-list to stdout
apply Execute given program on each DB entry
filterdb DB filtering by given conditions
swapdb Transpose DB with integer values in first column
prefixid For each entry in a DB prepend the entry key to the entry itself
suffixid For each entry in a DB append the entry key to the entry itself
renamedbkeys Create a new DB with original keys renamed
Special-purpose utilities
diffseqdbs Compute diff of two sequence DBs
summarizetabs Extract annotations from HHblits BLAST-tab-formatted results
gff2db Extract regions from a sequence database based on a GFF3 file
maskbygff Mask out sequence regions in a sequence DB by features selected from a GFF3 file
convertkb Convert UniProtKB data to a DB
summarizeheaders Summarize FASTA headers of result DB
nrtotaxmapping Create taxonomy mapping for NR database
extractdomains Extract highest scoring alignment regions for each sequence from BLAST-tab file
countkmer Count k-mers
Bash completion for modules and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".
Include the location of the MMseqs2 binary in your "$PATH" environment variable.
#Version: 13.45111
>mmseqs databases
# mmseqs databases
usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]
Name Type Taxonomy Url
- UniRef100 Aminoacid yes https://www.uniprot.org/help/uniref
- UniRef90 Aminoacid yes https://www.uniprot.org/help/uniref
- UniRef50 Aminoacid yes https://www.uniprot.org/help/uniref
- UniProtKB Aminoacid yes https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL Aminoacid yes https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot Aminoacid yes https://uniprot.org
- NR Aminoacid yes https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT Nucleotide - https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB Aminoacid yes https://gtdb.ecogenomic.org
- PDB Aminoacid - https://www.rcsb.org
- PDB70 Profile - https://github.com/soedinglab/hh-suite
- Pfam-A.full Profile - https://pfam.xfam.org
- Pfam-A.seed Profile - https://pfam.xfam.org
- Pfam-B Profile - https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released
- CDD Profile - https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- eggNOG Profile - http://eggnog5.embl.de
- dbCAN2 Profile - http://bcb.unl.edu/dbCAN2
- SILVA Nucleotide yes https://www.arb-silva.de
- Resfinder Nucleotide - https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari Nucleotide yes https://github.com/lskatz/Kalamari
options:
--compressed INT Write compressed output [0]
--threads INT Number of CPU-cores used (all by default) [12]
-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]
references:
- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)
- Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, 2020.11.27.401018 (2020)
Show an extended list of options by calling 'mmseqs databases -h'.
# Run mmseqs databases -h for the detailed help
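Databases marked yes in the Taxonomy column fully support seqTaxDB, the MMseqs2 database type that carries taxonomy, and include a file mapping each database key to its taxon ID. In other words, once a supported database has been downloaded, it can be used as is to assign taxonomic labels to query sequences. The result database produced by searching sequences against an MMseqs2 seqTaxDB is called a taxonomyResult (from the manual).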
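As a rough illustration of what "assigning taxonomic labels" looks like in practice, here is a minimal sketch built from the taxonomy, createtsv, and taxonomyreport modules listed in the help above. The function name assign_taxonomy and the output file suffixes are hypothetical, and passing --search-type 3 to taxonomy is an assumption carried over from the nucleotide search used later in this post.

```shell
# Sketch: classify every sequence of a query DB against a seqTaxDB.
# Arguments: queryDB, seqTaxDB, output prefix (taxonomyResult), tmp directory.
# The function name and the .tsv / .report suffixes are illustrative only.
assign_taxonomy() {
    query_db=$1
    tax_db=$2
    out=$3
    tmp_dir=$4

    mkdir -p "$tmp_dir" || return 1

    # Search the seqTaxDB and compute a taxon for each query.
    # --search-type 3 (nucleotide vs. nucleotide) is an assumption here.
    mmseqs taxonomy "$query_db" "$tax_db" "$out" "$tmp_dir" --search-type 3 || return 1

    # Write the taxonomyResult as a tab-separated file, one row per query.
    mmseqs createtsv "$query_db" "$out" "$out.tsv" || return 1

    # Summarize the assignments as a Kraken-style report.
    mmseqs taxonomyreport "$tax_db" "$out" "$out.report" || return 1
}
```

With the databases built in the steps below, this would be called as: assign_taxonomy queryDB SILVA_database taxonomyResult tmp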
How to run
1. Download the database
Here I use SILVA, which supports seqTaxDB and is small.
mkdir tmp
mmseqs databases SILVA SILVA_database tmp
Output
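The download leaves a set of files sharing the SILVA_database prefix in the working directory. As a quick sanity check before searching, the sketch below tests for the three core files of an MMseqs2 database (the data file, its .index, and its .dbtype). The function name db_files_present is hypothetical, and the check assumes a single, unsplit data file.

```shell
# Sketch: return 0 if the core files of an MMseqs2 database exist and are non-empty.
# Assumes the data file is stored unsplit under the database name itself.
db_files_present() {
    db=$1
    for f in "$db" "$db.index" "$db.dbtype"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    return 0
}
```

For example: db_files_present SILVA_database && echo "SILVA_database looks complete"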
2. Convert the query sequences (nucleotide sequences here) into a database.
mmseqs createdb inout.fasta queryDB
To speed up loading, you can additionally precompute an index for the target DB (mmseqs createindex SILVA_database tmp --search-type 3).
3. Create a temporary directory (it sees heavy I/O, so use an SSD or similar for large searches; tmp must also have enough free space).
Here it is created in the current directory.
mkdir -p tmp
4. Search the nucleotide queries against the nucleotide database.
mmseqs search queryDB SILVA_database resultDB tmp --search-type 3
- -s sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive (default 5.7)
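To see what the -s setting does on your own data, one option is to rerun the search at a given sensitivity and count the hits. The sketch below is only an illustration: the function name search_at_sensitivity and the resultDB_s... / result_s....m8 output names are hypothetical, and it assumes -s is honored for this nucleotide search, which this post does not verify.

```shell
# Sketch: run the nucleotide search at one sensitivity and print the number of hits.
# Uses queryDB, SILVA_database, and tmp from the steps above; output names are illustrative.
search_at_sensitivity() {
    sens=$1
    res="resultDB_s$sens"
    m8="result_s$sens.m8"

    # Send the MMseqs2 log to stderr so that stdout carries only the hit count.
    mmseqs search queryDB SILVA_database "$res" tmp --search-type 3 -s "$sens" >&2 || return 1
    mmseqs convertalis queryDB SILVA_database "$res" "$m8" >&2 || return 1

    # One line per alignment in the BLAST-tab file.
    wc -l < "$m8"
}
```

For example, search_at_sensitivity 1.0 followed by search_at_sensitivity 7.5 prints one hit count per run; whether the counts differ depends on the data.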
5. Convert the result database into a BLAST tab-formatted file.
mmseqs convertalis queryDB SILVA_database resultDB result.m8
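result.m8 typically contains several hits per query. As a minimal sketch of downstream processing, the function below keeps only the hit with the highest bit score for each query, assuming the default BLAST-tab layout in which column 1 is the query ID and column 12 is the bit score. The function name best_hits is hypothetical.

```shell
# Sketch: keep the highest-bit-score hit per query from a BLAST-tab (m8) file.
# Assumes column 1 = query ID and column 12 = bit score; on ties the first hit seen is kept.
best_hits() {
    awk -F'\t' '
        !($1 in best) || $12 + 0 > best[$1] { best[$1] = $12 + 0; line[$1] = $0 }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}
```

For example: best_hits result.m8 > best_hits.m8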
References
*1
Fast and sensitive taxonomic assignment to metagenomic contigs
M Mirdita, M Steinegger, F Breitwieser, J Söding, E Levy Karin
Bioinformatics, Published: 18 March 2021
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Steinegger M, Söding J
Nat Biotechnol. 2017 Nov;35(11):1026-1028
MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.
Hauser M, Steinegger M, Söding J
Bioinformatics. 2016 May 1;32(9):1323-30.