2018-09-22

高速かつ高感度なプロテイン検索ツール MMseqs2

2019 3/18　タイトル修正

2019 5/7　タイトル修正

2019 5/20 インストール追記

2019 8/25 twtwitter追記

2019 12/29, 2021 9/15インストール追記

　DNAシーケンシングのスループットは、過去10年間で計算速度よりもはるかに速くなってきており、感度の高いシーケンス検索は、ラージメタゲノムデータセットの分析における主要なボトルネックになっている。それゆえ、著者らは、速度と感度のトレードオフの全レンジににわたって現在の検索ツールを改良し、400以上の倍の速度で、PSI-BLASTよりも良好な感度を達成し400倍以上高速なMMseqs2を開発した。

　2007年以来、シーケンシングコストが4桁低下した結果、テラバイトの配列を生産する大規模なメタゲノムプロジェクトが、医療、生物工学、微生物学、農業研究に応用されて多数行われている（論文より ref.1,2,3,4）。計算分析の中心的なステップは、データベース内の類似の配列を検索することによってオープンリーディングフレームの機能を推論するアノテーションである。メタゲノミクスでは、現在、計算コストがシーケンシングコストに対して支配的になっており(ref.5,6,7)、その中でタンパク質検索は、感度は高いが低速なBLASTがはるかに高速な検索ツールで置き換えられているにもかかわらず、一般に計算資源の90％を消費している（ref,9,10,11,12）。しかし、それらのツールのスピードは感度低下を犠牲にして行われている。メタゲノミクスとメタトランスクリプトミクス研究で見つかる多くの種は、よくアノテーションづけられたゲノムとはclosely relatedではないので、unannotatableシーケンスの割合は、多くの場合、65から90パーセントに昇る（ref.2,13）。シーケンシングと計算コストとのギャップ拡大により、すぐにこの問題は悪化する。

　この課題に対処するために、著者らはープンソースの並列化ソフトウエアスイートMMseqs2を開発した。MMseqs2は以前のMMseqs（ref.14 pubmed）と比較してはるかに高感度で、反復的なプロファイル間シーケンスとプロファイル間検索をサポートし、さらに強化された機能を提供する（論文のSupplementary Table 1）。

　MMseqs2のサーチは３つのステージ、短いワード（ 'k-mer'）マッチステージ、ベクトル化されたungappedアライメント、およびgapped（Smith-Waterman）アライメントという3つのステージから構成されている（論文図1a）。第1段階は、改善されたパフォーマンスにとって重要である。与えられたクエリ配列に対して、同じ対角線上に2つの連続する類似のk-mer一致を有するすべての標的配列を見つける（論文図1b）。連続的なk-merの一致は、相同配列の場合、同じ対角線上にあることがよくあるが（それらの間にアライメントギャップがない場合）、偶然に起こる可能性は低い。ほとんどの高速なツールは正確なk-merマッチのみを検出するが、MMseqs2、MMseqs、BLASTは類似k-mer間のrマッチも見つける。このような類似k-mer照合は、MMseqs2が、類似性設定に応じて、多数の同様のk-merを生成することによって（１クエリに対して600 から 60,000）感度を失うことなく大きな単語サイズk = 7を使用することを可能にする。 MMseqs2の速度については、最もの内側のループ4（Supplementary Information Figure S 1 のマゼンタ色のフレーム内のループ）の最後の行のランダムメモリアクセスを排除する方法を発見することが重要だった。

f:id:kazumaxneo:20180922182940p:plain

The average AUC sensitivity versus speed-up factor relative to BLAST for 637,000 test searches. The sensitivity is called area under the curve (AUC).

ユーザーガイドより転載。AUCの定義についてはSupplementary Information参照。

MMseqs2/Linclust updates by following Martin

We use a lot of SSE and AVX2 code in MMseqs2 https://t.co/mM7anqG0HL. I have also optimized the Viterbi algorithm of HHsuite https://t.co/U1KYkQUl2N. The speed up was 4 (SSSE3) to 8 (AVX2) times over the single instruction version.
— Martin Steinegger (@thesteinegger) December 20, 2018

2019 8/25

New MMseqs2 release https://t.co/TyYnm0l1LY
Highlights: @fbreitw's taxonomy reports work with Pavian(https://t.co/tUpYKKYxfU); @milot_mirdita's split indices; Eli improved MPI support; reduced memory and disk space; multiple FASTA/Q files in easy workflows; better command line UX pic.twitter.com/AXNV5TYbNL
— Martin Steinegger (@thesteinegger) August 23, 2019

2020 9/4

New MMseqs2 (12-113e3) release at a glance:
- faster clustering thanks to ips4o sort
- full nucleotide clustering support
- new databases SILVA, Pfam-B, dbCAN2
- SSE2/ARM64 and PPC64LE support
Newest version is already in Conda and Debian (unstable)https://t.co/3yiK21B4CV
— Martin Steinegger (마틴 스타이네거) (@thesteinegger) 2020年9月4日

ユーザーガイド

https://github.com/soedinglab/mmseqs2/wiki#set-sensitivity--s-parameter

インストール

mac os 10.13でテストした。

MMseqs can be installed by compiling the binary from source, download a statically compiled version, using Homebrew or Docker.

MMseqs2 Githubレポジトリ

ソースからのビルドに必要なもの

git
g++ (4.6 or higher)
cmake (3.0 or higher)

#ソースからビルドする 
git clone https://github.com/soedinglab/MMseqs2.git
cd MMseqs2
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make -j 16
make install 
export PATH=$(pwd)/bin/:$PATH

#bioconda (link)
mamba install -c conda-forge -c bioconda mmseqs2

#docker
docker pull soedinglab/mmseqs2

他に、homebrewによるインストール、staticバイナリが利用できる。手順はMMseqs2レポジトリのREADME.md参照。

> mmseqs

$ mmseqs

MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast,

parallelized protein sequence searches and clustering of huge protein sequence data sets.

Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).

MMseqs2 Version: aa4e079915dcfaba34fa58206144180b64ef6258

Easy workflows (for non-experts)

easy-search Search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step

easy-linclust Compute clustering of a fasta database in linear time. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.

easy-cluster Compute clustering of a fasta database. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.

Main tools (for non-experts)

createdb Convert protein sequence set in a FASTA file to MMseqs sequence DB format

search Search with query sequence or profile DB (iteratively) through target sequence DB

map Fast ungapped mapping of query sequences to target sequences.

cluster Compute clustering of a sequence DB (quadratic time)

linclust Cluster sequences of >30% sequence identity *in linear time*

createindex Precompute index table of sequence DB for faster searches

clusterupdate Update clustering of old sequence DB to clustering of new sequence DB

Utility tools for format conversions

createtsv Create tab-separated flat file from prefilter DB, alignment DB, cluster DB, or taxa DB

convertalis Convert alignment DB to BLAST-tab format, SAM flat file, or to raw pairwise alignments

convertprofiledb Convert ffindex DB of HMM/HMMER3/PSSM files to MMseqs profile DB

convert2fasta Convert sequence DB to FASTA format

result2flat Create a FASTA-like flat file from prefilter DB, alignment DB, or cluster DB

createseqfiledb Create DB of unaligned FASTA files (1 per cluster) from sequence DB and cluster DB

Taxonomy tools

taxonomy Compute taxonomy and lowest common ancestor for each sequence.

lca Compute the lowest common ancestor from a set of taxa.

An extended list of all tools can be obtained by calling 'mmseqs -h'.

Bash completion for tools and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".

Include the location of the MMseqs2 binary is in your "$PATH" environment variable.

> mmseqs createdb -h

$ mmseqs createdb -h

mmseqs createdb:

converts a protein sequence flat/gzipped FASTA or FASTQ file to the MMseqs sequence DB format. This format is needed as input to mmseqs search, cluster and many other tools.

Please cite:

Steinegger, M. & Soding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

Usage: <i:fastaFile1[.gz]> ... <i:fastaFileN[.gz]> <o:sequenceDB> [options]

misc options default description [value range]

--dont-split-seq-by-len true Dont split sequences by --max-seq-len

--dont-shuffle true Do not shuffle input database

--id-offset 0 numeric ids in index file are offset by this value

common options default description [value range]

--max-seq-len 65535 Maximum sequence length [1,32768]

-v 3 verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info

> mmseqs search -h

$ mmseqs search -h

mmseqs search:

Searches with the sequences or profiles query DB through the target sequence DB by running the prefilter tool and the align tool for Smith-Waterman alignment. For each query a results file with sequence matches is written as entry into a database of search results (alignmentDB).

In iterative profile search mode, the detected sequences satisfying user-specified criteria are aligned to the query MSA, and the resulting query profile is used for the next search iteration. Iterative profile searches are usually much more sensitive than (and at least as sensitive as) searches with single query sequences.

Please cite:

Steinegger, M. & Soding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

Usage: <i:queryDB> <i:targetDB> <o:alignmentDB> <tmpDir> [options]

prefilter options default description [value range]

--comp-bias-corr 1 correct for locally biased amino acid composition [0,1]

--add-self-matches false artificially add entries of queries with themselves (for clustering)

-s 5.700 sensitivity: 1.0 faster; 4.0 fast default; 7.5 sensitive [1.0,7.5]

-k 0 k-mer size in the range [6,7] (0: set automatically to optimum)

--k-score 2147483647 k-mer threshold for generating similar-k-mer lists

--alph-size 21 alphabet size [2,21]

--offset-result 0 Offset result list

--split 0 Splits input sets into N equally distributed chunks. The default value sets the best split automatically. createindex can only be used with split 1.

--split-mode 2 0: split target db; 1: split query db; 2: auto, depending on main memory

--split-memory-limit 0 Maximum system memory in megabyte that one split may use. Defaults (0) to all available system memory.

--diag-score 1 use diagonal score for sorting the prefilter results [0,1]

--exact-kmer-matching 0 only exact k-mer matching [0,1]

--mask 1 0: w/o low complexity masking, 1: with low complexity masking

--min-ungapped-score 15 accept only matches with ungapped alignment score above this threshold

--spaced-kmer-mode 1 0: use consecutive positions a k-mers; 1: use spaced k-mers

align options default description [value range]

-a false add backtrace string (convert to alignments with mmseqs convertalis utility)

--alignment-mode 2 How to compute the alignment: 0: automatic; 1: only score and end_pos; 2: also start_pos and cov; 3: also seq.id; 4: only ungapped alignment

-e 0.001 list matches below this E-value [0.0, inf]

--min-seq-id 0.000 list matches above this sequence identity (for clustering) [0.0,1.0]

--seq-id-mode 0 0: alignment length 1: shorter, 2: longer sequence

--alt-ali 0 Show up to this many alternative alignments

-c 0.000 list matches above this fraction of aligned (covered) residues (see --cov-mode)

--cov-mode 0 0: coverage of query and target, 1: coverage of target, 2: coverage of query 3: target seq. length needs be at least x% of query length

--realign false compute more conservative, shorter alignments (scores and E-values not changed)

--max-rejected 2147483647 maximum rejected alignments before alignment calculation for a query is aborted

--max-accept 2147483647 maximum accepted alignments before alignment calculation for a query is stopped

--score-bias 0.000 Score bias when computing the SW alignment (in bits)

profile options default description [value range]

--pca 1.000 pseudo count admixture strength

--pcb 1.500 pseudo counts: Neff at half of maximum admixture (0.0,infinity)

--mask-profile 1 mask query sequence of profile using tantan [0,1]

--e-profile 0.001 includes sequences matches with < e-value thr. into the profile [>=0.0]

--wg false use global sequence weighting for profile calculation

--filter-msa 1 filter msa: 0: do not filter, 1: filter

--max-seq-id 0.900 reduce redundancy of output MSA using max. pairwise sequence identity [0.0,1.0]

--qid 0.000 reduce diversity of output MSAs using min.seq. identity with query sequences [0.0,1.0]

--qsc -20.000 reduce diversity of output MSAs using min. score per aligned residue with query sequences [-50.0,100.0]

--cov 0.000 filter output MSAs using min. fraction of query residues covered by matched sequences [0.0,1.0]

--diff 1000 filter MSAs by selecting most diverse set of sequences, keeping at least this many seqs in each MSA block of length 50

--num-iterations 1 Search iterations

misc options default description [value range]

--no-preload false Do not preload database

--rescore-mode 0 Rescore diagonal with: 0: Hamming distance, 1: local alignment (score only) or 2: local alignment

--min-length 30 minimum codon number in open reading frames

--max-length 32734 maximum codon number in open reading frames

--max-gaps 2147483647 maximum number of codons with gaps or unknown residues before an open reading frame is rejected

--contig-start-mode 2 Contig start can be 0: incomplete, 1: complete, 2: both

--contig-end-mode 2 Contig end can be 0: incomplete, 1: complete, 2: both

--orf-start-mode 0 Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle)

--forward-frames 1,2,3 comma-seperated list of ORF frames on the forward strand to be extracted

--reverse-frames 1,2,3 comma-seperated list of ORF frames on the reverse strand to be extracted

--translation-table 1 1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE, 9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL, 15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL, 23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA, 29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA

--use-all-table-starts false use all alteratives for a start codon in the genetic table, if false - only ATG (AUG)

--id-offset 0 numeric ids in index file are offset by this value

--add-orf-stop false add * at complete start and end

--start-sens 4.000 start sensitivity

--sens-steps 1 Search steps performed from --start-sense and -s.

--slice-search false For bigger profile DB, run iteratively the search by greedily swapping the search results.

common options default description [value range]

--sub-mat blosum62.out amino acid substitution matrix file

--max-seq-len 65535 Maximum sequence length [1,32768]

--max-seqs 300 maximum result sequences per query (this parameter affects the sensitivity)

--threads 1 number of cores used for the computation (uses all cores by default)

-v 3 verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info

> mmseqs convertalis -h

$ mmseqs convertalis -h

mmseqs convertalis:

Convert alignment DB to BLAST-tab format, SAM flat file, or to raw pairwise alignments

Please cite:

Steinegger, M. & Soding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

Usage: <i:queryDb> <i:targetDb> <i:alignmentDB> <o:alignmentFile> [options]

misc options default description [value range]

--format-mode 0 output format 0: BLAST-TAB, 1: PAIRWISE, 2: BLAST-TAB + query/db length

--no-preload false Do not preload database

--db-output false Output a result db instead of a text file

common options default description [value range]

--threads 1 number of cores used for the computation (uses all cores by default)

-v 3 verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info

実行方法

元々、クラスター環境での大規模なデータベース構築を想定して設計されているが、このブログではローカル環境でのコマンド実行方法のみ紹介する。

検索するには、まずクエリとデータベースそれぞれの配列をmmseq2のformatに変換する必要がある。テストファイルに対して実行してみる。

cd MMseqs2/examples/

#クエリのアミノ酸配列DBファイル作成
mmseqs createdb QUERY.fasta queryDB

#データベースのアミノ酸配列DBファイル作成
mmseqs createdb DB.fasta targetDB

クエリとデータベースのファイルを指定してホモロジーサーチを実行する。tmpは作業ディレクトリ。巨大なデータベースを使う場合、十分な容量とI/Oの高速なストレージの利用が推奨されている（詳細はユーザーガイド参照）。

mmseqs search queryDB targetDB resultDB tmp

-s sensitivity: 1.0 faster; 4.0 fast default; 7.5 sensitive (default 5.7)

感度は-sで1.0から7.5まで変更できる（マニュアルの説明）。目的に合わせて変更する。上の表にもあるように、感度により計算時間も変わる。

$ mmseqs createdb examples/DB.fasta targetDB

Program call:

createdb examples/DB.fasta targetDB

MMseqs Version: aa4e079915dcfaba34fa58206144180b64ef6258

Max. sequence length 65535

Split Seq. by len true

Do not shuffle input database true

Offset of numeric ids 0

Verbosity 3

.Time for merging files: 0h 0m 0s 3ms

Time for merging files: 0h 0m 0s 4ms

Time for merging files: 0h 0m 0s 3ms

Time for processing: 0h 0m 0s 126ms

結果をblast様のタブ仕分けファイルに変換する。

mmseqs convertalis queryDB targetDB resultDB output.txt --format-mode 0

--format-mode 0 output format 0: BLAST-TAB, 1: PAIRWISE, 2: BLAST-TAB + query/db length

他にも幾つか重要なコマンドがあります。機会を見て紹介します。

2019 6/28 追記

2019 12/29 追記

blastxサーチ

#step1 database
#クエリの塩基配列のDB
mmseqs createdb ecoli.fna ecoli_genome
#データベースのアミノ酸配列のDB
mmseqs createdb ecoli.faa ecoli_proteins


#step2 search
#BLASTX search
mmseqs search ecoli_genome ecoli_proteins alnDB tmp
#TBLASTN search
mmseqs search ecoli_proteins ecoli_genome alnDB tmp
#TBLASTX search
mmseqs search genome_orfs_aa ecoli_genome alnDB tmp --search-type 2

引用

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Steinegger M, Söding J

Nat Biotechnol. 2017 Nov;35(11):1026-1028

MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.
Hauser M, Steinegger M, Söding J

Bioinformatics. 2016 May 1;32(9):1323-30.

Clustering huge protein sequence sets in linear time

Martin Steinegger & Johannes Söding

Nat Commun. 2018; 9: 2542.