mm2-fast, an accelerated version of minimap2 - macでインフォマティクス

2022/06/14 Added tweet

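Long-read sequencing is routinely used in genomics and transcriptomics. Mapping long reads or draft genome assemblies to a reference sequence is one of the most time-consuming steps in these applications. This work presents techniques for accelerating minimap2, a widely used mapping tool. Multiple optimizations based on SIMD parallelization, efficient cache utilization, and a learned index data structure speed up its three main computational modules (seeding, chaining, and pairwise sequence alignment). With these optimizations, the end-to-end mapping time of minimap2 was reduced by up to 3.5x while maintaining identical output.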

2022/09/02

Excited to share a blog summarizing our work on mm2-fast - our accelerated version of minimap2 that is up to 1.8x faster: https://t.co/cLypctu56F. This work also got published at "Nature Computational Science" earlier this year. Code availability https://t.co/LXWKHBJPEM #Minimap2
— Saurabh Kalikar (@Saurabh_Kalikar) September 1, 2022

Pre-release of mm2-fast - our accelerated version of minimap2 - is out! Compatible with minimap2 v2.22. Working on another release that would be compatible with latest minimap2 version (v2.24). #genomics #longreads #Minimap2 #HPC @Saurabh_Kalikar @chirgjain @wasim_galaxy @lh3lh3 pic.twitter.com/XHHFwbiKRy
— Sanchit Misra (I am hiring) (@sanchit_misra) June 15, 2022

I am thrilled to present mm2-fast: an accelerated version of Minimap2 that achieves up to 3.5x speedup on CPUs while maintaining identical output. Code: https://t.co/yjjNhF41EK #genomics #longreads #Minimap2 #HPC @Saurabh_Kalikar @chirgjain @wasim_galaxy @lh3lh3 https://t.co/o2u8oyCXrm
— Sanchit Misra (@sanchit_misra) July 24, 2021

From the GitHub repository

mm2-fast is an accelerated implementation of minimap2 for modern CPUs. It accelerates all three major modules of minimap2, (a) seeding, (b) chaining, and (c) pairwise alignment, and achieves up to a 3.5x speedup over minimap2-v2.18. In the current version, all modules are optimized with AVX-512 vectorization. Detailed benchmark results are available in our preprint.

Installation

I built it under WSL2 with AVX2 enabled (3700X).

Operating System: Linux
mm2-fast was tested using g++ (GCC) 9.2.0 and icpc version 19.1.3.304
Architecture: x86_64 CPUs with AVX512
Memory requirement: ~30GB for human genome
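
The requirements above name AVX512, while the build in this post targets AVX2, so it can help to check which SIMD extensions the CPU actually reports before building. The following is a minimal sketch, assuming a Linux system that exposes CPU flags in /proc/cpuinfo. The function name simd_level is hypothetical and not part of mm2-fast, and using avx512f as the indicator for AVX-512 is an assumption; the exact AVX-512 subsets mm2-fast needs are not checked here.

```shell
#!/bin/sh
# simd_level: print the widest SIMD extension found in a CPU flag list.
# With no argument, the flag list is read from /proc/cpuinfo (Linux only).
# Assumption: avx512f is used as a rough indicator of AVX-512 support.
simd_level() {
    flags="${1:-$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null)}"
    # Pad with spaces so each flag is matched as a whole word.
    case " $flags " in
        *" avx512f "*) echo "avx512" ;;
        *" avx2 "*)    echo "avx2" ;;
        *" sse4_1 "*)  echo "sse4.1" ;;
        *)             echo "unknown" ;;
    esac
}

# Report the level for the current machine.
simd_level
```

If this reports avx2 rather than avx512, the AVX2 build path described below is the relevant one.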

GitHub: the fast-contrib branch of the minimap2 GitHub repository

git clone --recursive https://github.com/lh3/minimap2.git -b fast-contrib mm2-fast 
cd mm2-fast/

## Enable optimized seeding and AVX2 based alignment for AVX2 systems
make clean && make -j

$ ./minimap2

Using default hash lookup.
Using default chaining.
Using default SSE-vectorized alignment.

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
  Indexing:
    -H           use homopolymer-compressed k-mer (preferrable for PacBio)
    -k INT       k-mer size (no larger than 28) [15]
    -w INT       minimizer window size [10]
    -I NUM       split index for every ~NUM input bases [4G]
    -d FILE      dump index to FILE
  Mapping:
    -f FLOAT     filter out top FLOAT fraction of repetitive minimizers [0.0002]
    -g NUM       stop chain enlongation if there are no minimizers in INT-bp [5000]
    -G NUM       max intron length (effective with -xsplice; changing -r) [200k]
    -F NUM       max fragment length (effective with -xsr or in the fragment mode) [800]
    -r NUM       bandwidth used in chaining and DP-based alignment [500]
    -n INT       minimal number of minimizers on a chain [3]
    -m INT       minimal chaining score (matching bases minus log gap penalty) [40]
    -X           skip self and dual mappings (for the all-vs-all mode)
    -p FLOAT     min secondary-to-primary score ratio [0.8]
    -N INT       retain at most INT secondary alignments [5]
  Alignment:
    -A INT       matching score [2]
    -B INT       mismatch penalty [4]
    -O INT[,INT] gap open penalty [4,24]
    -E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
    -z INT[,INT] Z-drop score and inversion Z-drop score [400,200]
    -s INT       minimal peak DP alignment score [80]
    -u CHAR      how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]
  Input/Output:
    -a           output in the SAM format (PAF by default)
    -o FILE      output alignments to FILE [stdout]
    -L           write CIGAR with >65535 ops at the CG tag
    -R STR       SAM read group line in a format like '@RG\tID:foo\tSM:bar'
    -c           output CIGAR in PAF
    --cs[=STR]   output the cs tag; STR is 'short' (if absent) or 'long' [none]
    --MD         output the MD tag
    --eqx        write =/X CIGAR operators
    -Y           use soft clipping for supplementary alignments
    -t INT       number of threads [3]
    -K NUM       minibatch size for mapping [500M]
    --version    show version number
  Preset:
    -x STR       preset (always applied before other options; see minimap2.1 for details) []
                 - map-pb/map-ont - PacBio/Nanopore vs reference mapping
                 - ava-pb/ava-ont - PacBio/Nanopore read overlap
                 - asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
                 - splice/splice:hq - long-read/Pacbio-CCS spliced alignment
                 - sr - genomic short-read mapping

See `man ./minimap2.1' for detailed description of these and other advanced command-line options.

The default build with make applies two of the optimizations (see GitHub).

Test run

./minimap2 -ax map-ont test/MT-human.fa test/MT-orang.fa --max-chain-skip=1000000 > minimap2_output
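
Because mm2-fast is described as producing output identical to minimap2, one way to sanity-check a build is to run the same command with a stock minimap2 binary and compare the two SAM files. A minimal sketch follows. The helper name same_alignments and the file name stock_output are hypothetical, and the sketch assumes the only expected difference is the @PG header line, which records the program version and command line.

```shell
#!/bin/sh
# same_alignments: succeed (exit 0) if two SAM files match once @PG header
# lines (program name, version, command line) are ignored.
same_alignments() {
    a=$(mktemp) && b=$(mktemp) || return 2
    # Drop @PG lines from both files before comparing.
    grep -v '^@PG' "$1" > "$a"
    grep -v '^@PG' "$2" > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}

# Example (stock_output is a hypothetical file produced by running the
# same command with an unmodified minimap2 build):
#   same_alignments minimap2_output stock_output && echo "identical"
```

Comparing with cmp rather than eyeballing a diff keeps the check usable in a script, since the function's exit status says whether the alignments match.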

Citation

Accelerating long-read analysis on modern CPUs
Saurabh Kalikar, Chirag Jain, Vasimuddin Md, Sanchit Misra

bioRxiv, Posted July 23, 2021

Accelerating minimap2 for long-read sequencing applications on modern CPUs
Saurabh Kalikar, Chirag Jain, Md Vasimuddin & Sanchit Misra
Nature Computational Science volume 2, pages 78–83 (2022)