strobealign: faster short-read mapping

2022/04/15: Installation instructions updated

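Aligning short reads to a genome is a fundamental computational step in many bioinformatics analyses, so it is desirable to perform it as quickly as possible. Most alignment algorithms use the seed-and-extend approach. Several popular programs perform the seeding step with the Burrows-Wheeler transform; they use little memory but are relatively slow compared with more recent approaches that use a minimizer-based seeding-and-chaining strategy. Recently, syncmers and strobemers were proposed for sequence comparison. Both techniques were designed to improve the conservation of matches between sequences that have undergone mutations. Syncmers are a k-mer subsampling method, whereas strobemers link gapped subsequences and were proposed as an alternative to k-mers. The main contribution of this work is a new seeding technique that combines syncmers and strobemers. Instead of considering every k-mer in the sequence, as in the original strobemer proposal, the strobemer method (randstrobes) is used to link syncmers. This produces long seeds that reduce the number of hits while preserving mapping accuracy, which makes fast mapping possible. The second contribution is the implementation of this seeding method in a short-read aligner, Strobealign. Strobealign supports both single-end and paired-end alignment modes. It aligns single-end reads about 2-3 times faster than minimap2 and 12-15 times faster than BWA and Bowtie2, at comparable accuracy. In paired-end mode, it aligns reads of 200 nt or longer more than 3 times faster than minimap2 at nearly the same accuracy, and about 10 times faster than BWA and Bowtie2 at a cost of 0.1-0.2% in accuracy. The contribution is algorithmic and does not require any particular hardware architecture or system-specific instructions. The authors expect that the seeding technique can also be applied to other mapping applications, such as long-read mapping and clustering. Strobealign is available at https://github.com/ksahlin/strobealign.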
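The seeding idea described above can be illustrated with a small Python sketch. It is not Strobealign's implementation: the CRC32 hash, the syncmer offset `t`, the window bounds `w_min` and `w_max`, and the linking rule `(h(m1) + h(m2)) % q` are all assumptions chosen for illustration. The sketch first selects open syncmers (k-mers whose smallest s-mer sits at a fixed offset) and then links each syncmer to one syncmer further downstream, in the style of randstrobes.

```python
import zlib


def h(s: str) -> int:
    """Deterministic 32-bit hash (CRC32); a stand-in for a real k-mer hash."""
    return zlib.crc32(s.encode())


def open_syncmers(seq: str, k: int, s: int, t: int):
    """Return (position, k-mer) for every k-mer whose smallest s-mer hash is at offset t."""
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smer_hashes = [h(kmer[j:j + s]) for j in range(k - s + 1)]
        # index() returns the first minimum, so ties are broken towards the left
        if smer_hashes.index(min(smer_hashes)) == t:
            out.append((i, kmer))
    return out


def randstrobes(syncmers, w_min: int, w_max: int, q: int = 2**31 - 1):
    """Link each syncmer to one syncmer located w_min..w_max entries downstream
    in the syncmer list, picking the candidate that minimises a joint hash."""
    seeds = []
    for i, (pos1, m1) in enumerate(syncmers):
        window = syncmers[i + w_min:i + w_max + 1]
        if not window:
            continue  # too close to the end of the sequence to form a seed
        pos2, m2 = min(window, key=lambda c: (h(m1) + h(c[1])) % q)
        seeds.append((pos1, pos2, m1 + m2))
    return seeds
```

Because the second strobe is chosen only among syncmers, the resulting seed spans a longer stretch of sequence than a single k-mer while the number of seeds stays at most the number of syncmers.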

6/4

Long-term bioinformatics support was granted to @krsahlin! The 500 hours will be spent by @marcelm_ and colleagues to improve strobealign.https://t.co/z7jvW2Qc1D
— NBIS (@NBISwe) June 3, 2022

I have released strobealign v0.6. Strobealign now supports secondary alignments and has better accuracy. Full release notes for v0.5 and v0.6 at https://t.co/5fkZulS5eP. A small SNV and indel calling benchmark with bcftools here: https://t.co/vXCZvlNqfK. [1/2]
— Kristoffer Sahlin (@krsahlin) February 20, 2022

I've made several updates to the short-read aligner strobealign, described in a new version of the preprint. If you have short reads to align, I would be happy to receive feedback [1/4] https://t.co/LMf3aYOnOx
— Kristoffer Sahlin (@krsahlin) November 8, 2021

Installation

I built it from source on Ubuntu 18.

GitHub

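# Old method: download a prebuilt binary
# (a /tree/ URL serves GitHub's HTML page rather than the file itself; if the download is not an executable, use the corresponding /raw/ URL)
wget https://github.com/ksahlin/StrobeAlign/tree/main/bin/Linux/StrobeAlign-v0.0.3.1
# Rename and make executable
mv StrobeAlign-v0.0.3.1 strobealign
chmod +x strobealign
./strobealign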

# Installation from source (replace /path/to/zlib with your zlib install prefix)
git clone https://github.com/ksahlin/StrobeAlign
cd StrobeAlign
g++ -std=c++14 -I/path/to/zlib/include -L/path/to/zlib/lib main.cpp source/index.cpp source/ksw2_extz2_sse.c source/xxhash.c source/ssw_cpp.cpp source/ssw.c source/pc.cpp source/aln.cpp -lz -lpthread -o strobealign -O3 -mavx2

> ./strobealign

StrobeAlign VERSION 0.7

StrobeAlign [options] <ref.fa> <reads1.fast[a/q.gz]> [reads2.fast[a/q.gz]]

options:

Resources:

-t INT number of threads [3]

Input/output:

-o STR redirect output to file [stdout]

-x Only map reads, no base level alignment (produces paf file)

-N INT retain at most INT secondary alignments (is upper bounded by -M, and depends on -S) [0]

-L STR Print statistics of indexing to logfile [log.csv]

Seeding:

-r INT Mean read length. This parameter is estimated from first 500 records in each read file. No need to set this explicitly unless you have a reason. [disabled]

-m INT Maximum seed length. Defaults to r - 50. For reasonable values on -l and -u, the seed length distribution is usually determined by parameters l and u. Then, this parameter is only active in regions where syncmers are very sparse.

-k INT strobe length, has to be below 32. [20]

-l INT Lower syncmer offset from k/(k-s+1). Start sample second syncmer k/(k-s+1) + l syncmers downstream [0]

-u INT Upper syncmer offset from k/(k-s+1). End sample second syncmer k/(k-s+1) + u syncmers downstream [7]

-c INT Bitcount length between 2 and 63. [8]

-s INT Submer size used for creating syncmers [k-4]. Only even numbers on k-s allowed. A value of s=k-4 roughly represents w=10 as minimizer window [k-4]. It is recommended not to change this parameter unless you have a good understanding of syncmers as it will drastically change the memory usage and results with non default values.

Alignment:

-A INT matching score [2]

-B INT mismatch penalty [8]

-O INT gap open penalty [12]

-E INT gap extension penalty [1]

Search parameters:

-f FLOAT top fraction of repetitive strobemers to filter out from sampling [0.0002]

-S FLOAT Try candidate sites with mapping score at least S of maximum mapping score [0.5]

-M INT Maximum number of mapping sites to try [20]

-R INT Rescue level. Perform additional search for reads with many repetitive seeds filtered out. This search includes seeds of R*repetitive_seed_size_filter (default: R=2). Higher R than default makes StrobeAlign significantly slower but more accurate. R <= 1 deactivates rescue and is the fastest.
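The -l and -u descriptions in the help output above are easier to read with the numbers filled in. The following Python sketch only evaluates the expression quoted in the help text; the function name is hypothetical, and treating k/(k-s+1) as integer division is an assumption.

```python
def second_strobe_window(k: int = 20, s: int = 16, l: int = 0, u: int = 7):
    """Return (first, last) offsets, counted in syncmers downstream of the first
    strobe, between which the second strobe is sampled.

    Follows the wording of the -l / -u help text. Assumption: k/(k-s+1) is
    evaluated with integer division.
    """
    base = k // (k - s + 1)
    return base + l, base + u


# Defaults from the help output: k=20, s=k-4=16, l=0, u=7
print(second_strobe_window())  # (4, 11)
```

With the defaults, the second strobe is therefore picked between 4 and 11 syncmers downstream of the first one.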

Usage

Specify the reference and the short reads.

strobealign ref.fa reads.fa > output.sam
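A quick way to check the result is to count mapped and unmapped reads in the SAM output. The Python sketch below relies only on the SAM FLAG field (0x4 unmapped, 0x100 secondary, 0x800 supplementary); the function name is hypothetical and `output.sam` is the file name from the command above.

```python
def count_mapped(sam_lines):
    """Count (mapped, unmapped) primary records from an iterable of SAM lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # second column is the bitwise FLAG
        if flag & 0x4:
            unmapped += 1
        elif not (flag & 0x900):
            # count primary alignments only; 0x100 = secondary, 0x800 = supplementary
            mapped += 1
    return mapped, unmapped


# Example use on the file written by the command above:
# with open("output.sam") as fh:
#     print(count_mapped(fh))
```

For paired-end data each mate is a separate record, so the counts are per read rather than per pair.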

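Key options as listed when this post was first written. Defaults can differ between versions; for example, the v0.7 help output above gives 20 as the default for -k.

-k INT strobe length [22]
-s INT syncmer thinning parameter to sample strobes. A value of s=k-4 roughly represents w=10 as minimizer window [k-4].
-f FLOAT top fraction of repetitive syncmers to filter out from sampling [0.0002]
-t INT number of threads [3]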
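The remark that s=k-4 roughly corresponds to a minimizer window of w=10 can be checked with the standard expected densities on random sequence: about 1/(k-s+1) for open syncmers and about 2/(w+1) for random minimizers. The Python sketch below only evaluates these two formulas; the function names are hypothetical, and real genomes will deviate from the random-sequence expectation.

```python
def open_syncmer_density(k: int, s: int) -> float:
    """Expected fraction of k-mers kept as open syncmers on random sequence."""
    return 1.0 / (k - s + 1)


def minimizer_density(w: int) -> float:
    """Expected fraction of k-mers kept as random minimizers with window w."""
    return 2.0 / (w + 1)


k = 20
print(open_syncmer_density(k, k - 4))  # 0.2
print(minimizer_density(10))           # about 0.18
```

Both schemes keep roughly one k-mer in five, which is why the two settings sample the sequence at a similar rate.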

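The tool was still under development and I expected its performance to improve further, but I decided to introduce it early. => It has since been updated considerably. v0.7 is the latest version at the time of this update. Please check the release history.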

Reference

Faster short-read mapping with strobemer seeds constructed from syncmers
Kristoffer Sahlin

bioRxiv, Posted November 07, 2021