塩基配列アラインメント用ライブラリ BSAlign

　塩基配列アライメントの精度を高めることはゲノム研究において必須の課題である。古典的な動的プログラミングアルゴリズム（Smith-WatermanやNeedleman-Wunschなど）は最適な結果を出すことを保証しているが、その時間の複雑さが大規模配列アライメントへの応用を妨げている。アラインメントプロセスを高速化することを目的とした多くの最適化努力は、一般に3つの観点からもたらされる：データ構造の再設計（例えば、対角またはstriped Single Instruction Multiple Data (SIMD) 実装）、SIMD演算における並列数の増加（例えば、差分再帰関係）、または検索空間の削減（例えば、バンド動的計画法）。しかし、これら3つの側面を組み合わせて超高速アルゴリズムを構築する手法はない。本著者らは、BSAlignと名付けたBanded Striped Aligner(ライブラリ)を開発した。BSAlignは、前述の3つの観点の全てを利用するために、一連の新しい手法を編み出すことによって、超高速で正確なアライメント結果を提供する。この新しい高速化設計を、通常のペアワイズアライメントと編集距離ペアワイズアライメントの両方に適用した。BSAlignは、通常のペアワイズアラインメントでは、他のSIMDベースの実装よりも2倍高速化しており、長いリードの編集距離に基づく実装では、1.5倍から4倍の高速化を達成した。BSAlignはC言語で実装されており、https://github.com/ruanjue/bsalignから利用できる。

インストール

Github

git clone https://github.com/ruanjue/bsalign.git
cd bsalign
make

cd example
sh run.sh

> ./bsalign

Program: bsalign

Version: 1.2.1

Author : Jue Ruan <ruanjue@caas.cn>

Usage : bsalign <cmd> [options]

commands:

align Pairwise alignment implemented by 8-bit encoded Banded Striped SIMD

edit Pairwise alignment using edit distance implemented by 2-bit encoded banded Striped algorithm

poa Multiple alignment implemented by 8-bit encoded Banded Striped SIMD Partial Order Alignment

cat Concatenate pieces of seqs into one seq by overlaping

Tips:

# To invoke affine gap cost pairwise/multiple alignment

-M 2 -X 6 -O 3 -E 2 -Q 0 -P 0

# To invoke 2-piece gap cost pairwise/multiple alignment

-M 2 -X 6 -O 3 -E 2 -Q 8 -P 1

> ./bsalign align

Usage: bsalign align [options] <input fasta/q.gz file>

-m <string> Align mode: global/extend/overlap, [overlap]

-W <int> Bandwidth, 0: full length, [0]

-M <int> Score for match, [2]

-X <int> Penalty for mismatch, [6]

-O <int> Penalty for gap open, [3]

-E <int> Penalty for gap extension, [2]

-Q <int> Penalty for gap2 open, [0]

-P <int> Penalty for gap2 extension, [0]

-L <int> Line width for alignment output, [0]

-R <int> Repeat times (for benchmarking) [1]

-v Verbose

# To invoke linear gap cost pairwise alignment

-M 2 -X 6 -O 0 -E 3 -Q 0 -P 0

# To invoke affine gap cost pairwise alignment

-M 2 -X 6 -O 3 -E 2 -Q 0 -P 0

# To invoke 2-piece gap cost pairwise alignment

-M 2 -X 6 -O 3 -E 2 -Q 8 -P 1

> ./bsalign edit

Usage: bsalign edit [option] <input fasta/q.gz file>

-m <string> Align mode: global/extend/overlap/kmer, [global]

in overlap and extend mode, disable bandwidth

-W <int> Bandwidth, 0: full length, [0]

-k <int> Kmer size (<=15), [13]

-v Verbose

-R <int> Repeat times (for benchmarking) [1]

> ./bsalign poa

Usage: bsalign poa [option] <input fasta/q.gz file>

-o <string> Consensus fasta file, [NULL]

-m <string> Align mode: global/extend/overlap, [global]

-W <int> Bandwidth, 0: full length, [128]

-M <string> Score for match for poa and local realignment [2,2]

-X <string> Penalty for mismatch, [6,6]

-O <string> Penalty for gap open, [3,3]

-E <string> Penalty for gap extension, [2,2]

-Q <string> Penalty for gap2 open, [0,0]

-P <string> Penalty for gap2 extension, [0,0]

-G <string> misc parameters for POA, <tag>=<val>

Defaults: seqcore=40,shuffle=1,refmode=0,refbonus=1,nrec=20,kmer=15,trigger=1,realn=3,editbw=32

,althi=5,qlthi=70,psub=0.10,pins=0.10,pdel=0.15,piex=0.15,pdex=0.20,hins=0.20,hdel=0.40

,varcnt=3,covfrq=0.5,snvqlt=5

seqcore: number of seqs in core MSA, addtional reads will be realigned (realn > 0) against core MSA profile to build a full MSA

shuffle: whether to shuffle/sort the reads according to most kmers matches first

refmode: whether the first sequences is reference sequence

refbonus: base match score on reference will be M + refbonus

nrec: every query read is aligning against previous <nrec> reads on graph, 0 to all the previous

trigger: when <trigger> > 0 and <-W> < query length, genrates CNS per after <trigger> reads, and trigger banded alignment

realn: rounds of realignment

editbw=32: bandwidth in edit MSA

althi: cutoff of high alt base quality, used in tidying out possible SNV sites

qlthi: cutoff of high cns base quality, used in print colorful MSA

psub/pins/pdel/piex/pdex: for consensus, probs. of mis/ins/del/ins_ext/del_ext

hins/hdel: probs of ins/del in homopolymer region

varcnt: min count of variant base in SNV sites

covfrq: min spanned_reads / total_reads in SNV sites

snvqlt: - log10(p-value), 5 ~= 1e-5

-L Print MSA in 'one seq one line'

-C Print MSA in colorful format

-R <int> Repeat times (for benchmarking) [1]

-v Verbose

# different sequencing error pattern

tunes psub,pins,pdel,piex,pdex,hins,hdel

# Treats with reads having large offsets with each other in multiple alignment

set '-G trigger=0' to disable banded alignment, will be much much slower

# Many reads

Don't warry, it only take linear time, except that you set a big seqcore value

# false positive SNV

It tends to report as many SNVs as it can, your turn to filter them

> ./bsalign cat -h

Usage: bsalign cat [option] <input fasta/q.gz file>

-o <string> Consensus fasta file, [NULL]

-W <int> Overlap size, [1024]

You can specify individual overlap by add 'overlap=2048' to its seq header

-M <string> Score for match for poa and local realignment [2,1]

-X <string> Penalty for mismatch, [6,2]

-O <string> Penalty for gap open, [3,0]

-E <string> Penalty for gap extension, [2,1]

-v Verbose

テストラン

cd example
sh run.sh

https://github.com/ruanjue/bsalign/blob/master/example/run.shが実行される。

各結果は4行で表示される。

（レポジトリより転載）

1行目： 1列目-RefName; 2列目-RefLength; 3列目-RefStrand; 4列目-RefStart; 5列目-RefEnd; 6列目-QueryName; 7列目-QueryLength; 8列目-QueryStrand; 9列目-QueryStart； 10列目-QueryEnd; 11列目-AlignmentScore; 12列目-Identity、13列目-NumberOfMatch、14列目-NumberOfMismatch、15列目-NumberOfDeletion、16列目-NumberOfInsertion

2行目：参照配列

3行目：'|'、'*'、'-'はそれぞれマッチ、ミスマッチ、indelを意味する

4行目：クエリ配列

引用

BSAlign: a library for nucleotide sequence alignment
Haojing Shao, Jue Ruan

bioRxiv, Posted January 17, 2024