塩基配列アライメントの精度を高めることはゲノム研究において必須の課題である。古典的な動的プログラミングアルゴリズム(Smith-WatermanやNeedleman-Wunschなど)は最適な結果を出すことを保証しているが、その時間の複雑さが大規模配列アライメントへの応用を妨げている。アラインメントプロセスを高速化することを目的とした多くの最適化努力は、一般に3つの観点からもたらされる:データ構造の再設計(例えば、対角またはstriped Single Instruction Multiple Data (SIMD) 実装)、SIMD演算における並列数の増加(例えば、差分再帰関係)、または検索空間の削減(例えば、バンド動的計画法)。しかし、これら3つの側面を組み合わせて超高速アルゴリズムを構築する手法はない。本著者らは、BSAlignと名付けたBanded Striped Aligner(ライブラリ)を開発した。BSAlignは、前述の3つの観点の全てを利用するために、一連の新しい手法を編み出すことによって、超高速で正確なアライメント結果を提供する。この新しい高速化設計を、通常のペアワイズアライメントと編集距離ペアワイズアライメントの両方に適用した。BSAlignは、通常のペアワイズアラインメントでは、他のSIMDベースの実装よりも2倍高速化しており、長いリードの編集距離に基づく実装では、1.5倍から4倍の高速化を達成した。BSAlignはC言語で実装されており、https://github.com/ruanjue/bsalignから利用できる。
インストール
git clone https://github.com/ruanjue/bsalign.git
cd bsalign
make
cd example
sh run.sh
> ./bsalign
Program: bsalign
Version: 1.2.1
Author : Jue Ruan <ruanjue@caas.cn>
Usage : bsalign <cmd> [options]
commands:
align Pairwise alignment implemented by 8-bit encoded Banded Striped SIMD
edit Pairwise alignment using edit distance implemented by 2-bit encoded banded Striped algorithm
poa Multiple alignment implemented by 8-bit encoded Banded Striped SIMD Partial Order Alignment
cat Concatenate pieces of seqs into one seq by overlaping
Tips:
# To invoke affine gap cost pairwise/multiple alignment
-M 2 -X 6 -O 3 -E 2 -Q 0 -P 0
# To invoke 2-piece gap cost pairwise/multiple alignment
-M 2 -X 6 -O 3 -E 2 -Q 8 -P 1
> ./bsalign align
Usage: bsalign align [options] <input fasta/q.gz file>
-m <string> Align mode: global/extend/overlap, [overlap]
-W <int> Bandwidth, 0: full length, [0]
-M <int> Score for match, [2]
-X <int> Penalty for mismatch, [6]
-O <int> Penalty for gap open, [3]
-E <int> Penalty for gap extension, [2]
-Q <int> Penalty for gap2 open, [0]
-P <int> Penalty for gap2 extension, [0]
-L <int> Line width for alignment output, [0]
-R <int> Repeat times (for benchmarking) [1]
-v Verbose
# To invoke linear gap cost pairwise alignment
-M 2 -X 6 -O 0 -E 3 -Q 0 -P 0
# To invoke affine gap cost pairwise alignment
-M 2 -X 6 -O 3 -E 2 -Q 0 -P 0
# To invoke 2-piece gap cost pairwise alignment
-M 2 -X 6 -O 3 -E 2 -Q 8 -P 1
> ./bsalign edit
Usage: bsalign edit [option] <input fasta/q.gz file>
-m <string> Align mode: global/extend/overlap/kmer, [global]
in overlap and extend mode, disable bandwidth
-W <int> Bandwidth, 0: full length, [0]
-k <int> Kmer size (<=15), [13]
-v Verbose
-R <int> Repeat times (for benchmarking) [1]
> ./bsalign poa
Usage: bsalign poa [option] <input fasta/q.gz file>
-o <string> Consensus fasta file, [NULL]
-m <string> Align mode: global/extend/overlap, [global]
-W <int> Bandwidth, 0: full length, [128]
-M <string> Score for match for poa and local realignment [2,2]
-X <string> Penalty for mismatch, [6,6]
-O <string> Penalty for gap open, [3,3]
-E <string> Penalty for gap extension, [2,2]
-Q <string> Penalty for gap2 open, [0,0]
-P <string> Penalty for gap2 extension, [0,0]
-G <string> misc parameters for POA, <tag>=<val>
Defaults: seqcore=40,shuffle=1,refmode=0,refbonus=1,nrec=20,kmer=15,trigger=1,realn=3,editbw=32
,althi=5,qlthi=70,psub=0.10,pins=0.10,pdel=0.15,piex=0.15,pdex=0.20,hins=0.20,hdel=0.40
,varcnt=3,covfrq=0.5,snvqlt=5
seqcore: number of seqs in core MSA, addtional reads will be realigned (realn > 0) against core MSA profile to build a full MSA
shuffle: whether to shuffle/sort the reads according to most kmers matches first
refmode: whether the first sequences is reference sequence
refbonus: base match score on reference will be M + refbonus
nrec: every query read is aligning against previous <nrec> reads on graph, 0 to all the previous
trigger: when <trigger> > 0 and <-W> < query length, genrates CNS per after <trigger> reads, and trigger banded alignment
realn: rounds of realignment
editbw=32: bandwidth in edit MSA
althi: cutoff of high alt base quality, used in tidying out possible SNV sites
qlthi: cutoff of high cns base quality, used in print colorful MSA
psub/pins/pdel/piex/pdex: for consensus, probs. of mis/ins/del/ins_ext/del_ext
hins/hdel: probs of ins/del in homopolymer region
varcnt: min count of variant base in SNV sites
covfrq: min spanned_reads / total_reads in SNV sites
snvqlt: - log10(p-value), 5 ~= 1e-5
-L Print MSA in 'one seq one line'
-C Print MSA in colorful format
-R <int> Repeat times (for benchmarking) [1]
-v Verbose
# different sequencing error pattern
tunes psub,pins,pdel,piex,pdex,hins,hdel
# Treats with reads having large offsets with each other in multiple alignment
set '-G trigger=0' to disable banded alignment, will be much much slower
# Many reads
Don't warry, it only take linear time, except that you set a big seqcore value
# false positive SNV
It tends to report as many SNVs as it can, your turn to filter them
> ./bsalign cat -h
Usage: bsalign cat [option] <input fasta/q.gz file>
-o <string> Consensus fasta file, [NULL]
-W <int> Overlap size, [1024]
You can specify individual overlap by add 'overlap=2048' to its seq header
-M <string> Score for match for poa and local realignment [2,1]
-X <string> Penalty for mismatch, [6,2]
-O <string> Penalty for gap open, [3,0]
-E <string> Penalty for gap extension, [2,1]
-v Verbose
テストラン
cd example
sh run.sh
https://github.com/ruanjue/bsalign/blob/master/example/run.shが実行される。
各結果は4行で表示される。
(レポジトリより転載)
1行目: 1列目-RefName; 2列目-RefLength; 3列目-RefStrand; 4列目-RefStart; 5列目-RefEnd; 6列目-QueryName; 7列目-QueryLength; 8列目-QueryStrand; 9列目-QueryStart; 10列目-QueryEnd; 11列目-AlignmentScore; 12列目-Identity、13列目-NumberOfMatch、14列目-NumberOfMismatch、15列目-NumberOfDeletion、16列目-NumberOfInsertion
2行目 : 参照配列
3行目:'|'、'*'、'-'はそれぞれマッチ、ミスマッチ、indelを意味する
4行目:クエリ配列
引用
BSAlign: a library for nucleotide sequence alignment
Haojing Shao, Jue Ruan
bioRxiv, Posted January 17, 2024