リピートの多いゲノム配列にロングリードをマッピングするために最適化されたアライナー Winnowmap

2022/04/02 論文引用

　ヒトゲノムの約5～10％は、セグメント重複やタンデムリピート配列などの繰り返し配列が存在するため、機能解析にアクセスできない状態になっている。高品質な個人ゲノムのリシークエンシングを可能にするためには、リピートを考慮したリードマッピング手法を用いて、エンドツーエンドでのゲノムバリアント発見をサポートすることが重要である。本研究では、既存のロングリードマッパーでは、対立遺伝子バイアスの影響を受けやすく、長い、ほぼ同一のリピート内の不正確なアラインメントやバリアントコールが生じることが多いという事実を明らかにした。リピート内に非リファレンスアリルが存在する場合、標準的なペアワイズ配列アライメントスコアリングシステムが真のバリアントにペナルティを与えるため、その領域からサンプリングされたリードは誤ったリピートコピーにマッピングされる可能性がある。

　上記の問題に対処するために、著者らは、minimal confidently alignable substrings (MCASs)を利用することで対立遺伝子のバイアスに対処する、新しいロングリードマッピング方法を提案する。MCASは、十分なマッピング信頼度（すなわち、ユーザーが指定した閾値以上のマッピング品質スコア）を持つ参照遺伝子座へのユニークなアラインメントを持つリードの最小長さの部分文字列として定式化される。このアプローチでは、各リードのマッピングを確信度の高いサブアラインメントの集合として扱い、構造変化に対してより寛容であり、リピート内のパラログ特異的バリアント（PSV）に対してより敏感である。本研究では、MCASを数学的に定義し、正確なアルゴリズムとその計算のための実用的なヒューリスティックな方法について述べる。提案手法（Winnowmap2と呼ばれる）は、最近完成したヒト染色体Xとヒト8番染色体のギャップレスアセンブリを参照して、シミュレーションと実際の長さのリードベンチマークを用いて評価した。その結果、Winnowmap2が対立遺伝子のバイアスの問題にうまく対処し、反復配列の下流のバリアントコールをより正確に行えることを示した。例として、シミュレーションされたPacBio HiFiリードと8番染色体の構造バリアントを使用した場合、Winnowmap2のアラインメントは、ほぼ同一反復内の構造バリアントコールについて、それぞれminimap2（39.62％、5.88％）とNGMLR（56.60％、36.11％）と比較して、最も低い偽陰性率と偽陽性率（1.89％、1.89％）を達成していた。Winnowmap2のコードは、https://github.com/marbl/Winnowmapで利用できる。

インストール

Github

git clone https://github.com/marbl/Winnowmap.git
cd Winnowmap
make -j8
cd bin/

> ./winnowmap -h

$ ./winnowmap -h

winnowmap is built on top of minimap2, while modifying its minimizer sampling and indexing procedures

Usage: winnowmap [options] <target.fa>|<target.idx> [query.fa] [...]

Options:

Indexing:

-H use homopolymer-compressed k-mer

-k INT k-mer size (no larger than 28) [15]

-w INT minimizer window size [50]

-W FILE input file containing list of high freq. k-mers

-I NUM split index for every ~NUM input bases [4G]

-d FILE dump index to FILE

Mapping:

-f FLOAT filter out top FLOAT (<1) fraction of repetitive minimizers [0.0]

-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]

-G NUM max intron length (effective with -xsplice; changing -r) [200k]

-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]

-r NUM bandwidth used in chaining and DP-based alignment [500]

-n INT minimal number of minimizers on a chain [3]

-m INT minimal chaining score (matching bases minus log gap penalty) [40]

-X skip self and dual mappings (for the all-vs-all mode)

-p FLOAT min secondary-to-primary score ratio [0.8]

--sv-off turn off SV-aware mode

Alignment:

-A INT matching score [2]

-B INT mismatch penalty [4]

-O INT[,INT] gap open penalty [4,24]

-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]

-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]

-s INT minimal peak DP alignment score [80]

-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]

Input/Output:

-a output in the SAM format (PAF by default)

-o FILE output alignments to FILE [stdout]

-L write CIGAR with >65535 ops at the CG tag

-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar'

-c output CIGAR in PAF

--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]

--MD output the MD tag

--eqx write =/X CIGAR operators

-Y use soft clipping for supplementary alignments

-t INT manually set pthread count rather than automatically

-K NUM minibatch size for mapping [1000M]

--version show version number

Preset:

-x STR preset (always applied before other options)

- map-ont (ont-to-ref, uses default param)

- map-pb (hifi-to-ref, all defaults but does finer read fragmentation in SV-aware mapping)

- map-pb-clr (clr-to-ref, sets --sv-off)

- splice/splice:hq - long-read/Pacbio-CCS spliced alignment, sets --sv-off

- asm5/asm10/asm20 - asm-to-ref mapping

> ./meryl

$ ./meryl

usage: ./meryl ...

A meryl command line is formed as a series of commands and files, possibly

grouped using square brackets. Each command operates on the file(s) that

are listed after it.

COMMANDS:

print display kmers on the screen as 'kmer<tab>count'. accepts exactly one input.

count Count the occurrences of canonical kmers in the input. must have 'output' specified.

count-forward Count the occurrences of forward kmers in the input. must have 'output' specified.

count-reverse Count the occurrences of reverse kmers in the input. must have 'output' specified.

k=<K> create mers of size K bases (mandatory).

n=<N> expect N mers in the input (optional; for precise memory sizing).

memory=M use no more than (about) M GB memory.

threads=T use no more than T threads.

less-than N return kmers that occur fewer than N times in the input. accepts exactly one input.

greater-than N return kmers that occur more than N times in the input. accepts exactly one input.

equal-to N return kmers that occur exactly N times in the input. accepts exactly one input.

not-equal-to N return kmers that do not occur exactly N times in the input. accepts exactly one input.

increase X add X to the count of each kmer.

decrease X subtract X from the count of each kmer.

multiply X multiply the count of each kmer by X.

divide X divide the count of each kmer by X.

modulo X set the count of each kmer to the remainder of the count divided by X.

union return kmers that occur in any input, set the count to the number of inputs with this kmer.

union-min return kmers that occur in any input, set the count to the minimum count

union-max return kmers that occur in any input, set the count to the maximum count

union-sum return kmers that occur in any input, set the count to the sum of the counts

intersect return kmers that occur in all inputs, set the count to the count in the first input.

intersect-min return kmers that occur in all inputs, set the count to the minimum count.

intersect-max return kmers that occur in all inputs, set the count to the maximum count.

intersect-sum return kmers that occur in all inputs, set the count to the sum of the counts.

difference return kmers that occur in the first input, but none of the other inputs

symmetric-difference return kmers that occur in exactly one input

MODIFIERS:

output O write kmers generated by the present command to an output meryl database O

mandatory for count operations.

EXAMPLES:

Example: Report 22-mers present in at least one of input1.fasta and input2.fasta.

Kmers from each input are saved in meryl databases 'input1' and 'input2',

but the kmers in the union are only reported to the screen.

meryl print \

union \

[count k=22 input1.fasta output input1] \

[count k=22 input2.fasta output input2]

Example: Find the highest count of each kmer present in both files, save the kmers to

database 'maxCount'.

meryl intersect-max input1 input2 output maxCount

Example: Find unique kmers common to both files. Brackets are necessary

on the first 'equal-to' command to prevent the second 'equal-to' from

being used as an input to the first 'equal-to'.

meryl intersect [equal-to 1 input1] equal-to 1 input2

実行方法

１、pre-computing high frequency k-mers using meryl

meryl count k=15 output merylDB ref.fa 
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

２、mapping

#ONT
winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont.fq.gz > output.sam

#Pacbio
winnowmap -W repetitive_k15.txt -ax map-pb ref.fa hifi.fq.gz > output.sam

引用

A long read mapping method for highly repetitive reference sequences

Chirag Jain, Arang Rhie, Nancy Hansen, Sergey Koren, Adam M. Phillippy

bioRxiv, Posted November 02, 2020

2022/04/02

Long-read mapping to repetitive reference sequences using Winnowmap2

Chirag Jain, Arang Rhie, Nancy F. Hansen, Sergey Koren & Adam M. Phillippy

Nature Methods, Published: 01 April 2022