Strobemers for effective sequence similarity detection - macでインフォマティクス

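k-mer-based methods are widely used in bioinformatics for various types of sequence comparison. However, a single mutation alters k consecutive k-mers, which makes most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, such as spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and the pairing or grouping of k-mers is performed only in a second step. Such techniques produce many redundant k-mer matches owing to the size of k. Here, the authors propose strobemers as an alternative to k-mers for sequence comparison. A strobemer consists of two or more shorter k-mers linked together, where the combination of linked k-mers is decided by a hash function. Using simulated data, they show that strobemers give more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. The authors further implement a proof-of-concept sequence-matching tool, StrobeMap, and use synthetic and biological Oxford Nanopore sequencing data to show the utility of strobemers for sequence comparison in different contexts, such as sequence clustering and alignment scenarios.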
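The two ideas in the abstract can be sketched in a few lines of Python. This is a simplified illustration, not the paper's or StrobeMap's implementation: the function names `kmers` and `randstrobes`, the BLAKE2 hash, and the XOR-based rule for choosing the second strobe are all assumptions made for the example. It only covers strobemers made of two strobes.

```python
import hashlib


def h(s: str) -> int:
    """Deterministic 64-bit hash of a string (illustrative choice of hash)."""
    digest = hashlib.blake2b(s.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmers(seq: str, k: int) -> list:
    """All k-mers of seq, indexed by start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def randstrobes(seq: str, k: int, w_min: int, w_max: int) -> list:
    """Order-2 strobemers as (start1, start2, strobe1 + strobe2).

    The first strobe is the k-mer at position i. The second strobe is the
    k-mer starting in [i + w_min, i + w_max] that minimizes
    h(strobe1) XOR h(candidate). The XOR rule is an assumption made for
    this sketch; the key point is that the choice depends on a hash of
    the sequence content, not on a fixed spacing.
    """
    out = []
    last_start = len(seq) - k
    for i in range(last_start + 1):
        lo = i + w_min
        hi = min(i + w_max, last_start)
        if lo > hi:
            break  # no room left for a second strobe
        h1 = h(seq[i:i + k])
        j = min(range(lo, hi + 1), key=lambda p: h1 ^ h(seq[p:p + k]))
        out.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return out
```

Comparing `kmers(ref, 30)` with `kmers(mutated, 30)` position by position, a single substitution away from the sequence ends changes exactly 30 of the k-mers. With strobemers, the spacing between the two strobes varies from position to position, so the sampled positions near a mutation are less uniformly affected.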

My paper on strobemers is now available in @genomeresearch https://t.co/R1El1UXKiG. This is hopefully the start of some exciting research on strobemers and similar types of constructs. I added some results since the preprint version:

[1/8]
— Kristoffer Sahlin (@krsahlin) October 20, 2021

Installation

GitHub

#StrobeMap (proof of concept for strobemers)
wget https://github.com/ksahlin/strobemers/raw/main/strobemers_cpp/binaries/Linux/StrobeMap-0.0.2
chmod +x StrobeMap-0.0.2
./StrobeMap-0.0.2  # test program

$ ./StrobeMap-0.0.2

StrobeMap VERSION 0.0.2

StrobeMap [options] <references.fasta> <queries.fast[a/q]>

options:
-n INT number of strobes [2]
-k INT strobe length, limited to 32 [20]
-v INT strobe w_min offset [k+1]
-w INT strobe w_max offset [70]
-t INT number of threads [3]
-o name of output tsv-file [output.tsv]
-c Choice of protocol to use; kmers, minstrobes, hybridstrobes, randstrobes [randstrobes].
-C UINT Mask (do not process) strobemer hits with count larger than C [1000]
-L UINT Print at most L NAMs per query [1000]. Will print the NAMs with highest score S = n_strobemer_hits * query_span.
-S Sort output NAMs for each query based on score. Default is to sort first by ref ID, then by query coordinate, then by reference coordinate.
-s Split output into one file per thread and forward/reverse complement mappings. This option is used to generate format compatible with uLTRA long-read RNA aligner and requires option -o to be specified as a folder path to uLTRA output directory, e.g., -o /my/path/to/uLTRA_output/
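The score and sort order described for -L and -S can be illustrated with a small Python sketch. The `Nam` record, its field names, and `select_nams` are hypothetical and do not reflect StrobeMap's internal data structures. Two further assumptions: query_span is taken to be query_end - query_start, and the -L cutoff is applied before the final sort.

```python
from dataclasses import dataclass


@dataclass
class Nam:
    """Hypothetical record for one match (NAM) of a query against a reference."""
    ref_id: int
    query_start: int
    query_end: int
    ref_start: int
    n_hits: int  # number of strobemer hits supporting this match

    @property
    def score(self) -> int:
        # S = n_strobemer_hits * query_span, with query_span assumed
        # to be query_end - query_start.
        return self.n_hits * (self.query_end - self.query_start)


def select_nams(nams, max_nams=1000, sort_by_score=False):
    """Keep the max_nams highest-scoring matches, then order them.

    sort_by_score=True mimics -S (descending score). Otherwise the order
    is ref ID, then query coordinate, then reference coordinate.
    """
    top = sorted(nams, key=lambda m: m.score, reverse=True)[:max_nams]
    if sort_by_score:
        return top
    return sorted(top, key=lambda m: (m.ref_id, m.query_start, m.ref_start))
```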

Usage

Specify the reference sequences (e.g., a genome) and the query sequences (e.g., the genome sequence of a target that carries mutations).

StrobeMap -k 30 -n 3 -v 31 -w 60 -c randstrobes -o mapped.tsv ref.fa query.fa

-k INT strobe length, limited to 32 [20]
-n INT number of strobes [2]
-v INT strobe w_min offset [k+1]
-w INT strobe w_max offset [70]
-c Choice of protocol to use; kmers, minstrobes, hybridstrobes, randstrobes [randstrobes].
-o name of output tsv-file [output.tsv]
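When running StrobeMap from a script, the command line can be assembled and checked in Python first. The sketch below is an illustration only: `strobemap_cmd` is a hypothetical helper, the binary path assumes the file downloaded in the installation step sits in the current directory, and the w_min <= w_max check is a sanity check added for this example, not something the help text states. The k <= 32 limit and the protocol names come from the help text.

```python
import subprocess

PROTOCOLS = ("kmers", "minstrobes", "hybridstrobes", "randstrobes")


def strobemap_cmd(ref, query, out="mapped.tsv", k=30, n=3, w_min=31, w_max=60,
                  protocol="randstrobes", binary="./StrobeMap-0.0.2"):
    """Build the argument list for a StrobeMap run (hypothetical helper)."""
    if not 1 <= k <= 32:
        raise ValueError("strobe length k is limited to 32")
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol}")
    if w_min > w_max:
        raise ValueError("w_min must not exceed w_max")
    return [binary,
            "-k", str(k), "-n", str(n),
            "-v", str(w_min), "-w", str(w_max),
            "-c", protocol, "-o", out,
            ref, query]


def run_strobemap(ref, query, **kwargs):
    """Run StrobeMap and raise CalledProcessError if it exits with an error."""
    subprocess.run(strobemap_cmd(ref, query, **kwargs), check=True)
```

Calling `strobemap_cmd("ref.fa", "query.fa")` reproduces the arguments of the command above, with the downloaded binary path in place of `StrobeMap`.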