Syotti: scalable bait design for DNA enrichment

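Bait enrichment is a protocol that is becoming increasingly widespread because it has been shown to amplify regions of interest in metagenomic samples successfully. In this method, a set of synthetic probes ("baits") is designed, manufactured, and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA, unbound DNA is washed away, and the bound fragments are amplified for sequencing. Metsky et al. demonstrated that bait enrichment can detect a large number of human viral pathogens within metagenomic samples.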

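The authors formalize the bait design problem by defining the Minimum Bait Cover problem, show that it is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. The method is called Syotti. Syotti runs at least an order of magnitude faster than state-of-the-art methods, including the method of Metsky et al., and scales linearly in practice. At the same time, it produces bait sets that are smaller than those of the competing methods and leaves fewer positions uncovered. In contrast, the method of Metsky et al. shows clearly super-linear running time and could not process even a 17% subset of the data within 72 hours.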

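Rather than sequencing the entire microbial population in a sample, targeted sequencing makes it possible to select the sequences of interest. This can be achieved through targeted enrichment, which uses biotinylated cDNA bait molecules to target specific regions of a DNA library for sequencing. Specifically, the bait molecules (short, synthetic, single-stranded cDNA molecules) bind to their DNA targets and are captured within the sample using a magnet. DNA fragments that are not captured (the non-target DNA) are washed away, so only the bound (target) DNA is sequenced. Non-target DNA is not eliminated completely, but it is greatly reduced. The first, and arguably one of the most important, steps of this process is to solve a computational problem: given a set of target DNA sequences (for example, a set of genes or viral strains) and a specified bait length k, identify a set of baits such that every position of every sequence in the database is bound by at least one bait in the set.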
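To make the covering condition concrete, here is a minimal shell/awk sketch that measures how much of a single sequence a given bait set covers. It is only an illustration of the definition, not Syotti's algorithm: it assumes a bait "binds" at a position when its Hamming distance to the window at that position is at most d, it looks at the forward strand only, and it is quadratic, so it is meant for toy inputs. The function name coverage_fraction is hypothetical.

```shell
#!/usr/bin/env bash
# coverage_fraction D SEQUENCE BAIT...
# Prints the fraction of positions of SEQUENCE (non-empty) covered by at least
# one bait, where a bait matches a window if their Hamming distance is <= D.
# Simplifying assumption: forward strand only, naive scan (toy inputs only).
coverage_fraction() {
  local d="$1" seq="$2"; shift 2
  printf '%s\n' "$@" | awk -v d="$d" -v seq="$seq" '
    # Number of mismatching characters between two equal-length strings.
    function hamming(a, b,   i, h) {
      h = 0
      for (i = 1; i <= length(a); i++)
        if (substr(a, i, 1) != substr(b, i, 1)) h++
      return h
    }
    {
      k = length($0)                      # one bait per input line
      for (i = 1; i + k - 1 <= length(seq); i++)
        if (hamming($0, substr(seq, i, k)) <= d)
          for (j = i; j < i + k; j++) covered[j] = 1
    }
    END {
      n = 0
      for (j in covered) n++
      printf "%.2f\n", n / length(seq)
    }'
}
```

For example, `coverage_fraction 0 ACGTACGTAA ACGT` reports that the single bait ACGT covers positions 1 to 8 of the 10-base sequence, and adding the bait GTAA covers the remaining two positions. A bait cover in the sense above is a bait set for which this fraction is 1 for every target sequence.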

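It was initially expected that baits could be designed computationally by finding every length-k substring (k-mer) in the target DNA. This approach is infeasible, however, because the number of k-mers grows rapidly as the dataset gets larger. Effective bait design therefore has to address two key challenges. (The rest is omitted here.)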
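The growth in the number of k-mers is easy to observe on small inputs. The following sketch counts total and distinct k-mers in a FASTA file with awk. The function name count_kmers is hypothetical, only the forward strand is counted, and every distinct k-mer is held in memory, so this is for experimenting with small files rather than full genome collections.

```shell
#!/usr/bin/env bash
# count_kmers K FASTA
# Prints the total and distinct number of length-K substrings in FASTA.
# Assumptions: forward strand only; all distinct k-mers are kept in memory.
count_kmers() {
  awk -v k="$1" '
    # Count the k-mers of the sequence collected so far, then reset it.
    function flush_seq(   i, n, kmer) {
      n = length(seq)
      for (i = 1; i + k - 1 <= n; i++) {
        kmer = substr(seq, i, k)
        total++
        if (!(kmer in seen)) { seen[kmer] = 1; distinct++ }
      }
      seq = ""
    }
    /^>/ { flush_seq(); next }            # header line starts a new record
    { seq = seq toupper($0) }             # join wrapped sequence lines
    END { flush_seq(); printf "total=%d distinct=%d\n", total, distinct }
  ' "$2"
}
```

Running `count_kmers 120 testcases/coli3.fna` on the test file shipped with the repository gives a feel for how many candidate baits an exhaustive approach would have to handle, though the awk process may need a lot of memory even for that file.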

Installation

GitHub

git clone https://github.com/jnalanko/syotti.git
cd syotti/
git submodule init
git submodule update
cd sdsl-lite
sh install.sh
cd ..
make toolkit

> ./bin/greedy

Computes a greedy bait cover.
Usage:
  ./bin/greedy [OPTION...]

  -L, --bait-len arg          Length of the baits. (default: 120)
  -d, --hamming-distance arg  Number of allowed mismatches in the baits.
                              (default: 40)
  -s, --sequences arg         Path to a fasta file of the input sequences.
                              (default: "")
      --fm-index-out arg      The algorithm is based on FM-index, which we
                              build at the start. Building the index can take a
                              lot of time and memory. Use this option to save
                              the FM-index to disk so that you can later run
                              the algorithm with different parameters re-using
                              the same FM-index. (optional). (default: "")
  -f, --fm-index arg          Path to a previously saved FM-index on disk
                              (--fm-index-out). This option loads the FM index
                              from disk instead of building it again. (default:
                              "")
  -o, --out arg               Filename prefix for the output files. (default:
                              "")
  -r, --randomize             Randomize the processing order of the sequences
                              in the greedy algorithm.
  -t, --n-threads arg         Maximum number of parallel threads. The program
                              is not very well optimized for parallel
                              processing, so don't expect much of a speedup here.
                              (default: 1)
  -c, --cutoff arg            Stop the greedy algorithm after this fraction
                              of positions is covered. For example: 0.99.
                              (default: 1)
  -g, --seed-len arg          The length of the seeds in the FM-index
                              seed-and-extend approximate string search subroutine. A
                              lower value will find more matches, but will be
                              slower. (default: 20)
  -h, --help                  Print instructions.

> ./bin/build_FM_index

Build an FM index.
Usage:
  ./bin/build_FM_index [OPTION...]

  -s, --sequences arg  Path to the fasta file of the input sequences.
                       (default: "")
  -o, --output arg     Path to the output FM-index file (default: "")
  -h, --help           Print instructions.

> ./bin/fill_gaps

Fills gaps in a given bait cover.
Usage:
  ./bin/fill_gaps [OPTION...]

  -G, --max-gap arg           Maximum allowable gap length. (default: 0)
  -d, --hamming-distance arg  Number of allowed mismatches in the baits.
                              (default: 40)
  -s, --sequences arg         Path to the fasta file of the input sequences.
                              (default: "")
  -b, --baits arg             Path to the fasta file of the baits. (default:
                              "")
  -c, --cover-marks arg       Path to the file of the cover marks created by
                              the greedy algorithm. (default: "")
  -o, --out arg               Output filename (fasta) (default: "")
      --fm-index-out arg      The algorithm is based on FM-index, which we
                              build at the start. Building the index can take a
                              lot of time and memory. Use this option to save
                              the FM-index to disk so that you can later run
                              the algorithm with different parameters re-using
                              the same FM-index. (optional). (default: "")
  -f, --fm-index arg          Path to a previously saved FM-index on disk
                              (--fm-index-out). This option loads the FM index
                              from disk instead of building it again. (default:
                              "")
  -t, --n-threads arg         Maximum number of parallel threads. The program
                              is not very well optimized for parallel
                              processing, so don't expect much of a speedup here.
                              (default: 1)
  -g, --seed-len arg          Seed and extend g-mer seed length (default: 20)
  -v, --verbose               Print debug output
  -h, --help                  Print instructions.

Test run

1. Most of the tools require an FM-index, which is built with the build_FM_index command. A file containing three E. coli genomes is provided for testing, so it is used as the input here.

./bin/build_FM_index -s testcases/coli3.fna -o coli3.fmi

This writes coli3.fmi.

2. Using the FM-index stored in coli3.fmi, compute a set of baits of length 120 that covers at least 98 percent of testcases/coli3.fna while allowing up to 40 mismatches.

./bin/greedy -L 120 -d 40 -c 0.98 -s testcases/coli3.fna -f coli3.fmi -o output

-L Length of the baits. (default: 120)
-d Number of allowed mismatches in the baits. (default: 40)
-c Stop the greedy algorithm after this fraction of positions is covered. For example: 0.99. (default: 1)
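Because the FM-index can be saved and loaded again, one way to compare several coverage cutoffs is to build the index once and then run greedy repeatedly with -f. The wrapper below is only a sketch of that pattern: the function name run_cutoffs and the cover_<cutoff> output prefixes are hypothetical, while the flags are the ones listed in the help output above.

```shell
#!/usr/bin/env bash
# run_cutoffs BIN_DIR FASTA INDEX CUTOFF...
# Builds the FM-index only if INDEX does not exist yet, then runs the greedy
# bait cover once per cutoff, re-using the same index each time.
# The output prefix "cover_<cutoff>" is an arbitrary choice for this sketch.
run_cutoffs() {
  local bin="$1" fasta="$2" index="$3"; shift 3
  if [ ! -s "$index" ]; then
    "$bin/build_FM_index" -s "$fasta" -o "$index" || return 1
  fi
  local c
  for c in "$@"; do
    "$bin/greedy" -L 120 -d 40 -c "$c" -s "$fasta" -f "$index" -o "cover_$c" || return 1
  done
}
```

From the repository root, `run_cutoffs ./bin testcases/coli3.fna coli3.fmi 0.95 0.98` would build coli3.fmi if it is missing and then compute two bait sets, one per cutoff.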

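Three files are written, each named with the prefix given to -o.

The file ending in -baits.fna is a FASTA file of the generated 120-base bait sequences. The file ending in -cover-fractions.txt is a text file with one line per bait. The third file holds the cover marks, which fill_gaps takes as input through its -c option. Details are explained in the repository, which also documents several other commands.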
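A quick sanity check on the result is to count the baits and confirm their lengths. The sketch below does this with awk for any FASTA file. The function name bait_stats is hypothetical, and the file names in the usage note assume that the -o prefix "output" is simply prepended to the suffixes mentioned above.

```shell
#!/usr/bin/env bash
# bait_stats FASTA
# Prints the number of records in FASTA and the shortest and longest
# sequence length, joining sequence lines that are wrapped.
bait_stats() {
  awk '
    # Close the current record (if any) and update the running statistics.
    function flush_rec() {
      if (!have) return
      n++
      len = length(seq)
      if (n == 1 || len < min) min = len
      if (len > max) max = len
      seq = ""
    }
    /^>/ { flush_rec(); have = 1; next }  # header line starts a new record
    have { seq = seq $0 }                 # sequence line of the current record
    END { flush_rec(); printf "baits=%d min_len=%d max_len=%d\n", n, min, max }
  ' "$1"
}
```

After the run above, `bait_stats output-baits.fna` should report min_len and max_len of 120 if every bait has the requested length, and, given the one-line-per-bait description, `wc -l < output-cover-fractions.txt` would be expected to match the reported bait count.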

Citation

Syotti: scalable bait design for DNA enrichment
Jarno N Alanko, Ilya B Slizovskiy, Daniel Lokshtanov, Travis Gagie, Noelle R Noyes, Christina Boucher

Bioinformatics. 2022 Jun 24;38(Suppl 1):i177-i184