CASPER: merging paired-end reads while removing errors

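Merging the forward and reverse reads from paired-end sequencing can substantially improve the performance of downstream tasks such as genome assembly and mapping (leaving aside the question of insert size). However, the error rate rises rapidly toward the end of a read as the sequencing limit is approached, and these errors are a major hurdle to merging pairs accurately.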

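CASPER is a tool that merges overlapping paired-end reads quickly and reliably. It is reported to substantially outperform existing paired-end merging tools in both accuracy and robustness. It supports multithreading and runs fast.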

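CASPER uses quality scores and the k-mer spectrum for merging. When overlapping paired reads disagree at a position and the difference between the two quality scores is large, CASPER trusts the base with the higher quality score and corrects the other (BBtools does the same). Otherwise, CASPER instead examines the k-mers of the sequence around the mismatch and makes the decision statistically.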
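The two-stage rule described above can be sketched roughly as follows. This is a minimal illustration, not CASPER's actual implementation: the function names are hypothetical, and the "statistical" step is replaced here by a simple assumption (pick the base whose surrounding k-mers are more abundant in the read set). Only the defaults `k=17` and `d=19` come from the tool's help text.

```python
from collections import Counter


def count_kmers(reads, k):
    """Build a k-mer spectrum (k-mer -> occurrence count) from a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def context_support(seq, pos, base, k, kmer_counts):
    """Sum the counts of every k-mer covering `pos` when seq[pos] is set to `base`."""
    candidate = seq[:pos] + base + seq[pos + 1:]
    start = max(0, pos - k + 1)
    end = min(pos, len(candidate) - k)
    return sum(kmer_counts[candidate[i:i + k]] for i in range(start, end + 1))


def resolve_mismatch(seq, pos, base_f, qual_f, base_r, qual_r,
                     kmer_counts, k=17, d=19):
    """Choose a base for one mismatching position in the overlap.

    Stage 1: if the quality scores differ by at least `d`, trust the higher one.
    Stage 2 (assumed stand-in for CASPER's statistics): otherwise prefer the
    base whose k-mer context is better supported by the spectrum.
    """
    if abs(qual_f - qual_r) >= d:
        return base_f if qual_f > qual_r else base_r
    support_f = context_support(seq, pos, base_f, k, kmer_counts)
    support_r = context_support(seq, pos, base_r, k, kmer_counts)
    if support_f != support_r:
        return base_f if support_f > support_r else base_r
    # Tie in context support: fall back to the higher quality score.
    return base_f if qual_f >= qual_r else base_r
```

Only the difference between the two quality scores is used, so the sketch does not depend on whether the scores were decoded from PHRED+33 or PHRED+64.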

Reproduced from the official site

Official site

http://best.snu.ac.kr/casper/

Installation

Installed on CentOS.

Dependencies

jellyfish (version 2.2.3 or higher)
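A quick way to check the version requirement before building is sketched below. It assumes that `jellyfish --version` prints a string containing a dotted version number such as "jellyfish 2.2.3"; the function names are hypothetical.

```python
import re
import subprocess

MIN_JELLYFISH = (2, 2, 3)  # minimum version listed as a CASPER dependency


def parse_version(text):
    """Extract the first major.minor.patch number from `text` as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def jellyfish_is_recent_enough(version_text):
    """Return True if the reported version is at least MIN_JELLYFISH."""
    return parse_version(version_text) >= MIN_JELLYFISH


def installed_jellyfish_version_text():
    """Ask the jellyfish binary on PATH for its version.

    Assumption: `jellyfish --version` writes the version string to stdout.
    """
    result = subprocess.run(["jellyfish", "--version"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Comparing tuples of integers rather than strings avoids ranking 2.2.10 below 2.2.3.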

Download the source code from the official site.

tar xvf casper_v0.8.2.tar
cd casper_v0.8.2/
make
./casper -h # show help

$ casper -h

=============================================================================

Usage : CASPER forward.fastq reverse.fastq [OPTIONS]

[MANDATORY]

Input forward side FASTQ file first.

Input reverse side FASTQ file following the forward file.

[OPTIONS]

-t <int> The number of threads for parallel proessing

(default=40 up to maximum number of system limit)

-k <int> The size of k-mers used to represent contexts around

mismatching bases. (default=17)

-d <int> Threshold for difference of quality-scores

Context-based mismatch resolution starts if quality scores

differ less than 'd'.

Smaller value indicates more trust to quality scores than k-mer context.

(default=19)

-g <float> Threshold for mismatch ratio of best overlap region

CASPER gives up merging if the mismatch ratio in the overlap

is greater than 'g' and leaves the two reads unmerged.

If all the reads have overlap then set 'g' as default or higher.

Or if you want sensitive for not merging(TN) then set 'g' as

lower than default. (0.27 or lower is recommended)

(default=0.5)

-w <int> The minimum length (in bp) of the overlap between forward

and reverse reads. (default=10bp)

-o <str> Prefix of output (default=casper)

By default, 'casper.fastq' <- merged output is generated.

-j Internal naive k-mer counting method is used instead of Jellyfish.

By default (without this option), Jellyfish (for k-mer counting)

is used to speed up.

-l CASPER can generate the unmerged output file.

prefix_for_left.fastq, prefix_rev_left.fastq for forward, reverse

individually.

-h Help for usage information

-v Version information

* CASPER do not need PHRED offset. Either PHRED+64 or PHRED+33 is OK.

Only the difference between two quality scores instead of absolute

value is used.

[Examples]

case1: using Jellyfish, output prefix is out, k-mer=19, threads=6,

$ casper forward.fastq reverse.fastq -o out -k 19 -t 6

case2: without Jellyfish, give up threshold=0.27

$ casper forward.fastq reverse.fastq -j -g 0.27

=============================================================================
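To make the `-w` and `-g` options in the help above more concrete, here is a simplified sketch of an overlap search. It is not CASPER's own algorithm: it assumes the overlap is a suffix of the forward read aligned against a prefix of the reverse-complemented reverse read, ignores quality scores, and uses hypothetical function names. The defaults mirror the help text (minimum overlap 10 bp, give-up ratio 0.5).

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def best_overlap(forward, reverse, min_overlap=10, max_mismatch_ratio=0.5):
    """Find the overlap length with the lowest mismatch ratio.

    Returns (overlap_length, mismatch_ratio), or None when no candidate
    reaches `min_overlap` or the best ratio exceeds `max_mismatch_ratio`
    (the pair would then be left unmerged, as with -g).
    """
    rc = reverse_complement(reverse)
    best = None
    for length in range(min_overlap, min(len(forward), len(rc)) + 1):
        tail = forward[-length:]   # end of the forward read
        head = rc[:length]         # start of the reverse-complemented mate
        mismatches = sum(a != b for a, b in zip(tail, head))
        ratio = mismatches / length
        # Prefer a lower ratio; on ties prefer the longer overlap.
        if best is None or ratio < best[1] or (ratio == best[1] and length > best[0]):
            best = (length, ratio)
    if best is None or best[1] > max_mismatch_ratio:
        return None
    return best
```

Lowering `max_mismatch_ratio` toward 0.27, as the help recommends for data where some pairs do not truly overlap, makes this kind of search reject more spurious overlaps at the cost of leaving some genuine pairs unmerged.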

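Move the casper binary into a directory that is on your PATH.

Run

Download one of the official test datasets (both simulated and real data are provided) and run CASPER on it. Here I downloaded A4. Extract the archive, change into the directory, and run.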

wget http://best.snu.ac.kr/casper/data/A4.tar.gz
tar zxvf A4.tar.gz
cd A4/

# Run
casper A4_1.fastq A4_2.fastq -o out -k 19 -t 6

-o Prefix of output (default=casper) By default, 'casper.fastq' <- merged output is generated.
-k The size of k-mers used to represent contexts around mismatching bases. (default=17)
-t The number of threads for parallel proessing (default=40 up to maximum number of system limit)

$ casper A4_1.fastq A4_2.fastq -o out -k 19 -t 6

=============================================================================

[CASPER] Context-Aware Scheme for Paired-End Read

Input Files

- Forward file : A4_1.fastq

- Reverse file : A4_2.fastq

Parameters

- Number of threads for parallel processing : 6

- K-mer size : 19

- Threshold for difference of quality score : 19

- Threshold for mismatching ratio : 0.5

- Minimum length of overlap : 10

- Using Jellyfish : true

K-mers : Jellyfish

- jellyfish count -m 19 -L 2 -o outjellykmer -c 3 -s 10M -t 6 A4_1.fastq

- jellyfish count -m 19 -L 2 -o outjellykmer -c 3 -s 10M -t 6 A4_2.fastq

Output Files

- Merged output file : out.fastq

Merging Result Statistics

- Total number of reads : 1000000

- Number of merged reads : 999936 (99.99%)

- Number of unmerged reads : 64 (0.01%)

- TIME for total processing : 28.012 sec

=============================================================================

99.99% of the reads were merged.
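The merged fraction can be cross-checked independently of the log by counting records in the output. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines); the function names are hypothetical, and `out.fastq` is the merged file reported in the log above.

```python
def count_fastq_records(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    with open(path) as handle:
        lines = sum(1 for _ in handle)
    if lines % 4 != 0:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def merge_rate(merged, total):
    """Percentage of read pairs that were merged."""
    return 100.0 * merged / total
```

For example, `merge_rate(count_fastq_records("out.fastq"), count_fastq_records("A4_1.fastq"))` should agree with the percentage printed by CASPER if the assumptions hold.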

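CASPER can also be used for 16S amplicon sequencing, but for 16S data, more refined tools that incorporate CASPER have been published over the past two years, so consider those as well (MeFit, for example). If you are working with ancient DNA samples, consider a more error-tolerant tool such as leeHom.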

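Addendum

I also tested the data above with the BBtools merge command (using the loose option, which maximizes the number of merges), and CASPER merged considerably more reads. Sensitivity and precision tend to trade off, so no merger will work well on every dataset (serial direct repeats are one example), but for data with a short insert size, merging may be worth building into the workflow.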
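Whether a library is a candidate for merging at all can be estimated from the read length and insert size. The sketch below assumes that "insert size" means the full fragment length covered by both reads; the function names are hypothetical, and the default minimum overlap of 10 bp is taken from CASPER's `-w` option.

```python
def expected_overlap(read_length, insert_size):
    """Expected overlap in bp between the two reads of a pair.

    Assumes both reads have the same length and that `insert_size` is the
    full fragment length. The overlap cannot exceed the fragment itself.
    """
    return min(2 * read_length - insert_size, insert_size)


def can_merge(read_length, insert_size, min_overlap=10):
    """True if the expected overlap reaches the minimum overlap length."""
    return expected_overlap(read_length, insert_size) >= min_overlap
```

With 2 x 250 bp reads and a 450 bp fragment the expected overlap is 50 bp, so merging is plausible; with 2 x 150 bp reads and a 500 bp fragment the value is negative and the reads do not meet.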

Reference

CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing.

Kwon S, Lee B, Yoon S.

BMC Bioinformatics. 2014;15 Suppl 9:S10.