Turtle: efficient counting of high-frequency k-mers

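De Bruijn graphs built from k-mers are at the core of today's widely used genome assemblers and appear in many methods. k-mers are also used in OLC assemblers such as Celera, where they serve as seeds for overlap detection. In addition, some error-correction tools correct errors by analyzing k-mer frequencies: k-mers that occur multiple times are assumed to derive from the true genome sequence, whereas k-mers that occur only once are treated as sequencing errors.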

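A genome of size g can contain at most about g distinct k-mers. Repeats lower this number, and so does a small k, because short k-mers are more likely to recur by chance (for 5-mers there are only 4^5 = 1,024 possible sequences over A, T, G and C). Nevertheless, enumerating every distinct k-mer in sequencing data yields far more distinct k-mers than the genome size predicts. The reason is that sequencing errors affect k-mers more often than intuition suggests. For example, consider counting k-mers with k = 31 in data with a mean Phred quality score of 30. The per-base error probability is 0.1%, so each base is correct with probability 99.9%, and the probability that 31 consecutive bases are all error-free is 0.999^31, or about 96.9%. Put the other way, about 3.1% of 31-mers contain a sequencing error somewhere. At quality 20, error-free 31-mers fall to about 73% of the total. Rare genomic variants, haplotypes, and contamination add to this, so the count ends up far larger than the value predicted from the genome size. To estimate genome size, therefore, low-frequency k-mers must be excluded from the count.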
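The arithmetic above can be reproduced with a few lines of Python. This is only a sketch under a simplifying assumption: every base has the same error probability, given by the standard Phred relation p = 10^(-Q/10). The function name error_free_fraction is made up for this illustration.

```python
def error_free_fraction(phred_quality, k):
    """Probability that k consecutive bases are all called correctly,
    assuming a uniform per-base error rate of 10^(-Q/10)."""
    per_base_error = 10 ** (-phred_quality / 10)
    return (1 - per_base_error) ** k


if __name__ == "__main__":
    for q in (30, 20):
        fraction = error_free_fraction(q, 31)
        print(f"Q{q}: {fraction:.1%} of 31-mers are error-free")
```

Real reads have quality scores that vary along the read, so the true fraction of erroneous k-mers will differ from this uniform estimate.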

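Turtle is a method that counts high-frequency k-mers memory-efficiently using a Bloom filter. It also supports parallel execution, so the k-mers of very large datasets can be counted with modest resources. Specifically, it is reported to finish counting the 31-mers of a 135.3 Gb human genome dataset in under 2 hours when using 20 threads.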
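The general idea of using a Bloom filter to avoid storing k-mers that occur only once can be sketched as follows. This is a minimal illustration and not Turtle's actual implementation: the BloomFilter class, the count_frequent_kmers function, the bit-array size and the number of hashes are all hypothetical choices, and reverse complements are ignored for simplicity.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def count_frequent_kmers(reads, k):
    """Count k-mers that occur at least twice across the reads."""
    bloom = BloomFilter()
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif bloom.add(kmer):
                # Second sighting: the first one lives only in the filter.
                counts[kmer] = 2
    return counts


if __name__ == "__main__":
    print(count_frequent_kmers(["ACGTACGT", "ACGTTT"], 4))
```

Only k-mers seen a second time enter the exact table, so singletons (mostly sequencing errors) cost a few bits each instead of a full table entry. The price is that a Bloom filter false positive can occasionally let a singleton through with a count of 2.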

Installation

Installed on CentOS.

Download it from the official site.

http://bioinformatics.rutgers.edu/Software/Turtle/

> scTurtle64

$ scTurtle64

This program comes with ABSOLUTELY NO WARRANTY.

This is free software, and you are welcome to redistribute it under certain conditions. For details see the document COPYING.

Parameters received:

Please specify an input file.

scTurtle64 Usage:

scTurtle64 [arguments]

example: ./scTurtle64 -f 1Mreads.fq -o kmer_counts -k 31 -n 6000000 -t 9

-i input reads file in fasta format.

-f input reads file in fastq format. This is mutually exclusive with -i.

-o ouput files prefix. k-mers and their counts are stored in fasta format (headers indicating frequency) in multiple files named prefix0, prefix1... which the user can concatenate if desired.

-q ouput files prefix. k-mers and their counts are stored in tab delimited fromat (quake compatible) in multiple files named prefix0, prefix1... which the user can concatenate if desired.

-k k-mer length.

-t Number of threads.

-n Expected number of frequent k-mers. For uniform coverage libraries this is usually close to genome length. For single-cell libraries, 2-3 times the gemome length is recommended.

-s The approximate amount of space (in GB) to be used. It is used to indirectly compute -n and is mutually exclusive with -n. When both -n and -s are specified, the one that appears last is used.

-h Print this help menu.

-v Print software version.

Add the binary to your PATH.

Run

scTurtle32 -f reads.fq -o kmer_counts -k 31 -n 3900000 -t 8

-i input reads file in fasta format.
-f input reads file in fastq format. This is mutually exclusive with -i.
-o output files prefix. k-mers and their counts are stored in fasta format (headers indicating frequency) in multiple files named prefix0, prefix1... which the user can concatenate if desired.
-q output files prefix. k-mers and their counts are stored in tab delimited format (quake compatible) in multiple files named prefix0, prefix1... which the user can concatenate if desired.
-k k-mer length.
-t Number of threads.
-n Expected number of frequent k-mers. For uniform coverage libraries this is usually close to genome length. For single-cell libraries, 2-3 times the genome length is recommended.
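Because the results are split across prefix0, prefix1 and so on, a small script can read all parts at once. The following is a sketch only, for output written with -q. It assumes each line of the tab-delimited files is a k-mer followed by an integer count, which is not confirmed by the help text, and the function name count_histogram is hypothetical.

```python
import glob
from collections import Counter


def count_histogram(prefix):
    """Map each count value to the number of distinct k-mers having it.

    Assumes files named prefix0, prefix1, ... whose lines look like
    'KMER<TAB>COUNT' (an assumption about the -q output format).
    """
    histogram = Counter()
    for path in sorted(glob.glob(prefix + "[0-9]*")):
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) != 2:
                    continue  # skip blank or unexpected lines
                histogram[int(fields[1])] += 1
    return histogram


if __name__ == "__main__":
    for count, n_kmers in sorted(count_histogram("kmer_counts").items()):
        print(count, n_kmers)
```

The resulting histogram shows how many distinct k-mers were seen at each frequency, which is a common starting point for judging coverage.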