Squeakr, a k-mer counting tool - macでインフォマティクス

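The advent of massively parallel high-throughput sequencing (HTS) technologies has dramatically increased sequencing capacity. Many of the new computational methods developed to cope with the growing volume of HTS data use k-mers (strings of k bases) as the basic unit of sequence analysis. For example, most HTS-based genome and transcriptome assemblers use k-mers to build de Bruijn graphs (Pevzner et al., 2001; Zerbino and Birney, 2008; Bankevich et al., 2012; Simpson et al., 2009; Grabherr et al., 2011; Schulz et al., 2012). De Bruijn graph-based assembly is favored in part because it eliminates the computationally expensive overlap step of the more traditional overlap-layout-consensus approach (Koren et al., 2016).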
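To make the idea of k-mers as graph building blocks concrete, here is a minimal Python sketch. It is not how any of the cited assemblers are implemented; the function names and the two toy reads are made up for illustration. Each k-mer contributes one edge from its (k-1)-base prefix to its (k-1)-base suffix.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Prefix node -> suffix node; the k-mer itself is the edge.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical toy reads, chosen so that they overlap.
reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_edges(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

No pairwise read comparison is needed here: overlapping reads simply end up sharing the same nodes and edges.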

k-mer-based methods are also frequently used to preprocess HTS data for error correction (Liu et al., 2013; Song et al., 2014; Heo et al., 2014; Zhang et al., 2014). Even in long-read ("third-generation") assembly, k-mers serve as building blocks that help find overlaps between reads (Berlin et al., 2015; Carvalho et al., 2016) and perform error correction (Salmela and Rivals, 2014; Salmela et al., 2016).
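k-mer-based methods reduce the computational cost of many types of HTS analysis. This includes transcript quantification from RNA-seq data (Patro et al., 2014; Zhang and Wang, 2014), taxonomic assignment of metagenomic reads (Wood and Salzberg, 2014; Ounit et al., 2015), and search over large repositories of HTS-based sequencing experiments (Solomon and Kingsford, 2016).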

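Many of the analyses listed above begin by counting the number of occurrences of each k-mer in a sequencing dataset. In particular, k-mer counting is used to weed out erroneous data caused by sequencing errors. These errors most often give rise to "singleton" k-mers (i.e., k-mers that appear only once in the dataset), and the number of singletons grows linearly with the size of the underlying dataset. k-mer counting identifies all singletons so that they can be removed.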
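The counting and singleton-removal step described above can be sketched in a few lines of Python. This is only an in-memory illustration with a plain dictionary, not Squeakr's data structure; the helper names and the toy reads (the second one ends in a simulated sequencing error) are assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def apply_cutoff(counts, cutoff):
    """Keep only k-mers whose count is greater than or equal to cutoff."""
    return {kmer: c for kmer, c in counts.items() if c >= cutoff}


# Hypothetical reads: the last base of the second read simulates an error.
reads = ["ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)

singletons = sorted(kmer for kmer, c in counts.items() if c == 1)
solid = apply_cutoff(counts, cutoff=2)

print("singletons:", singletons)
print("kept:", sorted(solid.items()))
```

With a cutoff of 2, the only k-mer dropped is the one created by the simulated error, which is the effect singleton removal aims for.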
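k-mer counting has to be fast, and it is not trivial, because datasets are large and the frequency distribution of k-mers is often skewed. Because there are many competing performance concerns, such as space consumption, cache locality, and scalability across multiple threads, the system architectures of k-mer counters vary widely. (Part of the original text is omitted here.)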
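In general, recent k-mer counting methods have focused on counting performance or memory usage, with little emphasis on query performance. Many downstream analyses, however, benefit from methods that support efficient queries. Squeakr matches or outperforms existing solutions in counting performance and memory usage, and additionally offers faster queries. Most applications perform a combination of counting and querying, and in many cases queries are more common than counts. Because Squeakr answers queries very quickly, it improves overall application throughput compared with other systems.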

Installation

Tested in a miniconda3-4.3.30 environment on Ubuntu 16.04.

Dependencies

libboost-dev 1.58.0.1ubuntu1
libssl-dev 1.0.2g-1ubuntu4.6
zlib1g-dev 1:1.2.8.dfsg-2ubuntu4
bzip2 1.0.6-8

Squeakr itself: GitHub

# If you use Anaconda, Squeakr can be installed with conda
conda install -c bioconda squeakr

> squeakr count

$ squeakr count

SYNOPSIS

squeakr count [-e] -k <k-size> [-c <cutoff>] [-n] [-s <log-slots>] [-t <num-threads>] -o <out-file> <files>...

OPTIONS

-e, --exact        squeakr-exact (default is Squeakr approximate)
<k-size>           length of k-mers to count
<cutoff>           only output k-mers with count greater than or equal to cutoff (default = 1)
-n, --no-counts    only output k-mers and no counts (default = false)
<log-slots>        log of number of slots in the CQF. (Size argument is only optional when numthreads is exactly 1.)
<num-threads>      number of threads to use to count (default = number of hardware threads)
<out-file>         file in which output should be written
<files>...         list of files to be counted (supported files: fastq and compressed gzip or bzip2 fastq files)
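The <log-slots> value is a logarithm, so small changes in it mean large changes in table size. The sketch below assumes the logarithm is base 2 (the help text does not say so explicitly) and ignores any headroom the CQF may need beyond the number of distinct k-mers. Both helper names are hypothetical.

```python
def slots_from_log(log_slots):
    """Number of slots implied by a log-slots value, assuming base 2."""
    return 1 << log_slots


def min_log_slots(distinct_kmers):
    """Smallest s with 2**s >= distinct_kmers (no load-factor headroom)."""
    if distinct_kmers < 1:
        raise ValueError("distinct_kmers must be at least 1")
    return (distinct_kmers - 1).bit_length()


print(slots_from_log(20))        # slots for -s 20 under the base-2 assumption
print(min_log_slots(1_000_000))  # rough lower bound on -s for a million distinct k-mers
```

In practice the value would probably need to be somewhat larger than this lower bound; treat the result as a starting point for the estimate, not a recommendation.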

> squeakr query

$ squeakr query

SYNOPSIS

squeakr query -f <squeakr-file> -q <query-file> -o <output-file>

OPTIONS

<squeakr-file>    input squeakr file
<query-file>      input query file
<output-file>     output file

> squeakr inner_prod

$ squeakr inner_prod

SYNOPSIS

squeakr inner_prod <first-input> <second-input>

OPTIONS

<first-input>     first input squeakr file
<second-input>    second input squeakr file
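The help text does not describe what inner_prod computes. Going by the name, a reasonable guess is the inner (dot) product of the two k-mer count vectors: for every k-mer present in both inputs, multiply the two counts and sum the results. The Python sketch below illustrates that interpretation on plain dictionaries; it is an assumption, not Squeakr's code, and the sample counts are made up.

```python
def inner_product(counts_a, counts_b):
    """Sum of count_a * count_b over k-mers present in both tables."""
    # Iterate over the smaller table and look up in the larger one.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(c * counts_b.get(kmer, 0) for kmer, c in counts_a.items())


# Hypothetical k-mer counts for two samples.
sample_a = {"ACGT": 3, "CGTA": 2, "TTTT": 1}
sample_b = {"ACGT": 1, "CGTA": 4, "GGGG": 5}

print(inner_product(sample_a, sample_b))
```

k-mers found in only one sample contribute nothing, so the value grows with the amount of shared, high-count content between the two datasets.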

> squeakr list

$ squeakr list

SYNOPSIS

squeakr list -f <squeakr-file> -o <output-file>

OPTIONS

<squeakr-file>    input squeakr file
<output-file>     output file

Test run

squeakr count

Count the k-mers in a FASTQ file.

git clone https://github.com/splatlab/squeakr.git
cd squeakr/
squeakr count -e -k 28 -s 20 -t 1 -o data/tmp.squeakr data/test.fastq

-g gzip compressed fastq
-b bzip2 compressed fastq
-t number of threads to use to count (default = number of hardware threads)
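When counting many samples, it can be convenient to assemble the command line programmatically. The following Python sketch builds an argument list using only the options shown in the squeakr count synopsis above; the function name and its parameters are hypothetical, and the script only prints the command rather than running it.

```python
def squeakr_count_cmd(k, out_file, fastq_files, exact=False, cutoff=None,
                      log_slots=None, threads=None):
    """Build an argv list for `squeakr count` from the synopsis options."""
    cmd = ["squeakr", "count"]
    if exact:
        cmd.append("-e")
    cmd += ["-k", str(k)]
    if cutoff is not None:
        cmd += ["-c", str(cutoff)]
    if log_slots is not None:
        cmd += ["-s", str(log_slots)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    cmd += ["-o", out_file, *fastq_files]
    return cmd


# Reproduces the test-run command used in this post.
cmd = squeakr_count_cmd(28, "data/tmp.squeakr", ["data/test.fastq"],
                        exact=True, log_slots=20, threads=1)
print(" ".join(cmd))
```

To actually run it, the list could be passed to subprocess.run(cmd, check=True) on a machine where squeakr is installed and the input files exist.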

Query k-mers and write the result to a file

squeakr query -f data/tmp.squeakr -q data/query_file -o data/query.output

Citation

Squeakr: an exact and approximate k-mer counting system

Pandey P, Bender MA, Johnson R, Patro R, Berger B.

Bioinformatics. 2018 Feb 15;34(4):568-575.