QCluster: clustering FASTQ reads - macでインフォマティクス

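The volume of data generated by next-generation sequencing (NGS) technologies is growing at a pace that challenges the storage and processing capacity of current computer systems [ref.1]. Current technologies produce over 500 billion bases of DNA per run (as of the time the paper was written), and upcoming sequencers promise to raise this throughput further. Rapid progress in sequencing technology has enabled many different sequencing-based applications, such as genome resequencing, RNA-Seq, and ChIP-Seq [ref.2]. Handling and processing such large files has become one of the major challenges for most genomic research projects.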

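Alignment-based methods have long been used to establish similarity between sequences [ref.3]. However, there are cases where alignment methods cannot be applied or are not suitable. For example, whole-genome comparison cannot be carried out with conventional alignment techniques [ref.4-6]. Although fast alignment heuristics exist, alignment is usually time-consuming, which makes it a poor fit for the large-scale sequence data produced by NGS technologies [ref.7,8]. For these reasons, many alignment-free techniques have been proposed over the years [ref.9].
Alignment-free methods for comparing sequences have proved useful in a range of applications. Some alignment-free methods use pattern distributions to study evolutionary relationships among different organisms [ref.4,10,11]. Several alignment-free methods have been devised for enhancer detection in ChIP-Seq data [ref.12-14] and for entropic profiles [ref.15,16]. Another application is the classification of remotely related proteins, which can be addressed with sophisticated word-counting procedures [ref.17,18]. Assembly-free comparison of genomes based on NGS reads has only recently been investigated [ref.7,8]. For a comprehensive review, see [ref.9].
This study explores the ability of alignment-free measures to cluster reads. Clustering techniques are widely used in many different applications, from error correction [ref.19] to the discovery of microRNAs [ref.20]. (Partially omitted.)
Solovyov et al. [ref.21] presented one of the first comparisons of alignment-free methods for clustering NGS reads. They focused on clustering reads that originate from different genes and different species on the basis of k-mer counts. They showed that D-type measures (see Section 2 of the paper), in particular D2*, can efficiently detect and cluster reads from the same gene or species (in contrast to ref.20, where the focus is on errors). The paper extends this work by incorporating quality information into these measures.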
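To make the k-mer based measures mentioned above a little more concrete, here is a minimal Python sketch of k-mer counting, the D2 statistic (inner product of two count vectors), and a D2*-style statistic in which the counts are centered and scaled by their expected values. This is not QCluster's code. It assumes an i.i.d. base model with user-supplied base probabilities and uses the normalization sqrt(n*m)*p_w, which is one common form in the alignment-free literature; the function names are made up for this illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping k-mers in one read."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y):
    """D2: inner product of two k-mer count vectors."""
    return sum(count * y[word] for word, count in x.items())


def word_prob(word, base_prob):
    """Probability of a word under an i.i.d. base model (an assumption)."""
    p = 1.0
    for base in word:
        p *= base_prob[base]
    return p


def d2_star(seq_x, seq_y, k, base_prob):
    """D2*-style statistic: centered counts scaled by the expected count."""
    x, y = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    n = len(seq_x) - k + 1  # number of k-mers in seq_x
    m = len(seq_y) - k + 1  # number of k-mers in seq_y
    total = 0.0
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        p = word_prob(w, base_prob)
        total += (x[w] - n * p) * (y[w] - m * p) / (sqrt(n * m) * p)
    return total
```

For example, `d2(kmer_counts("ACGT", 2), kmer_counts("ACGA", 2))` counts the two shared 2-mers (AC and CG). Larger values of either statistic indicate more similar word composition.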

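Quality scores produced by NGS platforms are fundamental to many analyses of NGS data, including mapping reads to a reference genome, error correction [ref.19], detection of insertions and deletions [ref.23], and many others. Moreover, future generations of sequencing technology are expected to produce long reads containing a large number of erroneous bases [ref.24]. Because the average error rate per read rises to as much as 15%, it will be essential to exploit quality information in alignment-free frameworks and in de novo assembly.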
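As a rough illustration of how quality values can enter a k-mer count, the sketch below converts Phred quality characters into error probabilities and weights each k-mer occurrence by the probability that all of its bases are correct. It assumes Phred+33 encoding (the usual Sanger/Illumina 1.8+ FASTQ convention) and is only meant to convey the idea; the function names are hypothetical and QCluster's actual weighting, expected-value and normalization options (-e, -N, -R) are not reproduced here.

```python
from collections import defaultdict


def phred_to_error_prob(qual_char, offset=33):
    """Convert one FASTQ quality character to an error probability."""
    q = ord(qual_char) - offset
    return 10 ** (-q / 10)


def quality_weighted_counts(seq, qual, k, offset=33):
    """Sum, for each k-mer, the probability that each occurrence is error-free."""
    correct = [1.0 - phred_to_error_prob(c, offset) for c in qual]
    counts = defaultdict(float)
    for i in range(len(seq) - k + 1):
        weight = 1.0
        for p in correct[i:i + k]:
            weight *= p  # all k bases must be correct
        counts[seq[i:i + k]] += weight
    return dict(counts)
```

With this weighting, a k-mer that overlaps a low-quality base contributes much less than one made of high-quality bases, whereas plain counting treats both occurrences the same.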

(Remainder omitted.)

Installation

Tested on CentOS 6.

Official website

http://www.dei.unipd.it/~ciompin/main/qcluster.html

Download the ZIP archive, extract it, and run make.

cd qCluster/
make

$ ./qCluster

Missing/extra input file name

Centroid based (k-means-like) clustering of sequences in n-mer frequency space

usage: /Users/kazumaxneo/Downloads/qCluster/qCluster [-h] [-c num_clusters] [-d dist_type] [-e num_method] [-k nmer_length] [-m max_iterations] [-N num_method] [-n] [-P num_method] [-p pseudocount] [-R] [-r] [-S seed] [-t num_trials] [-v] [-w] fastq_file

-c num_clusters (5 clusters by default)

-d dist_type:

a: d2* distance

c: chi square statistic

d: d2 distance

e: regular euclidean (L2) distance; default

k: Kullback-Leibler divergence

s: symmetrized Kullback-Leibler divergence

-e num_method: method for computation of quality expected

value of words:

1: average quality of the word over the full dataset (default)

2: average quality of each base in the same word over the full

dataset and calculate the expected value with Markovian model

3: average quality of each baseover the full dataset and

calculate the expected value with Markovian model

-h print this message and exit

-k nmer_length: length of word (2-mers by default)

-m max_iterations: number of maximum iterations of the algorithm

without an improvement (min. 2)

-N num_method: Divide each quality vector by the taxicab norm so

that the sum of the elements is (about) 1. Possible values:

0: no normalization

1: divide by the norm of the vector itself

2: divide by the norm of the frequency vector (default)

3: as method 2 and project the vector on the hyperplane 1*X-1=0

-n normalize frequency matrix to make each column univariant;

implies L2 distance

-P num_method: method for computation of expected frequancy of words:

1: uses a markovian model to compute the expected frequency of

words basing only on the words of the cluster

(P2Local methond: default)

2: Average frequency of every word over the entire dataset

(P1Global method)

3: Markovian model based on average frequency of single bases

on the full dataset (P2Global method)

-p pseudocount: (default: 1); can be fractional.

Must be nonzero with KL and simmetrized KL distance.

-R do not redistribute missing quality among other bases.

-r reverse complement and stack together

-S seed: initial seed for random number generator

-t num_trials: repeat clustering num_trials, choosing the best

partitioning. Default: 1

-v output progress messages (repeat for increased verbosity)

-w write sequences from each cluster to a file
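The usage text describes the tool as "centroid based (k-means-like) clustering of sequences in n-mer frequency space". The following Python sketch shows what that can look like in its simplest form: each read becomes a vector of k-mer frequencies, and a plain k-means loop with squared Euclidean distance (the default `-d e`) assigns reads to the nearest centroid. This is an assumption-laden toy, not QCluster's implementation: the other distances, quality weighting, multiple trials and the exact stopping rule of `-m` are left out, and all function names are made up.

```python
import random
from itertools import product


def freq_vector(seq, k):
    """Map a read to its vector of k-mer frequencies over the ACGT alphabet."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    vec = [0.0] * len(words)
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip words containing N or other symbols
            vec[j] += 1.0
            total += 1
    return [v / total for v in vec] if total else vec


def sq_euclidean(a, b):
    """Squared Euclidean (L2) distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, num_clusters, max_iterations=100, seed=0):
    """Plain k-means: returns one cluster label per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, num_clusters)]
    labels = [None] * len(vectors)
    for _ in range(max_iterations):
        new_labels = [
            min(range(num_clusters), key=lambda c: sq_euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(num_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Usage: `kmeans([freq_vector(r, 2) for r in reads], 2)` returns a list of labels, one per read. Because the result depends on the random initial centroids, repeating the clustering and keeping the best partition (what `-t` does in QCluster) is a common remedy.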

Run

Test run: classify the reads into three clusters.

./qCluster -d a -c 3 -k 3 -t 5 -S 0 -w Example/sequences.fastq

-c num_clusters (5 clusters by default)
-d dist_type:

a : d2* distance
c : chi square statistic
d : d2 distance
e : regular euclidean (L2) distance; default
k : Kullback-Leibler divergence
s : symmetrized Kullback-Leibler divergence

-t num_trials: repeat clustering num_trials, choosing the best partitioning. Default: 1

Three FASTQ files are output.
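If you want to call the tool from a script, for example to try several distance types in a row, a small Python wrapper along the following lines may help. It only uses flags that appear in the usage text above. The binary path `./qCluster` and the function names are assumptions for this sketch, and nothing is assumed about the names of the output files.

```python
import subprocess

# Distance codes listed under -d in the usage text.
DIST_TYPES = {"a", "c", "d", "e", "k", "s"}


def build_command(fastq, num_clusters=3, dist_type="a", k=3,
                  trials=5, seed=0, write=True, binary="./qCluster"):
    """Build the argument list for a qCluster run (binary path is assumed)."""
    if dist_type not in DIST_TYPES:
        raise ValueError(f"unknown dist_type: {dist_type!r}")
    cmd = [binary, "-d", dist_type, "-c", str(num_clusters),
           "-k", str(k), "-t", str(trials), "-S", str(seed)]
    if write:
        cmd.append("-w")  # write sequences from each cluster to a file
    cmd.append(fastq)
    return cmd


def run_qcluster(fastq, **options):
    """Run qCluster and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(fastq, **options), check=True)
```

Calling `run_qcluster("Example/sequences.fastq")` with the defaults above reproduces the test run, and passing `dist_type="d"` switches to the D2 distance.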

Citation

QCluster: Extending Alignment-Free Measures with Quality Values for Reads Clustering

Matteo Comin, Andrea Leoni, Michele Schimd

WABI 2014: Algorithms in Bioinformatics pp 1-13