高速なk-merカウントツール KMC3 - macでインフォマティクス

2019 9/11 インストール追記

KMCは高速なk-merカウントの方法論。初代KMC、KMC2、KMC3が発表されている。ここではversion3のKMC3について記載する。ヒトゲノムの619GBのgz圧縮fastqを89分で分析できたと書かれている（2.3GHzの12コア、HDD２台のストライピング読み書き）(注1)。

インストール

公式HP mac、windows、linuxのバイナリが用意されている。

http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download

./kmc #動作確認

#bioconda (link)
conda install -c bioconda -y kmc

$ kmc -h

K-Mer Counter (KMC) ver. 3.0.0 (2017-01-28)

Usage:

kmc [options] <input_file_name> <output_file_name> <working_directory>

kmc [options] <@input_file_names> <output_file_name> <working_directory>

Parameters:

input_file_name - single file in FASTQ format (gziped or not)

@input_file_names - file name with list of input files in FASTQ format (gziped or not)

Options:

-v - verbose mode (shows all parameter settings); default: false

-k<len> - k-mer length (k from 1 to 256; default: 25)

-m<size> - max amount of RAM in GB (from 1 to 1024); default: 12

-sm - use strict memory mode (memory limit from -m<n> switch will not be exceeded)

-p<par> - signature length (5, 6, 7, 8, 9, 10, 11); default: 9

-f<a/q/m> - input in FASTA format (-fa), FASTQ format (-fq) or multi FASTA (-fm); default: FASTQ

-ci<value> - exclude k-mers occurring less than <value> times (default: 2)

-cs<value> - maximal value of a counter (default: 255)

-cx<value> - exclude k-mers occurring more of than <value> times (default: 1e9)

-b - turn off transformation of k-mers into canonical form

-r - turn on RAM-only mode

-n<value> - number of bins

-t<value> - total number of threads (default: no. of CPU cores)

-sf<value> - number of FASTQ reading threads

-sp<value> - number of splitting threads

-sr<value> - number of threads for 2nd stage

Example:

kmc -k27 -m24 NA19238.fastq NA.res \data\kmc_tmp_dir\

kmc -k27 -m24 @files.lst NA.res \data\kmc_tmp_dir\

$ kmc_dump -h

KMC dump ver. 3.0.0 (2017-01-28)

Usage:

kmc_dump [options] <kmc_database> <output_file>

Parameters:

<kmc_database> - kmer_counter's output

Options:

-ci<value> - exclude k-mers occurring less than <value> times

-cx<value> - exclude k-mers occurring more of than <value> times

$ kmc_tools

kmc_tools ver. 3.0.0 (2017-01-28)

Usage:

kmc_tools [global parameters] <operation> [operation parameters]

Available operations:

transform - transforms single KMC's database

simple - performs set operation on two KMC's databases

complex - performs set operation on multiple KMC's databases

filter - filter out reads with too small number of k-mers

global parameters:

-t<value> - total number of threads (default: no. of CPU cores)

-v - enable verbose mode (shows some information) (default: false)

-hp - hide percentage progress (default: false)

Example:

kmc_tools simple db1 -ci3 db2 -ci5 -cx300 union db1_union_db2 -ci10

For detailed help of concrete operation type operation name without parameters:

kmc_tools simple

brewでもversion3が導入できる。

実行方法

k=27のカウント。ペアエンドなら先にコンカテネートしておく（cat *fq > merggd.fq) 。

mkdir working_directory #先にtempoを作らないとエラーになる
kmc -k27 -m24 -ci1 input.fq database working_directory

-k <len>　k-mer length (k from 1 to 256; default: 25)
-m <size>　max amount of RAM in GB (from 1 to 1024); default: 12
-ci <value> exclude k-mers occurring less than <value> times (default: 2)

Stage 1: 100%

Stage 2: 100%

1st stage: 0.636184s

2nd stage: 2.13204s

Total : 2.76823s

Tmp size : 5MB

Stats:

No. of k-mers below min. threshold : 0

No. of k-mers above max. threshold : 0

No. of unique k-mers : 100420

No. of unique counted k-mers : 100420

Total no. of k-mers : 4400000

Total no. of reads : 100000

Total no. of super-k-mers : 511250

データベースdatabaseが作成される。 No. of k-mers below minは-ciの指定以下のk-merの数である。-ci1にしておけばゼロになる。

データベースを元に、テキスト形式で27-merの頻度を書き出す。

kmc_dump database count
head count #開く

$ head count

AAAAAAACAGGCGAAATCATCGATTAA 15

AAAAAAAGCCTGCGTTCCCCCGGTTGC 13

AAAAAAATACTAGGGGTTTGGGATGAA 23

AAAAAAATCAAGCTCTGGGGTTGGTGC 19

AAAAAAATCCGGTTCAACAGGCGATCG 13

AAAAAAATTACTAGGGGTTTCGAGAAC 21

AAAAAAATTCTTATTGCCCTGGTTAAA 11

AAAAAAATTTGGAATGATGGCAGTTAT 12

AAAAAACAGGCGAAATCATCGATTAAT 15

AAAAAACCAATAATGGTGGCATAGGCT 24

頻度3以下だけ数える。

 kmc_dump -cx3 database low_k-mer_output

-ci <value> - exclude k-mers occurring less than <value> times
-cx <value> - exclude k-mers occurring more of than <value> times

もう１つのコマンドkmc_toolsを試す。ここでは低頻度なk-merを含むfastqを除去するフィルタリングを実行する。remove low k-mer containing fastq)。

頻度が50-100%のk-merのfastqを取り出す。データベースのパラメータ（ciとcx）と、入力fastqのパラメータ（ciとcx）を個別に指定する。つまりデータベースとは異なるfastqのlow-kmer fastqを除去することもできる。

kmc_tools filter database -ci3 input.fq -ci0.1 -cx1.0 filtered.fq

データベースのパラメータ

-ci <value> exclude k-mers occurring less than <value> times
-cx <value> exclude k-mers occurring more of than <value> times

入力fastqのパラメータ（ -ci0.1 -cx0.8の部分）

-ci <value> remove reads containing less k-mers than value. It can be integer or floating number in range [0.0;1.0] (default: 2)
-cx <value> remove reads containing more k-mers than value. It can be integer or floating number in range [0.0;1.0] (default: 1e9)
-t trim reads on first invalid k-mer instead of remove entirely

filtered.fqが出力される。-tでトリムも可能。

レアなk-merを除くなら、 -ci0.5位が妥当のように感じます。

自分自身のデータベースを使い、低頻度なk-merのfastqを除く。

kmc_tools filter database -ci10 input.fq filtered.fq

300bpのシーケンスで、500MBくらいのファイルサイズなら-ci10でわずかなコンタミはかなり消せます。ただし、シーケンスエラーを持っていたり（phred quality scoreが平均30の100-bpのリードでも、10本中１本くらいはシーケンスエラーを持つ）、シーケンスバイアスでlow coverageな領域のリードももなくなる可能性が高いことに注意してください。

エラーの多いfastqをエラー訂正し、除ききれなかったlow k-merをKMCで除く２段階クリーニングも可能かもしれない（高品質データでデータベースhigh_quality_dbを作成し、それと照合する）。

macでエラーが出るなら、ulimitの値を変えることで、いちおう対処療法的に修復可能。