2022-03-29

ロングリードを高効率に圧縮する CoLoRd

今日のゲノム研究において、シーケンサー実験によって毎年生み出されるエクサバイト級のデータを維持するためのコストが大きな問題となっている。第三世代シーケンサーの普及にもかかわらず、ロングリードを圧縮する既存のアルゴリズムは、汎用のgzipに対してわずかな優位性しか示していない。本発表では、第三世代シーケンサーのデータサイズを、解析の精度に影響を与えることなく、一桁以上削減できるアルゴリズム、CoLoRdを紹介する。

インストール

Github

#conda (link)
mamba install -c bioconda colord

#from source
git clone https://github.com/refresh-bio/colord
cd colord && make
cd bin

#macOSでは使用前にファイルディスクリプタの上限を増やすことが推奨されている
ulimit -n 2048

> colord

CoLoRd: Compressing long reads

version: 1.1.0

Usage: colord [OPTIONS] SUBCOMMAND

Options:

-h,--help Print this help message and exit

Subcommands:

compress-ont compress ONT data

compress-pbraw compress PacBio Raw data

compress-pbhifi compress PacBio HiFi data

decompress decompression mode

info print archive informations

> colord compress-ont -h

compress ONT data

Usage: colord compress-ont [OPTIONS] input output

Positionals:

input TEXT:FILE REQUIRED input FASTQ/FASTA path (gzipped or not)

output TEXT REQUIRED archive path

Options:

-h,--help Print this help message and exit

-k,--kmer-len UINT:INT in [15 - 28]

k-mer length (default: auto adjust)

-t,--threads UINT=32 number of threads

-p,--priority TEXT:{balanced,memory,ratio}

compression quality

-q,--qual TEXT:{2-avg,2-fix,4-avg,4-fix,5-avg,5-fix,avg,none,org}=4-avg

quality compression mode

* org - original,

* none - discard (Q1 for all bases),

* avg - average over entire file,

* 2-fix, 4-fix, 5-fix - 2/4/5 bins with fixed representatives,

* 2-avg, 4-avg, 5-avg - 2/4/5 bins with averages as representatives.

-T,--qual-thresholds UINT ...

quality thresholds,

* single value for '2-fix' mode (default is 7),

* single value for '2-avg' mode (default is 7),

* three values for '4-fix' mode (default is 7 14 26),

* three values for '4-avg' mode (default is 7 14 26),

* four values for '5-fix' mode (default is 7 14 26 93),

* four values for '5-avg' mode (default is 7 14 26 93),

* not allowed for 'avg', 'org' and 'none' modes

-D,--qual-values UINT ...

quality values for decompression,

* single value for 'none' mode (default is 0),

* two values for '2-fix' mode (default is 1 13),

* four values for '4-fix' mode (default is 3 10 18 35),

* five values for '5-fix' mode (default is 3 10 18 35 93),

* not allowed for 'avg', 'org', '2-avg', '4-avg' and '5-avg' modes

-G,--reference-genome TEXT:FILE

optional reference genome path (multi-FASTA gzipped or not), enables reference-based mode which provides better compression ratios

-s,--store-reference stores the reference genome in the archive, use only with `-G` flag

-v,--verbose verbose

-a,--anchor-len UINT anchor len (default: auto adjust)

-L,--Lowest-count UINT=4 minimal k-mer count

-H,--Highest-count UINT=80 maximal k-mer count

-f,--filter-modulo UINT=12 k-mers for which hash(k-mer) mod f != 0 will be filtered out before graph building

-c,--max-candidates UINT:POSITIVE=5

maximal number of reference reads considered as reference

-e,--edit-script-mult FLOAT:POSITIVE=1

multipier for predicted cost of storing read part as edit script

-r,--max-recurence-level UINT:NONNEGATIVE=3

maximal level of recurence when considering alternative reference reads

--min-to-alt UINT:POSITIVE=64

minimum length of encoding part to consider using alternative read

--min-mmer-frac FLOAT:NONNEGATIVE=0.5

if A is set of m-mers in encode read R then read is refused from encoding if |A| < min-mmer-frac * len(R)

--min-mmer-force-enc FLOAT:NONNEGATIVE=0.9

if A is set of m-mers in encode read R then read is accepted to encoding always if |A| > min-mmer-force-enc * len(R)

--max-matches-mult FLOAT:NONNEGATIVE=10

if the number of matches between encode read R and reference read is r, then read is refused from encoding if r > max-matches-mult * len(R)

--fill-factor-filtered-kmers FLOAT:FLOAT in [0.1 - 0.99]=0.75

fill factor of filtered k-mers hash table

--fill-factor-kmers-to-reads FLOAT:FLOAT in [0.1 - 0.99]=0.8

fill factor of k-mers to reads hash table

--min-anchors UINT=1 if number of anchors common to encode read and reference candidate is lower than minAnchors candidate is refused

-i,--identifier TEXT:{main,none,org}=org

header compression mode

-R,--Ref-reads-mode TEXT:{all,sparse}=sparse

reference reads mode

-g,--sparse-range FLOAT=1 sparse mode range. The propability of reference read acceptance is 1/pow(id/range_reads, exponent), where range_reads is determined based on the number of symbols, which in turn is determined by the number of trusted unique k-mers (estimated genome length) multiplied by the value of this parameter

-x,--sparse-exponent FLOAT=1

sparse mode exponent

> colord compress-pbraw -h

compress PacBio Raw data

Usage: colord compress-pbraw [OPTIONS] input output

Positionals:

input TEXT:FILE REQUIRED input FASTQ/FASTA path (gzipped or not)

output TEXT REQUIRED archive path

Options:

-h,--help Print this help message and exit

-k,--kmer-len UINT:INT in [15 - 28]

k-mer length (default: auto adjust)

-t,--threads UINT=32 number of threads

-p,--priority TEXT:{balanced,memory,ratio}

compression quality

-q,--qual TEXT:{2-avg,2-fix,4-avg,4-fix,5-avg,5-fix,avg,none,org}=none

quality compression mode

* org - original,

* none - discard (Q1 for all bases),

* avg - average over entire file,

* 2-fix, 4-fix, 5-fix - 2/4/5 bins with fixed representatives,

* 2-avg, 4-avg, 5-avg - 2/4/5 bins with averages as representatives.

-T,--qual-thresholds UINT ...

quality thresholds,

* single value for '2-fix' mode (default is 7),

* single value for '2-avg' mode (default is 7),

* three values for '4-fix' mode (default is 7 14 26),

* three values for '4-avg' mode (default is 7 14 26),

* four values for '5-fix' mode (default is 7 14 26 93),

* four values for '5-avg' mode (default is 7 14 26 93),

* not allowed for 'avg', 'org' and 'none' modes

-D,--qual-values UINT ...

quality values for decompression,

* single value for 'none' mode (default is 0),

* two values for '2-fix' mode (default is 1 13),

* four values for '4-fix' mode (default is 3 10 18 35),

* five values for '5-fix' mode (default is 3 10 18 35 93),

* not allowed for 'avg', 'org', '2-avg', '4-avg' and '5-avg' modes

-G,--reference-genome TEXT:FILE

optional reference genome path (multi-FASTA gzipped or not), enables reference-based mode which provides better compression ratios

-s,--store-reference stores the reference genome in the archive, use only with `-G` flag

-v,--verbose verbose

-a,--anchor-len UINT anchor len (default: auto adjust)

-L,--Lowest-count UINT=4 minimal k-mer count

-H,--Highest-count UINT=80 maximal k-mer count

-f,--filter-modulo UINT=12 k-mers for which hash(k-mer) mod f != 0 will be filtered out before graph building

-c,--max-candidates UINT:POSITIVE=5

maximal number of reference reads considered as reference

-e,--edit-script-mult FLOAT:POSITIVE=1

multipier for predicted cost of storing read part as edit script

-r,--max-recurence-level UINT:NONNEGATIVE=3

maximal level of recurence when considering alternative reference reads

--min-to-alt UINT:POSITIVE=64

minimum length of encoding part to consider using alternative read

--min-mmer-frac FLOAT:NONNEGATIVE=0.5

if A is set of m-mers in encode read R then read is refused from encoding if |A| < min-mmer-frac * len(R)

--min-mmer-force-enc FLOAT:NONNEGATIVE=0.9

if A is set of m-mers in encode read R then read is accepted to encoding always if |A| > min-mmer-force-enc * len(R)

--max-matches-mult FLOAT:NONNEGATIVE=10

if the number of matches between encode read R and reference read is r, then read is refused from encoding if r > max-matches-mult * len(R)

--fill-factor-filtered-kmers FLOAT:FLOAT in [0.1 - 0.99]=0.75

fill factor of filtered k-mers hash table

--fill-factor-kmers-to-reads FLOAT:FLOAT in [0.1 - 0.99]=0.8

fill factor of k-mers to reads hash table

--min-anchors UINT=1 if number of anchors common to encode read and reference candidate is lower than minAnchors candidate is refused

-i,--identifier TEXT:{main,none,org}=org

header compression mode

-R,--Ref-reads-mode TEXT:{all,sparse}=sparse

reference reads mode

-x,--sparse-exponent FLOAT=1

sparse mode exponent

> colord compress-pbhifi -h

compress PacBio HiFi data

Usage: colord compress-pbhifi [OPTIONS] input output

Positionals:

input TEXT:FILE REQUIRED input FASTQ/FASTA path (gzipped or not)

output TEXT REQUIRED archive path

Options:

-h,--help Print this help message and exit

-k,--kmer-len UINT:INT in [15 - 28]

k-mer length (default: auto adjust)

-t,--threads UINT=32 number of threads

-p,--priority TEXT:{balanced,memory,ratio}

compression quality

-q,--qual TEXT:{2-avg,2-fix,4-avg,4-fix,5-avg,5-fix,avg,none,org}=5-avg

quality compression mode

* org - original,

* none - discard (Q1 for all bases),

* avg - average over entire file,

* 2-fix, 4-fix, 5-fix - 2/4/5 bins with fixed representatives,

* 2-avg, 4-avg, 5-avg - 2/4/5 bins with averages as representatives.

-T,--qual-thresholds UINT ...

quality thresholds,

* single value for '2-fix' mode (default is 7),

* single value for '2-avg' mode (default is 7),

* three values for '4-fix' mode (default is 7 14 26),

* three values for '4-avg' mode (default is 7 14 26),

* four values for '5-fix' mode (default is 7 14 26 93),

* four values for '5-avg' mode (default is 7 14 26 93),

* not allowed for 'avg', 'org' and 'none' modes

-D,--qual-values UINT ...

quality values for decompression,

* single value for 'none' mode (default is 0),

* two values for '2-fix' mode (default is 1 13),

* four values for '4-fix' mode (default is 3 10 18 35),

* five values for '5-fix' mode (default is 3 10 18 35 93),

* not allowed for 'avg', 'org', '2-avg', '4-avg' and '5-avg' modes

-G,--reference-genome TEXT:FILE

optional reference genome path (multi-FASTA gzipped or not), enables reference-based mode which provides better compression ratios

-s,--store-reference stores the reference genome in the archive, use only with `-G` flag

-v,--verbose verbose

-a,--anchor-len UINT anchor len (default: auto adjust)

-L,--Lowest-count UINT=3 minimal k-mer count

-H,--Highest-count UINT=100 maximal k-mer count

-f,--filter-modulo UINT=40 k-mers for which hash(k-mer) mod f != 0 will be filtered out before graph building

-c,--max-candidates UINT:POSITIVE=8

maximal number of reference reads considered as reference

-e,--edit-script-mult FLOAT:POSITIVE=1

multipier for predicted cost of storing read part as edit script

-r,--max-recurence-level UINT:NONNEGATIVE=5

maximal level of recurence when considering alternative reference reads

--min-to-alt UINT:POSITIVE=48

minimum length of encoding part to consider using alternative read

--min-mmer-frac FLOAT:NONNEGATIVE=0.5

if A is set of m-mers in encode read R then read is refused from encoding if |A| < min-mmer-frac * len(R)

--min-mmer-force-enc FLOAT:NONNEGATIVE=0.9

if A is set of m-mers in encode read R then read is accepted to encoding always if |A| > min-mmer-force-enc * len(R)

--max-matches-mult FLOAT:NONNEGATIVE=10

if the number of matches between encode read R and reference read is r, then read is refused from encoding if r > max-matches-mult * len(R)

--fill-factor-filtered-kmers FLOAT:FLOAT in [0.1 - 0.99]=0.75

fill factor of filtered k-mers hash table

--fill-factor-kmers-to-reads FLOAT:FLOAT in [0.1 - 0.99]=0.8

fill factor of k-mers to reads hash table

--min-anchors UINT=1 if number of anchors common to encode read and reference candidate is lower than minAnchors candidate is refused

-i,--identifier TEXT:{main,none,org}=org

header compression mode

-R,--Ref-reads-mode TEXT:{all,sparse}=sparse

reference reads mode

-g,--sparse-range FLOAT=3 sparse mode range. The propability of reference read acceptance is 1/pow(id/range_reads, exponent), where range_reads is determined based on the number of symbols, which in turn is determined by the number of trusted unique k-mers (estimated genome length) multiplied by the value of this parameter

-x,--sparse-exponent FLOAT=1

sparse mode exponent

> colord decompress -h

decompression mode

Usage: colord decompress [OPTIONS] input output

Positionals:

input TEXT:FILE REQUIRED archive path

output TEXT REQUIRED output file path

Options:

-h,--help Print this help message and exit

-G,--reference-genome TEXT:FILE

optional reference genome path (multi-FASTA gzipped or not), required for reference-based archives with no reference genome embedded (`-G` compression without `-s` switch)

-v,--verbose verbose

実行方法

デフォルト設定。損失あり（lossy）。入力ファイル名、出力ファイル名の順番に指定する。

#ONT
colord compress-ont input.fastq output.default

#PacBio CLR/subreads
colord compress-pbraw input.fastq output.default

#PacBio HiFi
colord compress-pbhifi input.fastq output.default

ONTのリードに適用したところ、1/11ほどに圧縮された。

解凍する。

colord decompress input output.fq

アーカイブの情報を調べるにはinfoコマンドを使う。

colord info ont_reads.compressed

圧縮。損失無し（lossless）。

colord compress-ont -q org input.fastq output.lossless

#より圧縮率を高めるためにリファレンスゲノムも指定、10スレッド使用。verboseモード。
colord compress-ont -q org -t 10 -G ref.fasta -v input.fastq output.lossless

-q quality compression mode
org - original,
none - discard (Q1 for all bases),
avg - average over entire file,
2-fix, 4-fix, 5-fix - 2/4/5 bins with fixed representatives,
2-avg, 4-avg, 5-avg - 2/4/5 bins with averages as representatives.
-G optional reference genome path (multi-FASTA gzipped or not), enables reference-based mode which provides better compression ratios
-t number of threads
-v verbose mode.
-p {balanced, memory, ratio} compression quality

上と同じONTのリードに適用したところ、3/11ほどに圧縮された。

Githubより

CoLoRdのパラメータ数は多いが、ほとんどの場合、デフォルト値で問題なく動作する。圧縮に関しては、圧縮率と必要なリソース（主にメモリと計算時間）の間に常にトレードオフが存在する。CoLoRdのデフォルトの動作が不十分な場合、最初に試みるべきは圧縮優先モードの変更（-pパラメータ）である。デフォルト設定ではメモリ優先モードである。
クオリティスコアは圧縮に大きな影響を与える。その性質上圧縮しにくく、同時に（論文で紹介されているように）下流の解析に影響を与えることなく安全に解像度を下げることができる。このため、各優先度モードでは、品質スコアは非可逆圧縮される。もし、オリジナルの品質スコアを保持する必要がある場合は、-q orgを使用する必要がある。なお、他にもいくつかの品質圧縮モードが存在する（論文参照）。

ヒトNA12878の各パラメータ設定での圧縮サイズと必要な時間、必要なメモリーがGithubで紹介されています。

引用

CoLoRd: compressing long reads
Marek Kokot, Adam Gudyś, Heng Li & Sebastian Deorowicz
Nature Methods, Published: 28 March 2022