GeCo is a compression tool for genomic sequences (FASTA). It can achieve high compression ratios (whether it is lossless is unclear).
Official site
http://bioinformatics.ua.pt/software/geco/
Installation
https://github.com/pratas/geco
Download the source and build it.
brew install cmake wget gcc48 # only if these are not already installed
git clone https://github.com/pratas/geco.git
cd geco/src/
cmake .
make
$ ./GeCo
Usage: GeCo [OPTION]... -r [FILE] [FILE]:[...]
Compress and analyze a genomic sequence (by default, compress).
Non-mandatory arguments:
-h give this help,
-x show several running examples,
-s show GeCo compression levels,
-v verbose mode (more information),
-V display version number,
-f force overwrite of output,
-l <level> level of compression [1;9] (lazy -tm setup),
-g <gamma> mixture decayment forgetting factor. It is
a real value in the interval [0;1),
-c <cache> maximum collisions for hash cache. Memory
values are higly dependent of the parameter
specification,
-e it creates a file with the extension ".iae"
with the respective information content. If
the file is FASTA or FASTQ it will only use
the "ACGT" (genomic) data,
-rm <c>:<d>:<i>:<m/e> reference context model (ex:-rm 13:100:0:0/0),
-rm <c>:<d>:<i>:<m/e> reference context model (ex:-rm 18:1000:0:1/1000),
...
-tm <c>:<d>:<i>:<m/e> target context model (ex:-tm 4:1:0:0/0),
-tm <c>:<d>:<i>:<m/e> target context model (ex:-tm 18:20:1:2/10),
...
target and reference templates use <c> for
context-order size, <d> for alpha (1/<d>),
<i> (0 or 1) to set the usage of inverted
repeats (1 to use) and <m> to the maximum
allowed mutation on the context without
being discarded (usefull in deep contexts),
under the estimator <e>,
-r <FILE> reference file ("-rm" are loaded here),
Mandatory arguments:
<FILE> file to compress (last argument). For more
files use splitting ":" characters.
Report bugs to <{pratas,ap,pjf}@ua.pt>.
$ ./GeDe
Usage: GeDe [OPTION]... -r [FILE] [FILE]:[...]
Decompress a genomic sequence compressed by GeCo.
Non-mandatory arguments:
-h give this help,
-v verbose mode (more information),
-r <FILE> reference file,
Mandatory arguments:
<FILE> file to uncompress (last argument). For
more files use splitting ":" characters.
Report bugs to <{pratas,ap,pjf}@ua.pt>.
Add the directory containing the GeCo and GeDe binaries to your PATH.
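A minimal sketch of one way to do this, assuming the repository was cloned into your home directory. The location `$HOME/geco/src` and the variable name `GECO_DIR` are assumptions for illustration; adjust them to wherever the build actually produced the binaries.

```shell
# Hypothetical clone location; change this to wherever geco/src actually lives.
GECO_DIR="$HOME/geco/src"

# Append the directory to PATH for the current shell session only.
export PATH="$PATH:$GECO_DIR"
```

To make the change permanent, the same export line can be added to your shell startup file.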
Run
Compress at level 5.
GeCo -l 5 File.seq
- -l <level> level of compression [1;9] (lazy -tm setup),
Decompress
GeDe File.seq.co
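Since it is unclear whether the compression is lossless, a round-trip comparison on your own data is one way to check. The sketch below assumes that GeDe writes its output next to the input with a `.de` extension (File.seq.de); that naming is an assumption not confirmed above, so verify it against what GeDe actually produces. The function name `roundtrip_check` is made up for this example.

```shell
# Compress, decompress, and compare the result with the original, byte for byte.
# Assumption: GeDe writes the decompressed data to "<input>.de".
roundtrip_check() {
  input="$1"
  GeCo -l 5 "$input" || return 1   # produces "$input.co"
  GeDe "$input.co" || return 1     # assumed to produce "$input.de"
  cmp "$input" "$input.de"         # exit status 0 only if the files are identical
}

# Only run the check when both tools and the example file are available.
if command -v GeCo >/dev/null 2>&1 && command -v GeDe >/dev/null 2>&1 && [ -f File.seq ]; then
  if roundtrip_check File.seq; then
    echo "round trip OK: output is identical to the input"
  else
    echo "round trip differs or failed: not lossless for this file"
  fi
else
  echo "GeCo/GeDe or File.seq not found; skipping round-trip check"
fi
```

A difference reported by `cmp` does not necessarily mean the bases changed; it may only reflect dropped headers or line breaks.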
Do not use it on FASTQ files: when a FASTQ file is compressed, the headers and quality values are discarded and the reads end up as a single sequence.
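To illustrate what is lost, the sketch below keeps only the base line of each four-line FASTQ record and joins the lines together. This approximates the result described above; it is not GeCo's actual parsing code. The function name `fastq_bases` and the two demo reads are made up for the example.

```shell
# Keep only line 2 of every 4-line FASTQ record (the bases) and join them,
# discarding headers, "+" separators, and quality strings.
fastq_bases() {
  awk 'NR % 4 == 2' "$1" | tr -d '\n'
}

# Tiny two-read example file (made-up data).
demo=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo"

# Prints ACGTTTGA: read boundaries, names, and qualities are gone.
fastq_bases "$demo"
echo
```

Once the reads are merged like this, the original FASTQ records cannot be reconstructed from the sequence alone.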
Citation
Efficient Compression of Genomic Sequences
D. Pratas, A. J. Pinho, P. J. S. G. Ferreira.
Data Compression Conference (DCC), Snowbird, Utah, March 2016. DOI: 10.1109/DCC.2016.60