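Large-scale genome projects are becoming increasingly common, and they produce VCF (Variant Call Format; Danecek et al., 2011) files that comprise thousands of individual genomic datasets. Even in compressed form such files are very large (typically several GB), so the cost of long-term data storage and file transfer is rising rapidly, which has spurred the development of more efficient compression algorithms.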
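Recently a handful of new compression algorithms have emerged (e.g. Durbin, 2014; Deorowicz and Danek, 2019; Kelleher et al., 2019). They work by compressing the genotypes in a VCF file, but genotypes are only one of the data types represented in a VCF file and are often a minor contributor to the total data content. For example, in the file used as a real-world example by Durbin (2014) (File1 in the paper's benchmark), genotypes account for only 7.1% of the uncompressed VCF data. Compressing genotypes alone is therefore clearly not a sufficient strategy for compressing VCF files.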
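To get a feel for how small the genotype share can be, a rough Python sketch like the one below counts the characters that sit in GT subfields and divides by the total size of the VCF text. The function name genotype_share and the two-sample record are made up for illustration, and this is not necessarily how the 7.1% figure in the paper was measured.

```python
def genotype_share(vcf_text: str) -> float:
    """Return the fraction of characters in vcf_text that are GT subfield values.

    Illustrative only: header lines count toward the total size but
    never toward the genotype characters.
    """
    total = len(vcf_text)
    gt_chars = 0
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no FORMAT column or no samples on this line
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            if gt_index < len(parts):
                gt_chars += len(parts[gt_index])
    return gt_chars / total if total else 0.0


# Hypothetical two-sample record, used only to exercise the function.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\t.\tA\tG\t50\tPASS\tDP=10\tGT:DP\t0/1:5\t1/1:5\n"
)
print(f"GT share: {genotype_share(example):.1%}")
```

On a real file the share depends heavily on how many INFO and FORMAT subfields accompany each genotype.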
Here the authors introduce genozip, a lossless compression tool. genozip can handle VCF files of any ploidy, phasing structure, or variant type. Its main purpose is to package genomic data efficiently.
It also provides features for use in analysis pipelines.
The genozip package runs on all common operating systems and includes four command-line tools: genozip, genounzip, genocat, and genols. genounzip decompresses .vcf.genozip files to .vcf or .vcf.gz, and genols reports statistics about the contents of .genozip files.
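To support seamless integration into analysis pipelines, the genocat command gives access to the data inside a .vcf.genozip file, with options such as --regions and --samples that enable random access to the data. Indexing is done as part of compression, so there is no separate indexing step or index file. In addition, the toolset is designed to work with the standard input and output streams.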
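By encrypting data with --password (256-bit AES), genozip allows genomic files to be distributed efficiently and securely under strict privacy requirements. Data integrity is further verified with the MD5 hash generated by --md5. In addition, the --output option can concatenate VCF files that contain the same samples, and genounzip --split regenerates the original component files.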
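The integrity check can also be reproduced by hand. The sketch below computes the MD5 of a VCF file with Python's hashlib so that it can be compared with the value genounzip --md5 reports. The helper name vcf_md5 is my own, and hashing the uncompressed bytes of a .gz input is an assumption based on the note in the help output further down, not on genozip's source code.

```python
import gzip
import hashlib


def vcf_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of the uncompressed contents of a VCF file.

    Files ending in .gz or .bgz are read through gzip (assumption: the
    digest of interest is that of the uncompressed data).
    """
    opener = gzip.open if path.endswith((".gz", ".bgz")) else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read in chunks so that multi-gigabyte files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical use would be to call vcf_md5("input.vcf") before compression and compare the result with the hash shown after decompression.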
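Several further options let users tune compression to their needs. First, --optimize improves compression by modifying the data in INFO and FORMAT subfields, for example by rounding floating-point numbers to two significant digits and by capping Phred values. Because this changes the VCF data, compression is no longer lossless, although the changes are intended to be insignificant for downstream analysis. Second, --gtshark uses GTShark (Deorowicz and Danek, 2019), which improves the compression ratio compared with using either genozip or GTShark alone (see the Supplementary Material). Finally, --vblock and --sblock control the trade-off between compression ratio and the speed of subsetting by region or by sample.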
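What --optimize does to the values can be illustrated with a few lines of Python. This is only a sketch of the two transformations described here (capping Phred values at 60 and rounding to two significant digits), checked against the examples in the help output below. The function names are hypothetical, and genozip's actual implementation and number formatting may differ, especially for very large or very small values.

```python
import math


def cap_phred(pl_field: str, cap: int = 60) -> str:
    """Cap each comma-separated Phred value at `cap`; '.' (missing) is kept."""
    return ",".join(
        v if v == "." else str(min(int(v), cap)) for v in pl_field.split(",")
    )


def round_sig(value: float, digits: int = 2) -> float:
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


def round_field(field: str, digits: int = 2) -> str:
    """Round each comma-separated number in a GL/GP/VQSLOD-style field."""
    return ",".join(
        v if v == "." else format(round_sig(float(v), digits), "g")
        for v in field.split(",")
    )


print(cap_phred("0,18,270"))
print(round_field("-2.61618,-0.447624,-0.193264"))
print(round_field("-4.19494"))
```

The rounded text is shorter and far more repetitive across samples, which is what makes it compress better.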
Some features need external tools: bcftools to compress .bcf files to the .genozip format, XZ Utils (Collin, 2011) to compress .xz files, bgzip to decompress to .vcf.gz, GTShark to use --gtshark, and cURL (Hostetter et al., 1997) to compress from a URL.
Installation
Tested on macOS 10.14.
git clone https://github.com/divonlan/genozip.git
cd genozip/
make -j
> ./genozip
Compress VCF (Variant Call Format) files
Usage: genozip [options]... [files or urls]...
One or more file names or URLs may be given, or if omitted, standard input is used instead
Supported input file types: .vcf .vcf.gz .vcf.bgz .vcf.bz2 .vcf.xz .bcf .bcf.gz .bcf.bgz
Note: for .bcf files, bcftools needs to be installed, and for .xz files, xz needs to be installed
Examples: genozip file1.vcf file2.vcf -o concat.vcf.genozip
genozip --optimize --password 12345 ftp://ftp.ncbi.nlm.nih.gov/file2.vcf.gz
See also: genounzip genocat genols
Actions - use at most one of these actions:
-d --decompress Same as running genounzip. For more details, run: genounzip --help
-l --list Same as running genols. For more details, run: genols --help
-h --help Show this help page. Use with -f to see developer options.
-L --license Show the license terms and conditions for this product
-V --version Display version number
Flags:
-c --stdout Send output to standard output instead of a file
-f --force Force overwrite of the output file, or force writing .vcf.genozip data to standard output
-^ --replace Replace the source file with the result file, rather than leaving it unchanged
-o --output <output-filename>. This option can also be used to concatenate multiple input files with the same individuals, into a single concatenated output file
-p --password <password>. Password-protected - encrypted with 256-bit AES
-m --md5 Calculate the MD5 hash of the VCF file. When the resulting file is decompressed, this MD5 will be compared to the MD5 of the decompressed VCF.
Note: for compressed files, e.g. myfile.vcf.gz, the MD5 calculated is that of the original, uncompressed file.
-q --quiet Don't show the progress indicator or warnings
-Q --noisy The --quiet is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings
-t --test After compressing normally, decompresses in memory (i.e. without writing the decompressed file to disk) - comparing the MD5 of the resulting decompressed
file to that of the original VCF. This option also activates --md5
-@ --threads <number>. Specify the maximum number of threads. By default, this is set to the number of cores available. The number of threads actually used may be
less, if sufficient to balance CPU and I/O.
--show-content Show the information content of VCF files and the compression ratios of each component
Optimizing:
-9 --optimize Modify the VCF file in ways that are likely insignificant for analytical purposes, but make a significant difference for compression. At the moment,
these optimizations include:
- PL data: Phred values of over 60 are changed to 60. Example: '0,18,270' -> '0,18,60'
- GL data: Numbers are rounded to 2 significant digits. Example: '-2.61618,-0.447624,-0.193264' -> '-2.6,-0.45,-0.19'
- GP data: Numbers are rounded to 2 significant digits, as with GL.
- VQSLOD data: Number is rounded to 2 significant digits. Example: '-4.19494' -> '-4.2'
Note: due to these data modifications, files compressed with --optimize are NOT identical to the original VCF after decompression. For this reason, it
is not possible to use this option in combination with --test or --md5
-B --vblock <number between 1 and 2048>. Set the maximum size of memory (in megabytes) of VCF file data that can go into one variant block. By default, this is set
to 128 MB. The variant block is the basic unit of data on which genozip and genounzip operate. This value affects a number of things: 1. Memory
consumption of both compression and decompression are linear with the variant block size. 2. Compression is sometimes better with larger block sizes, in
particular if the number of samples is small. 3. Smaller blocks will result in faster 'genocat --regions' lookups
-S --sblock <number>. Set the number of samples per sample block. By default, it is set to 4096. When compressing or decompressing a variant block, the samples
within the block are divided to sample blocks which are compressed separately. A higher value will result in a better compression ratio, while a lower
value will result in faster 'genocat --samples' lookups
-K --gtshark Use gtshark instead of the default bzlib as the final compression step for allele data (the GT subfield in the sample data).
Note: For this to work, gtshark needs to be installed - it is a separate software package that is not affiliated with genozip in any way. It can be found
here: https://github.com/refresh-bio/GTShark
Note: gtshark also needs to be installed for decompressing files that were compressed with this option.
genozip is available for free for non-commercial use and some other limited use cases. See 'genozip -L' for details. Commercial use requires a commercial license.
Bug reports and feature requests: bugs@genozip.com
Commercial license inquiries: sales@genozip.com
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
> ./genounzip
Uncompress VCF (Variant Call Format) files previously compressed with genozip
Usage: genounzip [options]... [files]...
One or more file names must be given
Examples: genounzip file1.vcf.genozip file2.vcf.genozip
genounzip file.vcf.genozip --output file.vcf.gz
genounzip concat.vcf.genozip --split
See also: genozip genocat genols
Options:
-c --stdout Send output to standard output instead of a file
-z --bgzip Compress the output VCF file(s) with bgzip
Note: this option is implicit if --output specifies a filename ending with .gz or .bgz
Note: bgzip needs to be installed for this option to work
-f --force Force overwrite of the output file
-^ --replace Replace the source file with the result file, rather than leaving it unchanged
-O --split Split a concatenated file back to its original components
-o --output <output-filename>. Output to this filename instead of the default one
-p --password <password>. Provide password to access file(s) that were compressed with --password
-m --md5 Show the MD5 hash of the decompressed VCF file. If the file was originally compressed with --md5, it also verifies that the MD5 of the original VCF file
is identical to the MD5 of the decompressed VCF.
Note: for compressed files, e.g. myfile.vcf.gz, the MD5 calculated is that of the original, uncompressed file.
-q --quiet Don't show the progress indicator or warnings
-Q --noisy The --quiet is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings
-t --test Decompress in memory (i.e. without writing the decompressed file to disk) - comparing the MD5 of the resulting decompressed file to that of the original
VCF. Works only if the file was compressed with --md5
-@ --threads <number>. Specify the maximum number of threads. By default, this is set to the number of cores available. The number of threads actually used may be
less, if sufficient to balance CPU and I/O
-h --help Show this help page. Use with -f to see developer options.
-L --license Show the license terms and conditions for this product
-V --version Display version number
> ./genocat
Print VCF (Variant Call Format) file(s) previously compressed with genozip
Usage: genocat [options]... [files]...
One or more file names must be given
See also: genozip genounzip genols
Options:
-r --regions [^]chr|chr:pos|pos|chr:from-to|chr:from-|chr:-to|from-to|from-|-to[,...]
Show one or more regions of the file. Examples:
genocat myfile.vcf.genozip -r22:1000000-2000000 (A range of chromosome 22)
genocat myfile.vcf.genozip -r-2000000,2500000- (Two ranges of all chromosomes)
genocat myfile.vcf.genozip -r21,22 (All of chromosome 21 and 22)
genocat myfile.vcf.genozip -r^MT,Y (All of chromosomes except for MT and Y)
genocat myfile.vcf.genozip -r^-10000 (All sites on all chromosomes, except positions up to 10000)
Note: genozip files are indexed automatically during compression. There is no separate indexing step or separate index file
Note: Indels are considered part of a region if their start position is
Note: Multiple -r arguments may be specified - this is equivalent to chaining their regions with a comma separator in a single argument
-t --targets Identical to --regions, provided for pipeline compatibility
-s --samples [^]sample[,...]
Show a subset of samples (individuals). Examples:
genocat myfile.vcf.genozip -s HG00255,HG00256 (show two samples)
genocat myfile.vcf.genozip -s ^HG00255,HG00256 (show all samples except these two)
Note: This does not change the INFO data (including the AC and AN tags)
Note: sample names are case-sensitive
Note: Multiple -s arguments may be specified - this is equivalent to chaining their samples with a comma separator in a single argument
-G --drop-genotypes Output the data without the individual genotypes and FORMAT column
-H --no-header Don't output the VCF header
-1 --header-one Don't output the VCF header, except for the last line (with the field and sample names)
--header-only Output only the VCF header
--GT-only For samples, output only genotype (GT) data, dropping the other subfields
--strip Don't output values for ID, QUAL, FILTER, INFO; FORMAT is only GT (at most); Samples include allele values (i.e. GT subfield) only
-o --output <output-filename>. Output to this filename instead of stdout
-p --password Provide password to access file(s) that were compressed with --password
-@ --threads Specify the maximum number of threads. By default, this is set to the number of cores available. The number of threads actually used may be less, if
sufficient to balance CPU and I/O
-q --quiet Don't show warnings
-Q --noisy The --quiet is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings
-h --help Show this help page. Use with -f to see developer options. Use --header-only if that is what you are looking for
-L --license Show the license terms and conditions for this product
-V --version Display version number
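Usage
Compress a VCF file.
genozip input.vcf
# Multiple files, with the output file name given explicitly.
genozip file1.vcf file2.vcf -o concat.vcf.genozip
The first command writes input.vcf.genozip; the second writes concat.vcf.genozip.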
Decompress.
genounzip input.vcf.genozip
View the contents without decompressing the file to disk.
genocat input.vcf.genozip |less
# Positions 1-10000 of chr1
genocat -r chr1:1-10000 input.vcf.genozip |less
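Because genocat writes to standard output, the same lookups can be driven from a script. The following Python sketch wraps the -r and -s options shown in the help output above. The function names genocat_command and stream_records are my own, and the sketch assumes that genocat is on the PATH and accepts the option value as a separate argument, as in the commands above.

```python
import subprocess
from typing import Iterator, List, Optional, Sequence


def genocat_command(
    path: str,
    region: Optional[str] = None,
    samples: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build a genocat argument list using only the -r and -s options."""
    cmd = ["genocat"]
    if region:
        cmd += ["-r", region]
    if samples:
        cmd += ["-s", ",".join(samples)]
    cmd.append(path)
    return cmd


def stream_records(
    path: str,
    region: Optional[str] = None,
    samples: Optional[Sequence[str]] = None,
) -> Iterator[str]:
    """Yield VCF data lines (header lines skipped) from genocat's stdout."""
    cmd = genocat_command(path, region, samples)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            if not line.startswith("#"):
                yield line.rstrip("\n")
    # Leaving the with-block waits for genocat to exit.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


print(genocat_command("input.vcf.genozip", region="chr1:1-10000"))
```

Iterating over stream_records("input.vcf.genozip", region="chr1:1-10000") would then process the records one line at a time without writing a decompressed file.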
Citation
genozip: a fast and efficient compression tool for VCF files
Divon Lan, Raymond Tobler, Yassine Souilmi, Bastien Llamas
Bioinformatics, Published: 14 May 2020