2020-05-19

genozip: an efficient VCF compressor and associated tools

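Large-scale genome projects are becoming increasingly common, and they produce VCF (Variant Call Format; Danecek et al., 2011) files that comprise thousands of individual genomic datasets. Even in compressed form, such files are very large (typically several GB), so the costs of long-term data storage and file transfer are rising rapidly, which has spurred the development of more efficient compression algorithms.
Recently, a handful of new compression algorithms have emerged (e.g. Durbin, 2014; Deorowicz and Danek, 2019; Kelleher et al., 2019). These work by compressing the genotypes in a VCF file, but genotypes are only one of the data types represented in a VCF file and are often a minor contributor to the total data content. For example, in the file used as a real-world example in Durbin (2014) (File1 in the authors' benchmark), genotypes account for only 7.1% of the uncompressed VCF data. Compressing genotypes alone is therefore clearly not a sufficient strategy for compressing VCF files.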
Here, the authors introduce genozip, a lossless compression tool. genozip can handle VCF files of any ploidy, phasing structure, or variant type. Its main purpose is to package genomic data efficiently. It also provides features for use in analysis pipelines.
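
To make the point about genotypes being a small share of a VCF file concrete, the following is a minimal sketch that measures how many bytes of a VCF sit in the per-sample columns. The file demo.vcf and its contents are made up for illustration, and the samples carry only a GT subfield, so the sample columns stand in for the genotypes here. It is not how the authors measured the 7.1% figure.

```shell
#!/bin/sh
# Sketch: estimate what share of a VCF's bytes sits in the per-sample columns.
# demo.vcf and its records are hypothetical; they exist only for this demo.
set -eu
workdir=$(mktemp -d)
vcf="$workdir/demo.vcf"

# Build a tiny tab-separated VCF with two samples that carry GT only.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '1\t100\t.\tA\tG\t50\tPASS\tDP=10\tGT\t0|1\t1|1\n'
  printf '1\t200\t.\tC\tT\t60\tPASS\tDP=12\tGT\t0|0\t0|1\n'
} > "$vcf"

# Columns 10 onward hold the samples. Count their bytes against all bytes.
genotype_share=$(awk -F '\t' '
  { total += length($0) + 1 }                        # +1 for the newline
  !/^#/ {
    for (i = 10; i <= NF; i++) gt += length($i) + 1  # +1 for the separator after the field
  }
  END { printf "%.1f", 100 * gt / total }
' "$vcf")

echo "sample columns: ${genotype_share}% of bytes"
```

On a real file, sample columns usually hold more than GT, so a faithful measurement would need to split each sample field on ":" and count only the GT part.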

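The genozip package runs on all common operating systems and includes four command-line tools: genozip, genounzip, genocat, and genols. genounzip decompresses .vcf.genozip files to .vcf or .vcf.gz format, and genols reports statistics about the contents of .genozip files.
To support seamless integration into analysis pipelines, the genocat command provides access to the data in a .vcf.genozip file, with options such as --regions and --samples that allow random access to the data. Indexing is performed as part of compression; there is no separate indexing step or index file. In addition, the toolset is designed to work with the standard input and output streams.
By encrypting data with --password (using 256-bit AES), genozip enables efficient and secure distribution of genomic files that meets strict privacy requirements. Data integrity is further assured by generating an MD5 signature with --md5. In addition, the --output option concatenates VCF files that contain identical samples, and genounzip --split regenerates the original component files.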

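Several options were added so that users can adjust compression to their needs. First, the --optimize option modifies data in the INFO and FORMAT subfields, for example by rounding floating-point numbers to two significant digits or capping Phred values, to improve compression. Note that because the VCF data are modified, compression in this mode is not lossless, although according to the authors the results of downstream analysis are not affected. Second, the --gtshark option uses GTShark (Deorowicz and Danek, 2019), which improves the compression ratio compared with using either genozip or GTShark alone (see the Supplementary Material). Finally, the --vblock and --sblock options control the trade-off between compression and speed when subsetting by region or sample.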
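
As a rough illustration of the two lossy transformations mentioned for --optimize, the sketch below rounds a number to two significant digits and caps a Phred-scaled value. The helper names round2 and cap_phred and the cap of 60 are assumptions made for this demo; they are not taken from genozip, whose actual rules may differ.

```shell
#!/bin/sh
# Sketch of the kind of lossy tidy-up described for --optimize; not genozip's code.
set -eu

# Round a number to two significant digits (hypothetical helper).
round2() {
  awk -v x="$1" 'BEGIN { printf "%.2g\n", x }'
}

# Cap a Phred-scaled value at a ceiling; 60 is an arbitrary default for this demo.
cap_phred() {
  awk -v q="$1" -v cap="${2:-60}" 'BEGIN { print (q > cap ? cap : q) }'
}

round2 0.123456   # a small fraction, such as an allele frequency
round2 45.678     # a larger value
cap_phred 255     # above the ceiling
cap_phred 37      # below the ceiling, left unchanged
```

Shorter, more repetitive values like these generally compress better, which is the motivation for accepting the loss of precision.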
Additional software is required for some functions: bcftools to compress .bcf files into .genozip format, XZ Utils (Collin, 2011) to compress .xz files, bgzip to decompress to .vcf.gz, GTShark to use --gtshark, and cURL (Hostetter et al., 1997) to compress from a URL.

Installation

Tested on macOS 10.14.

GitHub

git clone https://github.com/divonlan/genozip.git
cd genozip/
make -j

$ ./genozip

Compress VCF (Variant Call Format) files

Usage: genozip [options]... [files or urls]...

One or more file names or URLs may be given, or if omitted, standard input is used instead

Supported input file types: .vcf .vcf.gz .vcf.bgz .vcf.bz2 .vcf.xz .bcf .bcf.gz .bcf.bgz

Note: for .bcf files, bcftools needs to be installed, and for .xz files, xz needs to be installed

Examples: genozip file1.vcf file2.vcf -o concat.vcf.genozip

genozip --optimize -password 12345 ftp://ftp.ncbi.nlm.nih.gov/file2.vcf.gz

See also: genozip genounzip genols

Options:

-r --regions Show one or more regions of the file. Examples:

genocat myfile.vcf.genozip -r22:1000000-2000000 (A range of chromosome 22)

genocat myfile.vcf.genozip -r-2000000,2500000- (Two ranges of all chromosomes)

genocat myfile.vcf.genozip -r21,22 (All of chromosome 21 and 22)

genocat myfile.vcf.genozip -r^MT,Y (All of chromosomes except for MT and Y)

genocat myfile.vcf.genozip -r^-10000 (All sites on all chromosomes, except positions up to 10000)

Note: genozip files are indexed automatically during compression. There is no separate indexing step or separate index file

Note: Indels are considered part of a region if their start position is

Note: Multiple -r arguments may be specified - this is equivalent to chaining their regions with a comma separator in a single argument

-t --targets Identical to --regions, provided for pipeline compatibility

-s --samples [^]sample[,...]

Show a subset of samples (individuals). Examples:

genocat myfile.vcf.genozip -s HG00255,HG00256 (show two samples)

genocat myfile.vcf.genozip -s ^HG00255,HG00256 (show all samples except these two)

Note: This does not change the INFO data (including the AC and AN tags)

Note: sample names are case-sensitive

Note: Multiple -s arguments may be specified - this is equivalent to chaining their samples with a comma separator in a single argument

-G --drop-genotypes Output the data without the individual genotypes and FORMAT column

-H --no-header Don't output the VCF header

-1 --header-one Don't output the VCF header, except for the last line (with the field and sample names)

--header-only Output only the VCF header

--GT-only For samples, output only genotype (GT) data, dropping the other subfields

--strip Don't output values for ID, QUAL, FILTER, INFO; FORMAT is only GT (at most); Samples include allele values (i.e. GT subfield) only

-o --output <output-filename>. Output to this filename instead of stdout

-p --password Provide password to access file(s) that were compressed with --password

-@ --threads Specify the maximum number of threads. By default, this is set to the number of cores available. The number of threads actually used may be less, if sufficient to balance CPU and I/O

-q --quiet Don't show warnings

-Q --noisy The --quiet is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings

-h --help Show this help page. Use with -f to see developer options. Use --header-only if that is what you are looking for

-L --license Show the license terms and conditions for this product

-V --version Display version number

Bug reports and feature requests: bugs@genozip.com

Commercial license inquiries: sales@genozip.com

THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Usage

Compress a VCF file. This writes input.vcf.genozip.

genozip input.vcf

# Multiple files, with the output file specified
genozip file1.vcf file2.vcf -o concat.vcf.genozip

Decompress.

genounzip input.vcf.genozip

View the contents without decompressing the file on disk.

genocat input.vcf.genozip |less

# Positions 1-10000 of chr1
genocat -r chr1:1-10000 input.vcf.genozip |less
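
Because compression is described as lossless and genocat writes to standard output, a round trip can be checked by comparing the streamed output with the original file. The following is a minimal sketch under stated assumptions: demo.vcf is a made-up file, the tools are assumed to be on PATH (after the make step above they sit in the build directory as ./genozip and ./genocat), and the genozip steps are skipped when the tools are not found.

```shell
#!/bin/sh
# Sketch: compress a small VCF, stream it back with genocat, and compare with the original.
# File names are hypothetical; the genozip steps run only if the tools are on PATH.
set -eu
workdir=$(mktemp -d)
vcf="$workdir/demo.vcf"

# A tiny tab-separated VCF with one sample, used only for this check.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=10\tGT\t0|1\n'
} > "$vcf"

if ! command -v genozip >/dev/null 2>&1 || ! command -v genocat >/dev/null 2>&1; then
  status="skipped"                 # tools not installed; nothing to compare
elif genozip "$vcf" </dev/null \
  && genocat "$vcf.genozip" > "$workdir/roundtrip.vcf" \
  && cmp -s "$vcf" "$workdir/roundtrip.vcf"; then
  status="identical"               # streamed output matches the input byte for byte
else
  status="different or failed"
fi

echo "round trip: $status"
```

A file compressed with --optimize would not be expected to pass this comparison, since that mode modifies the data.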

Citation

genozip: a fast and efficient compression tool for VCF files
Divon Lan, Raymond Tobler, Yassine Souilmi, Bastien Llamas
Bioinformatics, Published: 14 May 2020
