macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

効率的なゲノムファイル(FASTQ、SAM/BAM/CRAM、VCF、GVF、FASTA、PHYLIP、23andMe)の圧縮器 Genozip

2021 7/20 DocumentaitonのURL修正

2021 10/9 コマンド追記

2021 11/9 追記

 

 ゲノムデータのための汎用的で機能が充実した圧縮ソフトウェアであるGenozipを紹介する。Genozipは、汎用性(一般的なゲノムファイル形式をすべてサポート)、高圧縮率、高速性、機能性、拡張性の5つのコア機能を提供することで、ゲノム圧縮のための汎用ソフトウェアおよび開発フレームワークとして設計されている。

 Genozipは、FASTQ、SAM/BAM/CRAM、VCF、GVF、FASTA、PHYLIP、23andMeフォーマットなど、ゲノム研究で広く使われているゲノムデータフォーマットに対応した高性能圧縮を提供する。テスト結果は、Genozipが高速で、ファイルがすでに圧縮されている場合でも、大幅に改善された圧縮率を達成していることを示している。

 さらに、Genozipは、ファイルフォーマットに特化したセグメンタやデータタイプに特化したコーデックからGenozipフレームワークを分離して設計されている。これにより、Genozipは、研究者が将来的に追加のファイルフォーマットや、ファイル内のデータタイプやフィールドのための新しいコーデックのための圧縮を実装できる汎用圧縮プラットフォームになることを目指している。これにより、最終的には、ユーザーコミュニティによるこれらのアルゴリズムの認知度と採用率が向上し、この分野でのさらなる技術革新が加速することを期待している。

 GenozipはC言語で書かれている。コードはオープンソースで、GitHub (https://github.com/divonlan/genozip)で公開されている。このパッケージは非商用利用の場合は無料である。DockerHub上のDockerコンテナとして、またcondaパッケージマネージャを通じて配布されている。GenozipはLinuxMacWindowsでテストされている。

 

Documentaiton

https://genozip.readthedocs.io

 

  • Genozipはゲノムファイル用の圧縮機でFASTQ、SAM/BAM/CRAM、VCF/BCF、FASTA、GVF、Phylip、23andMeファイルを圧縮するように最適化されているが、ゲノムファイルだけでなく、あらゆるファイルを圧縮できる。すでに .gz .bz2 .xz で圧縮されている場合も圧縮できる。
  • 圧縮率は圧縮されるデータに依存する。通常、.bam を圧縮する場合は 1.5~3 倍、.fastq.gz ファイル (つまり既に圧縮されているファイルを圧縮する場合) では 2~5 倍、GT データのみを含む非圧縮の高サンプルカウント .vcf ファイルを圧縮する場合は最大 200 倍の圧縮率が期待できる。
  • 圧縮はロスレスで、解凍されたファイルは元のファイルと100%同じ。厳密なロスレス化の例外は --optimize オプションを使用した場合。
  • 元のファイルが BGZF で圧縮されていた場合、genounzip は解凍時に BGZF でファイルを再圧縮する(--plain が指定されていない時)。しかし、使用されているライブラリが異なるために、全く同じ BGZF 圧縮はできないことがある。

f:id:kazumaxneo:20210218233123p:plain

マニュアルより

 

2021 7/3 

Genozip v12のリリース。圧縮・解析機能の段階的な向上(RELEASE NOTES参照)に加えて、2つの大きな機能が追加された。1つ目はDual coordinate VCFのサポート。Dual coordinate VCF(link)は、例えばGRCh37とGRCh38のように、2つの座標系の座標を同時に含むVCFファイル。2つ目は、 kraken2を使ったBAMファイルのspeciesフィルタリング機能。バクテリアのリードを直接特定することで、ヒトゲノムデータからバクテリアの汚染をフィルタリングすることができる(詳細)。この機能はBAMファイル(FASTQだけではない)で動作するので、解析のどの時点でも使用することができる。

 

11/9 

new benchmarks of Genozip

https://genozip.readthedocs.io/benchmarks.html

 

インストール

ソースからのビルド推奨。(12.08はsegmentation errrorを起こしたので12.07をソースからビルド)

#Form github
git clone https://github.com/divonlan/genozip
make
requires: gcc or clang, make

#conda、ここでは高速なmambaを使う
mamba install -c conda-forge genozip -y

> genozip

$ genozip

 

Compress genomics files. Genozip can compress any file, but is optimally designed to compress the following file types:

   VCF/BCF, SAM/BAM/CRAM, FASTQ, FASTA, GVF and 23andMe

 

Usage: genozip [options]... [files or urls]...

 

One or more file names or URLs may be given, or if omitted, standard input is used instead

 

Supported input file types, as recognized by their listed filename extension(s): 

   FASTA:   fasta, fa, faa, ffn, fnn, fna (possibly .gz .bgz .bz2 .xz)

   FASTQ:   fastq, fq (possibly .gz .bgz .bz2 .xz)

   SAM:     sam (possibly .gz .bgz .bz2 .xz)

   BAM:     bam

   CRAM:    cram

   VCF:     vcf (possibly .gz .bgz .bz2 .xz)

   BCF:     bcf (possibly .gz .bgz)

   GVF:     gvf (possibly .gz .bgz .bz2 .xz)

   23andMe: genome*Full*.txt (possibly zip)

   Generic: any other file (possibly .gz .bgz .bz2 .xz)

 

Note: for comressing .bcf, .cram or .xz files requires bcftools, samtools or xz, respectively, to be installed, as does using --index

 

Examples: genozip sample.bam

          genozip sample.R1.fq.gz sample.R2.fq.gz --pair --reference hg19.ref.genozip -o sample.genozip          genozip --optimize -password 12345 ftp://ftp.ncbi.nlm.nih.gov/file2.vcf.gz

 

See also: genounzip genocat genols

 

Actions - use at most one of these actions:

   -d --decompress   Same as running genounzip. For more details, run: genounzip --help

 

   -l --list         Same as running genols. For more details, run: genols --help

 

   -h --help         <topic> Show this help page. Optional <topic> can be:

                     dev   - list of developer options

                     input - list of possible arguments of --input

 

   -L --license      Show the license terms and conditions for this product

 

   -V --version      Display version number

 

Flags:

   -i --input        <data-type>. data-type is one of the supported input file types listed above, examples: bam vcf.gz fq.xz. See "genozip --help=input" for full list of accepted file types

 

                     This flag should be used when redirecting input data with a < or |, or if the input file type cannot be determined by its file name

 

   -f --force        Force overwrite of the output file, or force writing .genozip data to standard output

 

   -^ --replace      Replace the source file with the result file, rather than leaving it unchanged

 

   -o --output       <output-filename>. This option can also be used to bind multiple input files into a single genozip file. The files can be later unbound with 'genounzip --unbind'. To bind files, they must be of the same type (VCF, SAM etc) and if they

                     are VCF files, they must contain the same samples. genozip takes advantage of similarities between the input files so that the bound file is usually smaller than the combined size of individually compressed files

 

      --best         Best compression, but slower than --fast mode. This is the default mode of genozip - this flag has no additional effect.

 

   -F --fast         Fast compression, but lower compression ratio than --best. Files compressed with this option also uncompress faster. Compressing with this option also consumes less memory.

 

   -p --password     <password>. Password-protected - encrypted with 256-bit AES

 

   -m --md5          Calculate the MD5 digest of the original textual file (vcf, sam...) instead of Adler32. The MD5 is also viewable with genols.

                     Note: for compressed files, e.g. myfile.vcf.gz or myfile.bam, the MD5 calculated is that of the original, uncompressed textual file - myfile.vcf or myfile.sam respectively.

 

   -I --input-size   <file size in bytes> genozip configures its internal data structures to optimize execution speed based on the file size. When redirecting the input file with < or |, genozip cannot determine its size, and this might result in slower

                     execution. This problem can be overcome by using this flag to inform genozip of the file size

 

   -q --quiet        Don't show the progress indicator or warnings

 

   -Q --noisy        The --quiet is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings

 

   -t --test         After compressing normally, decompresss in memory (i.e. without writing the decompressed file to disk) - comparing the MD5 of the resulting textual (vcf, sam) decompressed file to that of the original textual file. This option also

                     activates --md5

 

   -@ --threads      <number>. Specify the maximum number of threads. By default, genozip uses all the threads it needs to maximize usage of all available cores

 

   -B --vblock       <number between 1 and 2048>. Set the maximum size of data (in megabytes) of the textual input (VCF, SAM, FASTQ etc) data that a thread processes at any given time. By default, Genozip sets this value dynamically based on the

                     characateristics of the file, and it is reported in --show-stats. Smaller values will result in faster subsetting with --regions and --grep, while larger values will result in better compression. Note that memory consumption of both

                     genozip and genounzip is linear with the vblock value used for compression

 

   -e --reference    <filename>.ref.genozip Use a reference file - this is a FASTA file genozipped with the --make-reference option. The same reference needs to be provided to genounzip or genocat.

                     While genozip is capabale of compressing without a reference, in the following cases providing a reference may result in better compression:

                     1. FASTQ files

                     2. SAM/BAM files

                     3. VCF files with significant REFALT content (see "% of zip" in --show-stats)

 

   -E --REFERENCE    <filename>.ref.genozip Similar to --reference, except genozip copies the reference (or part of it) to the output file, so there is no need to specify --reference in genounzip and genocat.

                     Note on using with --password: the copy of the reference file stored in the compressed file is never encrypted

 

   --make-reference  Compresss a FASTA file to be used as a reference in --reference or --REFERENCE. Ignored for non-FASTA files

 

   -w --show-stats   Show the internal structure of a genozip file and the associated compression stats

 

   -W --SHOW-STATS   Show more detailed stats

 

   --register        Register (or re-register) a non-commericial license to use genozip

 

FASTQ-specific options (ignored for other file types):

   -2 --pair         Compress pairs of paired-end FASTQ files, resulting in compression ratios better than compressing the files individually. When using this option, every two consecutive files on the file list should be paired-end FASTQ files with an

                     identical number of reads and consistent file names, and --reference or --REFERENCE must be specified. The resulting genozip file is a bound file. To display interleaved, use genocat --interleaved, and to unbind the genozip file back

                     to its original FASTQ files, use genounzip --unbind.

 

FASTA-specific options (ignored for other file types):

   --multifasta      All contigs in the FASTA file are variations of a the same contig (i.e. they are somewhat similar to each other). Genozip uses this information to improve the compression.

 

Optimizing:

   -9 --optimize     Modify the file in ways that are likely insignificant for analytical purposes, but significantly improve compression and somewhat improve the speed of genocat --regions. --optimize activates all these optimizations, or they can be

                     activated individually. These optimizations are:

 

   VCF optimizations: 

   --optimize-sort   - INFO subfields are sorted alphabetically.               Example: AN=21;AC=3 -> AC=3;AN=21

   --optimize-PL     - PL data: Phred values of over 60 are changed to 60.     Example: '0,18,270' -> '0,18,60'

   --optimize-GL     - GL data: Numbers are rounded to 2 significant digits.   Example: '-2.61618,-0.447624,-0.193264' -> '-2.6,-0.45,-0.19'

   --optimize-GP     - GP data: Numbers are rounded to 2 significant digits, as with GL.

   --optimize-VQSLOD - VQSLOD data: Number is rounded to 2 significant digits. Example: '-4.19494' -> '-4.2'

 

   SAM optimizations: 

   --optimize-QUAL   - The QUAL quality field and the secondary U2 quality field (if it exists), are modified to group quality scores into a smaller number of bins:

                       Quality scores of 2-9 are changed to 6; 10-19->15 ; 20-24->22 ; 25-29->27 ..... 85-89->87 ; 90-92->91 ; 93 unchanged

                       This assumes a standard Sanger format of Phred quality scores 0->93 encoded in ASCII 33->126

                       Note: this follows Illumina's quality bins for values up to Phred 39, and extends with additional similar bins for values of 40 and above common in some non-Illumina technologies: https://sapac.illumina.com/content/dam/illumina

                     -marketing/documents/products/technotes/technote_understanding_quality_scores.pdf

                       Example: 'LSVIHINKHK' -> 'IIIIFIIIFI'

   --optimize-ZM     - ZM:B:s data: negative Ion Torrent flow signal values are changed to zero, and positives are rounded to the nearest 10.

                       Example: '-20,212,427' -> '0,210,430'

 

   FASTQ optimizations: 

   --optimize-DESC   - Replaces the description line with '@filename:read_number'

                       Example: '@A00488:61:HMLGNDSXX:4:1101:1561:1000 2:N:0:CTGAAGCT+ATAGAGGC' -> '@sample.fq.gz:100' (100 is the read sequential number within this fastq file)

   --optimize-QUAL   - The quality data is optimized as described for SAM above

 

   GVF optimizations: 

   --optimize-sort   - Attributes are sorted alphabetically.                   Example: Notes=hi;ID=rs12 -> ID=rs12;Notes=hi

   --optimize-Vf     - Variant_freq data: Number is rounded to 2 significant digits. Example: '0.006351' -> '0.0064'

 

                     Note: due to these data modifications, files compressed with --optimize are NOT identical to the original file after decompression. For this reason, it is not possible to use this option in combination with --test or --md5

 

genozip is available for free for non-commercial use and some other limited use cases. See 'genozip -L for details'. Commercial use requires a commercial license

 

Citing: Lan, D., et al. Bioinformatics, Volume 36, Issue 13, July 2020, Pages 4091-4092

 

Bug reports and feature requests: bugs@genozip.com

Commercial license inquiries: sales@genozip.com

Requests for support for compression of additional public or proprietary genomic file formats: sales@genozip.com

 

THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR

ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

> genocat

$ genocat

 

Print original genomic file(s) previously compressed with genozip

 

Usage: genocat [options]... [files]...

 

One or more file names must be given

 

See also: genozip genounzip genols

 

Reference-file related options:

   -e --reference    <filename>.ref.genozip Load a reference file prior to decompressing. Required for files compressed with --reference

 

                     When no non-reference file is specified, display the reference data itself. Typically used in combination with --regions

 

   -E --REFERENCE    <filename>.ref.genozip with no non-reference file specified. Display the reverse complement of the reference data itself. Typically used in combination with --regions

 

   --show-reference  Show the name and MD5 of the reference file that needs to be provided to uncompress this file

 

Subsetting options (options resulting in modified display of the data):

   --downsample      <rate> Show only one in every <rate> lines (or reads in the case of FASTQ). Other subsetting options, if any, will be applied to the surviving lines only.

 

   --interleaved     For FASTQ data compressed with --pair: Show every pair of paired-end FASTQ files with their reads interleaved: first one read of the first file, then a read from the second file, then the next read from the first file and so on.

 

   -r --regions      [^]chr|chr:pos|pos|chr:from-to|chr:from-|chr:-to|from-to|from-|-to|from+len[,...]

   VCF SAM FASTA     Show one or more regions of the file. Examples:

   GVF 23andMe                 genocat myfile.vcf.genozip -r22:1000-2000  (Positions 1000 to 2000 on contig 22)

   ref                         genocat myfile.sam.genozip -r22:1000+151   (151 bases, starting pos 1000, on contig 22)

                               genocat myfile.vcf.genozip -r-2000,2500-   (Two ranges on all contigs)

                               genocat myfile.sam.genozip -rchr21,chr22   (Contigs chr21 and chr22 in their entirety)

                               genocat myfile.vcf.genozip -r^MT,Y         (All contigs, excluding MT and Y)

                               genocat myfile.vcf.genozip -r^-1000        (All contigs, excluding positions up to 1000)

                               genocat myfile.fa.genozip  -rchrM          (Contig chrM)

                     Note: genozip files are indexed automatically during compression. There is no separate indexing step or separate index file

                     Note: Indels are considered part of a region if their start position is

                     Note: Multiple -r arguments may be specified - this is equivalent to chaining their regions with a comma separator in a single argument

                     Note: For FASTA files, only whole-contig regions are possible

 

   -s --samples      [^]sample[,...]

   VCF               Show a subset of samples (individuals). Examples:

                               genocat myfile.vcf.genozip -s HG00255,HG00256    (show two samples)

                               genocat myfile.vcf.genozip -s ^HG00255,HG00256   (show all samples except these two)

                     Note: This does not change the INFO data (including the AC and AN tags)

                     Note: sample names are case-sensitive

                     Note: Multiple -s arguments may be specified - this is equivalent to chaining their samples with a comma separator in a single argument

 

   -g --grep         <string> Show only records in which <string> is a case-sensitive substring of the description

   FASTQ FASTA

 

   --list-chroms     List the names of the chromosomes (or contigs) included in the file

   VCF SAM FASTA GVF 23andMe 

 

   -G --drop-genotypes Output the data without the samples and FORMAT column

   VCF

 

   -H --no-header    Don't output the header lines

 

   -1 --header-one   VCF: Output only the last line on the header (the line with the field and sample names)

   VCF FASTA         FASTA: Output the sequence name up to the first space or tab

 

   --header-only     Output only the header lines

 

   --GT-only         For samples, output only genotype (GT) data, dropping the other subfields

   VCF

 

   --sequential      Output in sequential format - each sequence in a single line

   FASTA

 

Translation options (options resulting convertion of the data from one format to another):

      --bam          (SAM and BAM only) Output as BAM.

                     Note: this option is implicit if --output specifies a filename ending with .bam

 

      --sam          (SAM and BAM only) Output as SAM. This option is the default in genocat on SAM and BAM data.

 

      --fastq        (SAM and BAM only) Output as FASTQ

                     The alignments are outputed as FASTQ reads in the order they appear in the SAM/BAM file. Alignments with FLAG 16 (reverse complimented) have their SEQ reverse complimented and their QUAL reversed. Alignments with FLAG 4 (unmapped) or

                     256 (secondary) are dropped. Alignments with FLAG 64 (or 128) (the first (or last) segment in the template) have a '1' (or '2') added after the read name. Usually, if the original order of the SAM/BAM file has not been tampered with,

                     this would result in a valid interleaved FASTQ file.

                     Note: this option is implicit if --output specifies a filename ending with .fq[.gz] or .fastq[.gz]

 

      --bcf          (VCF only) Output as BCF, using bcftools

                     Note: bcftools needs to be installed for this option to work

 

      --phylip       (FASTA only) Output a Multi-FASTA in Phylip format. All sequences must be the same length

 

      --fasta        (Phylip only) Output as Multi-FASTA

 

      --vcf          (23andMe only) Output as VCF. --vcf must be used in combination with --reference to specify the reference file as listed in the header of the 23andMe file (usually this is GRCh37)

                     Note: INDEL genotypes ('DD', 'DI', 'II') as well as uncalled sites ('--') are discarded

 

General options:

   -o --output       <output -filename>. Output to this filename instead of stdout

 

   -z --bgzf         <level>. Compress the output to the BGZF format (.gz extension) using libdeflate, at the compression level specified in the argument. Argument specifies the compression level from 0 (no compression) to 12 (best, yet slowest,

                     compression). If you're not sure what value to choose, 6 is a popular option.

                     Note: by default, genocat's output is not compressed. Unlike genounzip, genocat makes no attempt to reconstruct the compression level of the original file

 

   -p --password     Provide password to access file(s) that were compressed with --password

 

   -@ --threads      Specify the maximum number of threads. By default, genozip uses all the threads it needs to maximize usage of all available cores

 

   -x --index        Create an index file alongside the decompressed file, when combined with --output. The index file is created using 'samtools index' for SAM/BAM files, 'samtools faidx' for FASTA/FASTQ files and 'bcftools index' for VCF files. Other

                     file formats cannot be indexed

 

   -q --quiet        Don't show warnings

 

   -Q --noisy        The --quiet is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings

 

   -w --show-stats   Show the internal structure of a genozip file and the associated compression stats

 

   -W --SHOW-STATS   Show more detailed stats

 

   -h --help         <topic> Show this help page. Optional <topic> can be:

                     dev - list of developer options

 

   -L --license      Show the license terms and conditions for this product

 

   -V --version      Display version number

 

Tip regarding using genozip files in a pipeline:

 

   Option 1: For tools that support input redirection - use a regular pipe. Example:

             genocat myfile.vcf.genozip | bcftools view -

 

   Option 2: For tools that don't support input redirection - use a named pipe. Example:

             mkfifo mypipe.vcf

             genocat myfile.vcf.genozip > mypipe.vcf &

             othertool mypipe.vcf

 

Citing: Lan, D., et al. Bioinformatics, Volume 36, Issue 13, July 2020, Pages 4091-4092

 

Bug reports and feature requests: bugs@genozip.com

Commercial license inquiries: sales@genozip.com

Requests for support for compression of additional public or proprietary genomic file formats: sales@genozip.com

 

THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR

ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

> genounzip

$ genounzip

 

Uncompress genomic files previously compressed with genozip

 

Usage: genounzip [options]... [files]...

 

One or more file names must be given

Examples: genounzip file1.vcf.genozip file2.sam.genozip

          genounzip file.vcf.genozip --output file.vcf.gz

          genounzip bound.vcf.genozip --unbind

 

See also: genozip genocat genols

 

Options:

   -x --index        Create an index file alongside the decompressed file. The index file is created using 'samtools index' for SAM/BAM files, 'samtools faidx' for FASTA/FASTQ files and 'bcftools index' for VCF files. Other file formats cannot be indexed

 

   -c --stdout       Send output to standard output instead of a file

 

   -z --bgzf         <level>. Compress the output to the BGZF format (.gz extension) using libdeflate, at the compression level specified in the argument. Argument specifies the compression level from 0 (no compression) to 12 (best, yet slowest,

                     compression). If you're not sure what value to choose, 6 is a popular option.

                     Note: by default, absent this option, genozip will attempt to re-create the same BGZF compression as in the original file. Whether genozip succeeds in re-creating the exact same BGZF compression ratio depends on the compression library

                     used by the application that generated the original file.

 

      --no-PG        (SAM and BAM only) When converting a file from SAM to BAM or vice versa, Genozip normally adds a @PG line in the header. With this option, it doesn't

 

   -f --force        Force overwrite of the output file

 

   -^ --replace      Replace the source file with the result file, rather than leaving it unchanged

 

   -u --unbind[=prefix] Split a bound file back to its original components. If the '--unbind=prefix' form is used, a prefix is added to each file component. A prefix may include a directory.

 

   -o --output       <output-filename>. Output to this filename instead of the default one

 

   -e --reference    <filename>.ref.genozip Load a reference file prior to decompressing. Required for files compressed with --reference

 

   -p --password     <password>. Provide password to access file(s) that were compressed with --password

 

   -m --md5          Show the digest of the decompressed file - MD5 if the file was compressed with --md5 or --test and Adler32 if not.

                     Note: for compressed files, e.g. myfile.vcf.gz, the digest calculated is that of the original, uncompressed file. 

 

   -q --quiet        Don't show the progress indicator or warnings

 

   -Q --noisy        The --quiet is turned on by default when outputting to the terminal. --noisy stops the suppression of warnings

 

   -t --test         Decompress in memory (i.e. without writing the decompressed file to disk) and use the digest (MD5 or Adler32) to verify that the resulting decompressed file is identical to the original file.

 

   -@ --threads      <number>. Specify the maximum number of threads. By default, genozip uses all the threads it needs to maximize usage of all available cores

 

   -w --show-stats   Show the internal structure of a genozip file and the associated compression stats

 

   -W --SHOW-STATS   Show more detailed stats

 

   -h --help         <topic> Show this help page. Optional <topic> can be:

                     dev - list of developer options

 

   -L --license      Show the license terms and conditions for this product

 

   -V --version      Display version number

 

Translation options (options resulting convertion of the data from one format to another):

      --bam          (SAM and BAM only) Output as BAM.

                     Note: this option is implicit if --output specifies a filename ending with .bam

 

      --sam          (SAM and BAM only) Output as SAM. This option is the default in genocat on SAM and BAM data.

 

      --fastq        (SAM and BAM only) Output as FASTQ

                     The alignments are outputed as FASTQ reads in the order they appear in the SAM/BAM file. Alignments with FLAG 16 (reverse complimented) have their SEQ reverse complimented and their QUAL reversed. Alignments with FLAG 4 (unmapped) or

                     256 (secondary) are dropped. Alignments with FLAG 64 (or 128) (the first (or last) segment in the template) have a '1' (or '2') added after the read name. Usually, if the original order of the SAM/BAM file has not been tampered with,

                     this would result in a valid interleaved FASTQ file.

                     Note: this option is implicit if --output specifies a filename ending with .fq[.gz] or .fastq[.gz]

 

      --bcf          (VCF only) Output as BCF, using bcftools

                     Note: bcftools needs to be installed for this option to work

 

      --phylip       (FASTA only) Output a Multi-FASTA in Phylip format. All sequences must be the same length

 

      --fasta        (Phylip only) Output as Multi-FASTA

 

      --vcf          (23andMe only) Output as VCF. --vcf must be used in combination with --reference to specify the reference file as listed in the header of the 23andMe file (usually this is GRCh37)

                     Note: INDEL genotypes ('DD', 'DI', 'II') as well as uncalled sites ('--') are discarded

 

 

Citing: Lan, D., et al. Bioinformatics, Volume 36, Issue 13, July 2020, Pages 4091-4092

 

Bug reports and feature requests: bugs@genozip.com

Commercial license inquiries: sales@genozip.com

Requests for support for compression of additional public or proprietary genomic file formats: sales@genozip.com

 

THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR

ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

> genols 

$ genols -h

 

View metadata of genomic files previously compressed with genozip

 

Usage: genols [options]... [files or directories]...

 

One or more file or directory names may be given, or if omitted, genols runs on the current directory

 

See also: genozip genounzip genocat

 

Options:

   -u --unbind       Show the components of bound files. This option is implied when running genols on a single file

 

   -b --bytes        Show sizes in bytes

 

   -q --quiet        Don't show warnings

 

   -h --help         Show this help page

 

   -L --license      Show the license terms and conditions for this product

 

   -V --version      Display version number

 

Citing: Lan, D., et al. Bioinformatics, Volume 36, Issue 13, July 2020, Pages 4091-4092

 

Bug reports and feature requests: bugs@genozip.com

Commercial license inquiries: sales@genozip.com

Requests for support for compression of additional public or proprietary genomic file formats: sales@genozip.com

 

THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR

ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

 

 

 

実行方法

初回は端末上で非商用の利用であるかの確認と、ユーザー登録のためのメッセージが表示される。最後に登録するのメッセージが出る。登録しないと使用できない。

 

 fastqを圧縮する。--replaceをつけると元のファイルは消える。

genozip input.fastq --replace

#fastq.gzをさらに圧縮(データによっては2/3くらいになる)
genozip input.fq.gz

#複数ファイル(個別に圧縮される)
genozip fastq_dir/*.fq.gz
  • --replace      Replace the source file with the result file, rather than leaving it unchanged 
  • -F --fast Fast compression,   but lower compression ratio than --best. Files compressed with this option also uncompress faster. Compressing with this option also consumes less memory.
  • -p  --password <password>.  Password-protected - encrypted with 256-bit AES
  • -@ --threads <number>.  Specify the maximum number of threads. By default, genozip uses all the threads it needs to maximize usage of all available cores

input.fastq.genozipができる。gzip圧縮ファイルも.gzが消えてinput.fastq.genozipとして出力される。

 

解凍する。--replaceをつけると元のファイルは消える。gzip圧縮ファイルをさらに圧縮した場合も解凍するとraw  fastqに解凍される。

genounzip input.fastq.genozip --replace
  • -c --stdout    Send output to standard output instead of a file
  • --replace      Replace the source file with the result file, rather than leaving it unchanged

 

リファレンスゲノム(fasta)とそのリファレンスに対応するペアエンドfastqを圧縮する。 単体で行うより圧縮効率が高くなる(よりファイルサイズが小さくなる)。

#まずはリファレンスを圧縮(少し時間がかかるがリファレンスは1回圧縮すれば使い回せる)
genozip --make-reference genome.fasta
#=> genome.ref.genozipができる

#次にペアエンドfastqを圧縮、paired-end fastqとgenome.ref.genozipを指定する。
genozip -2 pair_1.fq pair_2.fq -e genome.ref.genozip
  • --make-reference   Compresss a FASTA file to be used as a reference in --reference or --REFERENCE. Ignored for non-FASTA files
  • -e --reference <filename>.  ref.genozip Use a reference file - this is a FASTA file genozipped with the --make-reference option. The same reference needs to be provided to genounzip or genocat.
    While genozip is capabale of compressing without a reference, in the following cases providing a reference may result in better compression:
    1. FASTQ files
    2. SAM/BAM files
    3. VCF files with significant REFALT content (see "% of zip" in --show-stats)

  • -E --REFERENCE <filename>.  ref.genozip Similar to --reference, except genozip copies the reference (or part of it) to the output file, so there is no need to specify --reference in genounzip and genocat.
    Note on using with --password: the copy of the reference file stored in the compressed file is never encrypted

  • -2 --pair   Compress pairs of paired-end FASTQ files, resulting in compression ratios better than compressing the files individually. When using this option, every two consecutive files on the file list should be paired-end FASTQ files with an identical number of reads and consistent file names, and --reference or --REFERENCE must be specified. The resulting genozip file is a bound file. To display interleaved, use genocat --interleaved, and to unbind the genozip file back to its original FASTQ files, use genounzip --unbind. 

pair_1+2.fastq.genozipが出力される。

 

Genozip圧縮ファイルのメタデータを確認する。

genols pair_1+2.fastq.genozip

 

2021 7/3

Dual coordinate VCFの作成。

genozip --chain mychain.chain.genozip mydata.vcf

 

 

fastqの例を書きましたが、他にもたくさんのファイルタイプやファイルタイプの組み合わせに対応しています。サポートされているファイルタイプの完全なリストは 'genozip --help=input'で確認できます。間違って2回紹介してしまいましたが、残しておきます。

引用

Genozip - A Universal Extensible Genomic Data Compressor
Divon Lan, Ray Tobler, Yassine Souilmi, Bastien Llamas Author Notes
Bioinformatics, Published: 15 February 2021

 

追記

2022/09

Genozip 14 - advances in compression of BAM and CRAM files
Divon Lan,  Bastien Llamas

bioRxiv, posted September 14, 2022

 

関連

 

自分用

カレント以下のfq.gzを全て圧縮

find . -name "*.fq.gz" -type f -print0 | while read -r -d '' file; do
genozip "$file" "--replace"; done