macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

vcf/bcfから必要なフィールドだけ問い合わせる bcftoolsのqueryコマンド

2020 10/14 タイトル変更

 

manual

http://samtools.github.io/bcftools/bcftools.html

BCFtools HowTo

Extracting information from VCFs

 

bcftoolsのインストール

Github

#bioconda (link)
conda install -c bioconda -y bcftools

bcftools

# bcftools 

 

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)

Version: 1.8 (using htslib 1.8)

 

Usage:   bcftools [--version|--version-only] [--help] <command> <argument>

 

Commands:

 

 -- Indexing

    index        index VCF/BCF files

 

 -- VCF/BCF manipulation

    annotate     annotate and edit VCF/BCF files

    concat       concatenate VCF/BCF files from the same set of samples

    convert      convert VCF/BCF files to different formats and back

    isec         intersections of VCF/BCF files

    merge        merge VCF/BCF files files from non-overlapping sample sets

    norm         left-align and normalize indels

    plugin       user-defined plugins

    query        transform VCF/BCF into user-defined formats

    reheader     modify VCF/BCF header, change sample names

    sort         sort VCF/BCF file

    view         VCF/BCF conversion, view, subset and filter VCF/BCF files

 

 -- VCF/BCF analysis

    call         SNP/indel calling

    consensus    create consensus sequence by applying VCF variants

    cnv          HMM CNV calling

    csq          call variation consequences

    filter       filter VCF/BCF files using fixed thresholds

    gtcheck      check sample concordance, detect sample swaps and contamination

    mpileup      multi-way pileup producing genotype likelihoods

    roh          identify runs of autozygosity (HMM)

    stats        produce VCF/BCF stats

 

 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected

 automatically even when streaming from a pipe. Indexed VCF and BCF will work

 in all situations. Un-indexed VCF and BCF and streams will work in most but

 not all situations.

 

> bcftools query

# bcftools query

 

About:   Extracts fields from VCF/BCF file and prints them in user-defined format

Usage:   bcftools query [options] <A.vcf.gz> [<B.vcf.gz> [...]]

 

Options:

    -e, --exclude <expr>              exclude sites for which the expression is true (see man page for details)

    -f, --format <string>             see man page for details

    -H, --print-header                print header

    -i, --include <expr>              select sites for which the expression is true (see man page for details)

    -l, --list-samples                print the list of samples and exit

    -o, --output-file <file>          output file name [stdout]

    -r, --regions <region>            restrict to comma-separated list of regions

    -R, --regions-file <file>         restrict to regions listed in a file

    -s, --samples <list>              list of samples to include

    -S, --samples-file <file>         file of samples to include

    -t, --targets <region>            similar to -r but streams rather than index-jumps

    -T, --targets-file <file>         similar to -R but streams rather than index-jumps

    -u, --allow-undef-tags            print "." for undefined tags

    -v, --vcf-list <file>             process multiple VCFs listed in the file

 

Examples:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz

> bcftools view

# bcftools view             

 

About:   VCF/BCF conversion, view, subset and filter VCF/BCF files.

Usage:   bcftools view [options] <in.vcf.gz> [region1 [...]]

 

Output options:

    -G,   --drop-genotypes              drop individual genotype information (after subsetting if -s option set)

    -h/H, --header-only/--no-header     print the header only/suppress the header in VCF output

    -l,   --compression-level [0-9]     compression level: 0 uncompressed, 1 best speed, 9 best compression [-1]

          --no-version                  do not append version and command line to the header

    -o,   --output-file <file>          output file name [stdout]

    -O,   --output-type <b|u|z|v>       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]

    -r, --regions <region>              restrict to comma-separated list of regions

    -R, --regions-file <file>           restrict to regions listed in a file

    -t, --targets [^]<region>           similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix

    -T, --targets-file [^]<file>        similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix

        --threads <int>                 number of extra (de)compression threads [0]

 

Subset options:

    -a, --trim-alt-alleles        trim alternate alleles not seen in the subset

    -I, --no-update               do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)

    -s, --samples [^]<list>       comma separated list of samples to include (or exclude with "^" prefix)

    -S, --samples-file [^]<file>  file of samples to include (or exclude with "^" prefix)

        --force-samples           only warn about unknown subset samples

 

Filter options:

    -c/C, --min-ac/--max-ac <int>[:<type>]      minimum/maximum count for non-reference (nref), 1st alternate (alt1), least frequent

                                                   (minor), most frequent (major) or sum of all but most frequent (nonmajor) alleles [nref]

    -f,   --apply-filters <list>                require at least one of the listed FILTER strings (e.g. "PASS,.")

    -g,   --genotype [^]<hom|het|miss>          require one or more hom/het/missing genotype or, if prefixed with "^", exclude sites with hom/het/missing genotypes

    -i/e, --include/--exclude <expr>            select/exclude sites for which the expression is true (see man page for details)

    -k/n, --known/--novel                       select known/novel sites only (ID is not/is '.')

    -m/M, --min-alleles/--max-alleles <int>     minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)

    -p/P, --phased/--exclude-phased             select/exclude sites where all samples are phased

    -q/Q, --min-af/--max-af <float>[:<type>]    minimum/maximum frequency for non-reference (nref), 1st alternate (alt1), least frequent

                                                   (minor), most frequent (major) or sum of all but most frequent (nonmajor) alleles [nref]

    -u/U, --uncalled/--exclude-uncalled         select/exclude sites without a called genotype

    -v/V, --types/--exclude-types <list>        select/exclude comma-separated list of variant types: snps,indels,mnps,ref,bnd,other [null]

    -x/X, --private/--exclude-private           select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples

 

 

 

実行方法

テストデータ

#test data
git clone https://github.com/samtools/bcftools.git
cd bcftools/test/

 

bcftools view

先にvcf <=> bcf変換などができるbcftools viewをまとめておく(他にも機能あり)。

#vcf => bcf
bcftools view -Ou input.vcf -o output.bcf

#bcf => vcf
bcftools view -Ov input.bcf -o output.vcf

#vcf => vcf.gz
bcftools view -Oz input.vcf -o output.vcf.gz

#bcf => vcf.gz
bcftools view -Oz input.bcf -o output.vcf.gz

#vcf => compressed bcf
bcftools view -Ob input.vcf -o output.bcf.gz

#bcf => compressed bcf
bcftools view -Ob input.bcf -o output.bcf.gz

#indexing(tabix)installするにはhtslibをビルドするかcondaを使う(*1)
tabix -p vcf output.vcf.gz
=> output.vcf.gz.tbiができる

#BGZF compressin and create index (bgzip)installするにはhtslibをビルドするかcondaを使う(*2)
bgzip -i output.bcf
=> output.bcf.gz.gbiができる(-@ 4だと4スレッド使用)
  • -O <b|u|z|v>       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]

 

bcftools query

vcf/bcfに含まれるサンプルのリスト

#-lをつけて実行
bcftools query -l mpileup.vcf 
  • -l     print the list of samples and exit

# bcftools query -l mpileup.vcf 

HG00100

HG00101

HG00102

 

VCFの必要なフィールドだけ取り出す。シンプルなタブ区切りリストに変換されるので、全てのフィールドを指定することでVCF/BCF => タブ区切り変換コマンドとしても機能する。

#bcf/vcfからPOSフィールドだけ取り出す
bcftools query -f '%POS\n' file.bcf

#chr、positon、 ref, alt(allele)だけ取り出す。
bcftools query -f '%CHROM %POS %REF %ALT\n' file.vcf

#FORMATタグの一部を取り出す場合は[]で囲んで指定
bcftools query -f '%CHROM %POS[\t%AF\t%DP]\n' file.bcf

#手動で全てのフィールドを指定(異なるフィールドがあれば修正してください)
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[\t%AC\t%AF\t%AN\t%BaseQRankSum\t%DP\t%ExcessHet\t%FS\t%MLEAF\t%MQ\t%MQRankSum\t%QD\t%ReadPosRankSum\t%SOR\t%ANN]\n' inpuy.bcf.gz
  •  -f,   --apply-filters <list>   require at least one of the listed FILTER strings (e.g. "PASS,.")

 

 引用

https://samtools.github.io/bcftools/bcftools.html

 

関連


*1, *2

conda install -c bioconda -y tabix

#from source (v1.11)
wget https://github.com/samtools/htslib/releases/download/1.11/htslib-1.11.tar.bz2
tar -jxvf htslib-1.11.tar.bz2
cd htslib-1.11/
make -j