2020 10/14 タイトル変更
manual
http://samtools.github.io/bcftools/bcftools.html
BCFtools HowTo
Extracting information from VCFs
bcftoolsのインストール
#bioconda (link)
conda install -c bioconda -y bcftools
> bcftools
# bcftools
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.8 (using htslib 1.8)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
sort sort VCF/BCF file
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
csq call variation consequences
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
mpileup multi-way pileup producing genotype likelihoods
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.
> bcftools query
# bcftools query
About: Extracts fields from VCF/BCF file and prints them in user-defined format
Usage: bcftools query [options] <A.vcf.gz> [<B.vcf.gz> [...]]
Options:
-e, --exclude <expr> exclude sites for which the expression is true (see man page for details)
-f, --format <string> see man page for details
-H, --print-header print header
-i, --include <expr> select sites for which the expression is true (see man page for details)
-l, --list-samples print the list of samples and exit
-o, --output-file <file> output file name [stdout]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --samples <list> list of samples to include
-S, --samples-file <file> file of samples to include
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
-u, --allow-undef-tags print "." for undefined tags
-v, --vcf-list <file> process multiple VCFs listed in the file
Examples:
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
> bcftools view
# bcftools view
About: VCF/BCF conversion, view, subset and filter VCF/BCF files.
Usage: bcftools view [options] <in.vcf.gz> [region1 [...]]
Output options:
-G, --drop-genotypes drop individual genotype information (after subsetting if -s option set)
-h/H, --header-only/--no-header print the header only/suppress the header in VCF output
-l, --compression-level [0-9] compression level: 0 uncompressed, 1 best speed, 9 best compression [-1]
--no-version do not append version and command line to the header
-o, --output-file <file> output file name [stdout]
-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-t, --targets [^]<region> similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
-T, --targets-file [^]<file> similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix
--threads <int> number of extra (de)compression threads [0]
Subset options:
-a, --trim-alt-alleles trim alternate alleles not seen in the subset
-I, --no-update do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
-s, --samples [^]<list> comma separated list of samples to include (or exclude with "^" prefix)
-S, --samples-file [^]<file> file of samples to include (or exclude with "^" prefix)
--force-samples only warn about unknown subset samples
Filter options:
-c/C, --min-ac/--max-ac <int>[:<type>] minimum/maximum count for non-reference (nref), 1st alternate (alt1), least frequent
(minor), most frequent (major) or sum of all but most frequent (nonmajor) alleles [nref]
-f, --apply-filters <list> require at least one of the listed FILTER strings (e.g. "PASS,.")
-g, --genotype [^]<hom|het|miss> require one or more hom/het/missing genotype or, if prefixed with "^", exclude sites with hom/het/missing genotypes
-i/e, --include/--exclude <expr> select/exclude sites for which the expression is true (see man page for details)
-k/n, --known/--novel select known/novel sites only (ID is not/is '.')
-m/M, --min-alleles/--max-alleles <int> minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
-p/P, --phased/--exclude-phased select/exclude sites where all samples are phased
-q/Q, --min-af/--max-af <float>[:<type>] minimum/maximum frequency for non-reference (nref), 1st alternate (alt1), least frequent
(minor), most frequent (major) or sum of all but most frequent (nonmajor) alleles [nref]
-u/U, --uncalled/--exclude-uncalled select/exclude sites without a called genotype
-v/V, --types/--exclude-types <list> select/exclude comma-separated list of variant types: snps,indels,mnps,ref,bnd,other [null]
-x/X, --private/--exclude-private select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples
実行方法
テストデータ
#test data
git clone https://github.com/samtools/bcftools.git
cd bcftools/test/
先にvcf <=> bcf変換などができるbcftools viewをまとめておく(他にも機能あり)。
#vcf => bcf
bcftools view -Ou input.vcf -o output.bcf
#bcf => vcf
bcftools view -Ov input.bcf -o output.vcf
#vcf => vcf.gz
bcftools view -Oz input.vcf -o output.vcf.gz
#bcf => vcf.gz
bcftools view -Oz input.bcf -o output.vcf.gz
#vcf => compressed bcf
bcftools view -Ob input.vcf -o output.bcf.gz
#bcf => compressed bcf
bcftools view -Ob input.bcf -o output.bcf.gz
#indexing(tabix)installするにはhtslibをビルドするかcondaを使う(*1)
tabix -p vcf output.vcf.gz
=> output.vcf.gz.tbiができる
#BGZF compressin and create index (bgzip)installするにはhtslibをビルドするかcondaを使う(*2)
bgzip -i output.bcf
=> output.bcf.gz.gbiができる(-@ 4だと4スレッド使用)
- -O <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
vcf/bcfに含まれるサンプルのリスト
#-lをつけて実行
bcftools query -l mpileup.vcf
- -l print the list of samples and exit
# bcftools query -l mpileup.vcf
HG00100
HG00101
HG00102
VCFの必要なフィールドだけ取り出す。シンプルなタブ区切りリストに変換されるので、全てのフィールドを指定することでVCF/BCF => タブ区切り変換コマンドとしても機能する。
#bcf/vcfからPOSフィールドだけ取り出す
bcftools query -f '%POS\n' file.bcf
#chr、positon、 ref, alt(allele)だけ取り出す。
bcftools query -f '%CHROM %POS %REF %ALT\n' file.vcf
#FORMATタグの一部を取り出す場合は[]で囲んで指定
bcftools query -f '%CHROM %POS[\t%AF\t%DP]\n' file.bcf
#手動で全てのフィールドを指定(異なるフィールドがあれば修正してください)
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[\t%AC\t%AF\t%AN\t%BaseQRankSum\t%DP\t%ExcessHet\t%FS\t%MLEAF\t%MQ\t%MQRankSum\t%QD\t%ReadPosRankSum\t%SOR\t%ANN]\n' inpuy.bcf.gz
- -f, --apply-filters <list> require at least one of the listed FILTER strings (e.g. "PASS,.")
引用
https://samtools.github.io/bcftools/bcftools.html
関連
*1, *2
conda install -c bioconda -y tabix
#from source (v1.11)
wget https://github.com/samtools/htslib/releases/download/1.11/htslib-1.11.tar.bz2
tar -jxvf htslib-1.11.tar.bz2
cd htslib-1.11/
make -j