2019 8/5 bcftools help追加
2019 8/30追記
2019 11/11追記
2020 3/20 bowtiee2コマンド修正
2021 5/24 dockerhubのイメージへのリンク追加
変異株のリファレンスをゲノムに当て、その個体についてコンセンサス配列を作成したいことがある。 これはbcftoolsのconsensusコマンドを使って実行可能である。
https://samtools.github.io/bcftools/howtos/consensus-sequence.html
インストール
bcftoolsとsamtolosはcondaで導入できる。 バージョンによって使えるサブコマンドが異なるので注意する。
#bioconda (link)
conda install -c bioconda -y bcftools
#docker image(biocontainers)
#v1.9
docker pull biocontainers/bcftools:v1.9-1-deb_cv1
#v1.5
docker pull biocontainers/bcftools:v1.5_cv3
#docker image(mgibio/bcftools-cwl)
#v1.12
docker pull mgibio/bcftools-cwl:1.12
> bcftools
$ bcftools
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
Version: 1.9 (using htslib 1.9)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
sort sort VCF/BCF file
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
csq call variation consequences
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
mpileup multi-way pileup producing genotype likelihoods
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.
> bcftools mpileup
$ bcftools mpileup
Usage: bcftools mpileup [options] in1.bam [in2.bam [...]]
Input options:
-6, --illumina1.3+ quality is in the Illumina-1.3+ encoding
-A, --count-orphans do not discard anomalous read pairs
-b, --bam-list FILE list of input BAM filenames, one per line
-B, --no-BAQ disable BAQ (per-Base Alignment Quality)
-C, --adjust-MQ INT adjust mapping quality; recommended:50, disable:0 [0]
-d, --max-depth INT max per-file depth; avoids excessive memory usage [250]
-E, --redo-BAQ recalculate BAQ on the fly, ignore existing BQs
-f, --fasta-ref FILE faidx indexed reference sequence file
--no-reference do not require fasta reference file
-G, --read-groups FILE select or exclude read groups listed in the file
-q, --min-MQ INT skip alignments with mapQ smaller than INT [0]
-Q, --min-BQ INT skip bases with baseQ/BAQ smaller than INT [13]
-r, --regions REG[,...] comma separated list of regions in which pileup is generated
-R, --regions-file FILE restrict to regions listed in a file
--ignore-RG ignore RG tags (one BAM = one sample)
--rf, --incl-flags STR|INT required flags: skip reads with mask bits unset
--ff, --excl-flags STR|INT filter flags: skip reads with mask bits set
[UNMAP,SECONDARY,QCFAIL,DUP]
-s, --samples LIST comma separated list of samples to include
-S, --samples-file FILE file of samples to include
-t, --targets REG[,...] similar to -r but streams rather than index-jumps
-T, --targets-file FILE similar to -R but streams rather than index-jumps
-x, --ignore-overlaps disable read-pair overlap detection
Output options:
-a, --annotate LIST optional tags to output; '?' to list
-g, --gvcf INT[,...] group non-variant sites into gVCF blocks according
to minimum per-sample DP
--no-version do not append version and command line to the header
-o, --output FILE write output to FILE [standard output]
-O, --output-type TYPE 'b' compressed BCF; 'u' uncompressed BCF;
'z' compressed VCF; 'v' uncompressed VCF [v]
--threads INT number of extra output compression threads [0]
SNP/INDEL genotype likelihoods options:
-e, --ext-prob INT Phred-scaled gap extension seq error probability [20]
-F, --gap-frac FLOAT minimum fraction of gapped reads [0.002]
-h, --tandem-qual INT coefficient for homopolymer errors [100]
-I, --skip-indels do not perform indel calling
-L, --max-idepth INT maximum per-file depth for INDEL calling [250]
-m, --min-ireads INT minimum number gapped reads for indel candidates [1]
-o, --open-prob INT Phred-scaled gap open seq error probability [40]
-p, --per-sample-mF apply -m and -F per-sample for increased sensitivity
-P, --platforms STR comma separated list of platforms for indels [all]
Notes: Assuming diploid individuals.
> bcftools norm
$ bcftools norm
About: Left-align and normalize indels; check if REF alleles match the reference;
split multiallelic sites into multiple rows; recover multiallelics from
multiple rows.
Usage: bcftools norm [options] <in.vcf.gz>
Options:
-c, --check-ref <e|w|x|s> check REF alleles and exit (e), warn (w), exclude (x), or set (s) bad sites [e]
-D, --remove-duplicates remove duplicate lines of the same type.
-d, --rm-dup <type> remove duplicate snps|indels|both|all|none
-f, --fasta-ref <file> reference sequence (MANDATORY)
-m, --multiallelics <-|+>[type] split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both]
--no-version do not append version and command line to the header
-N, --do-not-normalize do not normalize indels (with -m or -c s)
-o, --output <file> write output to a file [standard output]
-O, --output-type <type> 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --strict-filter when merging (-m+), merged site is PASS only if all sites being merged PASS
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
--threads <int> number of extra (de)compression threads [0]
-w, --site-win <int> buffer for sorting lines which changed position during realignment [1000]
> bcftools filter
$ bcftools filter
About: Apply fixed-threshold filters.
Usage: bcftools filter [options] <in.vcf.gz>
Options:
-e, --exclude <expr> exclude sites for which the expression is true (see man page for details)
-g, --SnpGap <int> filter SNPs within <int> base pairs of an indel
-G, --IndelGap <int> filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
-i, --include <expr> include only sites for which the expression is true (see man page for details
-m, --mode [+x] "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
--no-version do not append version and command line to the header
-o, --output <file> write output to a file [standard output]
-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --soft-filter <string> annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
-S, --set-GTs <.|0> set genotypes of failed samples to missing (.) or ref (0)
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
--threads <int> number of extra output compression threads [0]
> bcftools index
About: Index bgzip compressed VCF/BCF files for random access.
Usage: bcftools index [options] <in.bcf>|<in.vcf.gz>
Indexing options:
-c, --csi generate CSI-format index for VCF/BCF files [default]
-f, --force overwrite index if it already exists
-m, --min-shift INT set minimal interval size for CSI indices to 2^INT [14]
-o, --output-file FILE optional output index file name
-t, --tbi generate TBI-format index for VCF files
--threads sets the number of threads [0]
Stats options:
-n, --nrecords print number of records based on existing index file
-s, --stats print per contig stats based on existing index file
> bcftools call
$ bcftools call
About: SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.
This command replaces the former "bcftools view" caller. Some of the original
functionality has been temporarily lost in the process of transition to htslib,
but will be added back on popular demand. The original calling model can be
invoked with the -c option.
Usage: bcftools call [options] <in.vcf.gz>
File format options:
--no-version do not append version and command line to the header
-o, --output <file> write output to a file [standard output]
-O, --output-type <b|u|z|v> output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
--ploidy <assembly>[?] predefined ploidy, 'list' to print available settings, append '?' for details
--ploidy-file <file> space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --samples <list> list of samples to include [all samples]
-S, --samples-file <file> PED file or a file with an optional column with sex (see man page for details) [all samples]
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
--threads <int> number of extra output compression threads [0]
Input/output options:
-A, --keep-alts keep all possible alternate alleles at variant sites
-f, --format-fields <list> output format fields: GQ,GP (lowercase allowed) []
-F, --prior-freqs <AN,AC> use prior allele frequencies
-g, --gvcf <int>,[...] group non-variant sites into gVCF blocks by minimum per-sample DP
-i, --insert-missed output also sites missed by mpileup but present in -T
-M, --keep-masked-ref keep sites with masked reference allele (REF=N)
-V, --skip-variants <type> skip indels/snps
-v, --variants-only output variant sites only
Consensus/variant calling options:
-c, --consensus-caller the original calling method (conflicts with -m)
-C, --constrain <str> one of: alleles, trio (see manual)
-m, --multiallelic-caller alternative model for multiallelic and rare-variant calling (conflicts with -c)
-n, --novel-rate <float>,[...] likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
-p, --pval-threshold <float> variant if P(ref|D)<FLOAT with -c [0.5]
-P, --prior <float> mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
実行方法
1、mapping。ここではbwaのmem、minimap2、smalt、bowtie2(sensitive setting)を試す。
#minimap2
minimap2 -t 12 -ax sr reference.fa pair_1.fq.gz pair_2.fq.gz |samtools sort -@ 12 -O BAM > minimap2.bam
#bwa mem
bwa index -a is reference.fa
bwa mem -t 12 reference.fa pair_1.fq.gz pair_2.fq.gz |samtools sort -@ 12 -O BAM > bwa.bam
#smalt (link)
smalt index smalt_index reference.fa
smalt map -n 18 smalt_index pair_1.fq.gz pair_2.fq.gz |samtools sort -@ 12 -O BAM - > smalt.bam
#bowtie2
bowtie2-build -f reference.fa bowtie2_index
bowtie2 -p 12 --sensitive -x bowtie2_index -1 pair_1.fq.gz -2 pair_2.fq.gz |samtools sort -@ 12 -O BAM -> bowtie2.bam
2、variant call
#variant call
bcftools mpileup -Ou -f reference.fa minimap2.bam | bcftools call -mv -Oz -o minimap2_calls.vcf.gz
bcftools mpileup -Ou -f reference.fa bwa.bam | bcftools call -mv -Oz -o bwa_calls.vcf.gz
bcftools mpileup -Ou -f reference.fa smalt.bam | bcftools call -mv -Oz -o smalt_calls.vcf.gz
bcftools mpileup -Ou -f reference.fa bowtie2.bam | bcftools call -mv -Oz -o bowtie2_calls.vcf.gz
#indexing
bcftools index minimap2_calls.vcf.gz
bcftools index bwa_calls.vcf.gz
bcftools index smalt_calls.vcf.gz
bcftools index bowtie2_calls.vcf.gz
#分岐ポイント
#freebayesを使ってhomoの変異だけ残す。例えばdepth≥4対象にする。
freebayes -C 4 -u -p 2 -f reference.fa mapped.bam > freebayes.vcf
#homo
cat freebayes.vcf | vcffilter -g "GT = 1/1" | vcffixup - | vcffilter -f "AC > 0" > freebayes_homo.vcf
#圧縮
bgzip freebayes_homo.vcf
bcftools mpileup
- -O, --output-type <TYPE> 'b' compressed BCF; 'u' uncompressed BCF;
'z' compressed VCF; 'v' uncompressed VCF [v] - -o, --output <FILE> write output to FILE [standard output]
- -f, --fasta-ref <FILE> faidx indexed reference sequence file
bcftools call
- -m, --multiallelic-caller alternative model for multiallelic and rare-variant calling (conflicts with -c)
- -v, --variants-only output variant sites only
3、normalize indels
bcftools norm -f reference.fa minimap2_calls.vcf.gz -Ob -o minimap2_calls.norm.bcf
bcftools norm -f reference.fa bwa_calls.vcf.gz -Ob -o bwa_calls.norm.bcf
bcftools norm -f reference.fa smalt_calls.vcf.gz -Ob -o smalt_calls.norm.bcf
bcftools norm -f reference.fa bowtie2_calls.vcf.gz -Ob -o bowite2_calls.norm.bcf
bcftools norm -f reference.fa bowtie2_calls.vcf.gz -Ob -o bowite2_calls.norm.bcf
4、filter adjacent indels within 5bp
-Ozをつけて出力はvcf.gzとする
bcftools filter --IndelGap 5 minimap2_calls.norm.bcf -Oz -o minimap2_calls.norm.flt-indels.vcf.gz
bcftools filter --IndelGap 5 bwa_calls.norm.bcf -Oz -o bwa_calls.norm.flt-indels.vcf.gz
bcftools filter --IndelGap 5 smalt_calls.norm.bcf -Oz -o smalt_calls.norm.flt-indels.vcf.gz
bcftools filter --IndelGap 5 bowtie2_calls.vcf.gz -Oz -o bowtie2_calls.norm.flt-indels.vcf.gz
#indexing
bcftools index minimap2_calls.norm.flt-indels.vcf.gz
bcftools index bwa_calls.norm.flt-indels.vcf.gz
bcftools index smalt_calls.norm.flt-indels.vcf.gz
bcftools index bowtie2_calls.norm.flt-indels.vcf.gz
5、 consensus call
cat reference.fa | bcftools consensus minimap2_calls.norm.flt-indels.vcf.gz > minimap2_consensus.fa
cat reference.fa | bcftools consensus bwa_calls.norm.flt-indels.vcf.gz > bwa_consensus.fa
cat reference.fa | bcftools consensus smalt_calls.norm.flt-indels.vcf.gz > smalt_consensus.fa
cat reference.fa | bcftools consensus bowtie2_calls.norm.flt-indels.vcf.gz > bowtie2_consensus.fa
参考
http://samtools.sourceforge.net/cns0.shtml
samtools issue #899
How to use samtools and bcftools to get consensus sequence
https://github.com/samtools/samtools/issues/899
追記
パイルアップしてコンセンサスコール、fastq出力(注意。廃止されたコマンドもある)。
samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
bcftools callのploidyはdefaultで2になっている。プリセットのploidyを調べるには
"bcftools call --ploidy print"を打つ。
関連
2021 6/2
こちらの査読前論文では、 bwa mem => samtools sort => Deepvariantでバリアントコールしてコンセンセス配列を出力することでエラーをポリッシュしたゲノムを作成しています。