macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

マッピングからコンセンサス配列を出力するbcftoolsのconsensusコマンド

2019 8/5 bcftools help追加

2019 8/30追記

2019 11/11追記

2020 3/20 bowtiee2コマンド修正

2021 5/24 dockerhubのイメージへのリンク追加

 

変異株のリファレンスをゲノムに当て、その個体についてコンセンサス配列を作成したいことがある。 これはbcftoolsのconsensusコマンドを使って実行可能である。 

https://samtools.github.io/bcftools/howtos/consensus-sequence.html

 

インストール

bcftoolsとsamtolosはcondaで導入できる。 バージョンによって使えるサブコマンドが異なるので注意する。

#bioconda (link)
conda install -c bioconda -y bcftools

#docker image(biocontainers)
#v1.9
docker pull biocontainers/bcftools:v1.9-1-deb_cv1
#v1.5
docker pull biocontainers/bcftools:v1.5_cv3

#docker image(mgibio/bcftools-cwl)
#v1.12
docker pull mgibio/bcftools-cwl:1.12

bcftools

$ bcftools

 

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)

Version: 1.9 (using htslib 1.9)

 

Usage:   bcftools [--version|--version-only] [--help] <command> <argument>

 

Commands:

 

 -- Indexing

    index        index VCF/BCF files

 

 -- VCF/BCF manipulation

    annotate     annotate and edit VCF/BCF files

    concat       concatenate VCF/BCF files from the same set of samples

    convert      convert VCF/BCF files to different formats and back

    isec         intersections of VCF/BCF files

    merge        merge VCF/BCF files files from non-overlapping sample sets

    norm         left-align and normalize indels

    plugin       user-defined plugins

    query        transform VCF/BCF into user-defined formats

    reheader     modify VCF/BCF header, change sample names

    sort         sort VCF/BCF file

    view         VCF/BCF conversion, view, subset and filter VCF/BCF files

 

 -- VCF/BCF analysis

    call         SNP/indel calling

    consensus    create consensus sequence by applying VCF variants

    cnv          HMM CNV calling

    csq          call variation consequences

    filter       filter VCF/BCF files using fixed thresholds

    gtcheck      check sample concordance, detect sample swaps and contamination

    mpileup      multi-way pileup producing genotype likelihoods

    roh          identify runs of autozygosity (HMM)

    stats        produce VCF/BCF stats

 

 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected

 automatically even when streaming from a pipe. Indexed VCF and BCF will work

 in all situations. Un-indexed VCF and BCF and streams will work in most but

 not all situations.

bcftools mpileup

$ bcftools mpileup

 

Usage: bcftools mpileup [options] in1.bam [in2.bam [...]]

 

Input options:

  -6, --illumina1.3+      quality is in the Illumina-1.3+ encoding

  -A, --count-orphans     do not discard anomalous read pairs

  -b, --bam-list FILE     list of input BAM filenames, one per line

  -B, --no-BAQ            disable BAQ (per-Base Alignment Quality)

  -C, --adjust-MQ INT     adjust mapping quality; recommended:50, disable:0 [0]

  -d, --max-depth INT     max per-file depth; avoids excessive memory usage [250]

  -E, --redo-BAQ          recalculate BAQ on the fly, ignore existing BQs

  -f, --fasta-ref FILE    faidx indexed reference sequence file

      --no-reference      do not require fasta reference file

  -G, --read-groups FILE  select or exclude read groups listed in the file

  -q, --min-MQ INT        skip alignments with mapQ smaller than INT [0]

  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]

  -r, --regions REG[,...] comma separated list of regions in which pileup is generated

  -R, --regions-file FILE restrict to regions listed in a file

      --ignore-RG         ignore RG tags (one BAM = one sample)

  --rf, --incl-flags STR|INT  required flags: skip reads with mask bits unset

  --ff, --excl-flags STR|INT  filter flags: skip reads with mask bits set

                                            [UNMAP,SECONDARY,QCFAIL,DUP]

  -s, --samples LIST      comma separated list of samples to include

  -S, --samples-file FILE file of samples to include

  -t, --targets REG[,...] similar to -r but streams rather than index-jumps

  -T, --targets-file FILE similar to -R but streams rather than index-jumps

  -x, --ignore-overlaps   disable read-pair overlap detection

 

Output options:

  -a, --annotate LIST     optional tags to output; '?' to list

  -g, --gvcf INT[,...]    group non-variant sites into gVCF blocks according

                          to minimum per-sample DP

      --no-version        do not append version and command line to the header

  -o, --output FILE       write output to FILE [standard output]

  -O, --output-type TYPE  'b' compressed BCF; 'u' uncompressed BCF;

                          'z' compressed VCF; 'v' uncompressed VCF [v]

      --threads INT       number of extra output compression threads [0]

 

SNP/INDEL genotype likelihoods options:

  -e, --ext-prob INT      Phred-scaled gap extension seq error probability [20]

  -F, --gap-frac FLOAT    minimum fraction of gapped reads [0.002]

  -h, --tandem-qual INT   coefficient for homopolymer errors [100]

  -I, --skip-indels       do not perform indel calling

  -L, --max-idepth INT    maximum per-file depth for INDEL calling [250]

  -m, --min-ireads INT    minimum number gapped reads for indel candidates [1]

  -o, --open-prob INT     Phred-scaled gap open seq error probability [40]

  -p, --per-sample-mF     apply -m and -F per-sample for increased sensitivity

  -P, --platforms STR     comma separated list of platforms for indels [all]

 

Notes: Assuming diploid individuals.

 

> bcftools norm

$ bcftools norm

 

About:   Left-align and normalize indels; check if REF alleles match the reference;

         split multiallelic sites into multiple rows; recover multiallelics from

         multiple rows.

Usage:   bcftools norm [options] <in.vcf.gz>

 

Options:

    -c, --check-ref <e|w|x|s>         check REF alleles and exit (e), warn (w), exclude (x), or set (s) bad sites [e]

    -D, --remove-duplicates           remove duplicate lines of the same type.

    -d, --rm-dup <type>               remove duplicate snps|indels|both|all|none

    -f, --fasta-ref <file>            reference sequence (MANDATORY)

    -m, --multiallelics <-|+>[type]   split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both]

        --no-version                  do not append version and command line to the header

    -N, --do-not-normalize            do not normalize indels (with -m or -c s)

    -o, --output <file>               write output to a file [standard output]

    -O, --output-type <type>          'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]

    -r, --regions <region>            restrict to comma-separated list of regions

    -R, --regions-file <file>         restrict to regions listed in a file

    -s, --strict-filter               when merging (-m+), merged site is PASS only if all sites being merged PASS

    -t, --targets <region>            similar to -r but streams rather than index-jumps

    -T, --targets-file <file>         similar to -R but streams rather than index-jumps

        --threads <int>               number of extra (de)compression threads [0]

    -w, --site-win <int>              buffer for sorting lines which changed position during realignment [1000]

> bcftools filter

$ bcftools filter

 

About:   Apply fixed-threshold filters.

Usage:   bcftools filter [options] <in.vcf.gz>

 

Options:

    -e, --exclude <expr>          exclude sites for which the expression is true (see man page for details)

    -g, --SnpGap <int>            filter SNPs within <int> base pairs of an indel

    -G, --IndelGap <int>          filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass

    -i, --include <expr>          include only sites for which the expression is true (see man page for details

    -m, --mode [+x]               "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass

        --no-version              do not append version and command line to the header

    -o, --output <file>           write output to a file [standard output]

    -O, --output-type <b|u|z|v>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]

    -r, --regions <region>        restrict to comma-separated list of regions

    -R, --regions-file <file>     restrict to regions listed in a file

    -s, --soft-filter <string>    annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")

    -S, --set-GTs <.|0>           set genotypes of failed samples to missing (.) or ref (0)

    -t, --targets <region>        similar to -r but streams rather than index-jumps

    -T, --targets-file <file>     similar to -R but streams rather than index-jumps

        --threads <int>           number of extra output compression threads [0]

bcftools index

 

About:   Index bgzip compressed VCF/BCF files for random access.

Usage:   bcftools index [options] <in.bcf>|<in.vcf.gz>

 

Indexing options:

    -c, --csi                generate CSI-format index for VCF/BCF files [default]

    -f, --force              overwrite index if it already exists

    -m, --min-shift INT      set minimal interval size for CSI indices to 2^INT [14]

    -o, --output-file FILE   optional output index file name

    -t, --tbi                generate TBI-format index for VCF files

        --threads            sets the number of threads [0]

 

Stats options:

    -n, --nrecords       print number of records based on existing index file

    -s, --stats          print per contig stats based on existing index file

 

bcftools call

$ bcftools call

 

About:   SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.

         This command replaces the former "bcftools view" caller. Some of the original

         functionality has been temporarily lost in the process of transition to htslib,

         but will be added back on popular demand. The original calling model can be

         invoked with the -c option.

Usage:   bcftools call [options] <in.vcf.gz>

 

File format options:

       --no-version                do not append version and command line to the header

   -o, --output <file>             write output to a file [standard output]

   -O, --output-type <b|u|z|v>     output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]

       --ploidy <assembly>[?]      predefined ploidy, 'list' to print available settings, append '?' for details

       --ploidy-file <file>        space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY

   -r, --regions <region>          restrict to comma-separated list of regions

   -R, --regions-file <file>       restrict to regions listed in a file

   -s, --samples <list>            list of samples to include [all samples]

   -S, --samples-file <file>       PED file or a file with an optional column with sex (see man page for details) [all samples]

   -t, --targets <region>          similar to -r but streams rather than index-jumps

   -T, --targets-file <file>       similar to -R but streams rather than index-jumps

       --threads <int>             number of extra output compression threads [0]

 

Input/output options:

   -A, --keep-alts                 keep all possible alternate alleles at variant sites

   -f, --format-fields <list>      output format fields: GQ,GP (lowercase allowed) []

   -F, --prior-freqs <AN,AC>       use prior allele frequencies

   -g, --gvcf <int>,[...]          group non-variant sites into gVCF blocks by minimum per-sample DP

   -i, --insert-missed             output also sites missed by mpileup but present in -T

   -M, --keep-masked-ref           keep sites with masked reference allele (REF=N)

   -V, --skip-variants <type>      skip indels/snps

   -v, --variants-only             output variant sites only

 

Consensus/variant calling options:

   -c, --consensus-caller          the original calling method (conflicts with -m)

   -C, --constrain <str>           one of: alleles, trio (see manual)

   -m, --multiallelic-caller       alternative model for multiallelic and rare-variant calling (conflicts with -c)

   -n, --novel-rate <float>,[...]  likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]

   -p, --pval-threshold <float>    variant if P(ref|D)<FLOAT with -c [0.5]

   -P, --prior <float>             mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]

 

 

 

実行方法

1、mapping。ここではbwaのmem、minimap2、smalt、bowtie2(sensitive setting)を試す。

#minimap2
minimap2 -t 12 -ax sr reference.fa pair_1.fq.gz pair_2.fq.gz |samtools sort -@ 12 -O BAM > minimap2.bam

#bwa mem
bwa index -a is reference.fa
bwa mem -t 12 reference.fa pair_1.fq.gz pair_2.fq.gz |samtools sort -@ 12 -O BAM > bwa.bam

#smalt (link)
smalt index smalt_index reference.fa
smalt map -n 18 smalt_index pair_1.fq.gz pair_2.fq.gz |samtools sort -@ 12 -O BAM - > smalt.bam

#bowtie2
bowtie2-build -f reference.fa bowtie2_index
bowtie2 -p 12 --sensitive -x bowtie2_index -1 pair_1.fq.gz -2 pair_2.fq.gz |samtools sort -@ 12 -O BAM -> bowtie2.bam

  

2、variant call

#variant call
bcftools mpileup -Ou -f reference.fa minimap2.bam | bcftools call -mv -Oz -o minimap2_calls.vcf.gz

bcftools mpileup -Ou -f reference.fa bwa.bam | bcftools call -mv -Oz -o bwa_calls.vcf.gz

bcftools mpileup -Ou -f reference.fa smalt.bam | bcftools call -mv -Oz -o smalt_calls.vcf.gz

bcftools mpileup -Ou -f reference.fa bowtie2.bam | bcftools call -mv -Oz -o bowtie2_calls.vcf.gz

#indexing
bcftools index minimap2_calls.vcf.gz
bcftools index bwa_calls.vcf.gz
bcftools index smalt_calls.vcf.gz
bcftools index bowtie2_calls.vcf.gz

#分岐ポイント
#freebayesを使ってhomoの変異だけ残す。例えばdepth≥4対象にする。
freebayes -C 4 -u -p 2 -f reference.fa mapped.bam > freebayes.vcf
#homo
cat freebayes.vcf | vcffilter -g "GT = 1/1" | vcffixup - | vcffilter -f "AC > 0" > freebayes_homo.vcf
#圧縮
bgzip freebayes_homo.vcf

bcftools mpileup

  • -O, --output-type <TYPE>   'b' compressed BCF; 'u' uncompressed BCF;
    'z' compressed VCF; 'v' uncompressed VCF [v]
  • -o, --output  <FILE>   write output to FILE [standard output]
  • -f, --fasta-ref <FILE>   faidx indexed reference sequence file

bcftools call

  • -m, --multiallelic-caller   alternative model for multiallelic and rare-variant calling (conflicts with -c)
  • -v, --variants-only   output variant sites only

 

 

 3、normalize indels

bcftools norm -f reference.fa minimap2_calls.vcf.gz -Ob -o minimap2_calls.norm.bcf

bcftools norm -f reference.fa bwa_calls.vcf.gz -Ob -o bwa_calls.norm.bcf

bcftools norm -f reference.fa smalt_calls.vcf.gz -Ob -o smalt_calls.norm.bcf

bcftools norm -f reference.fa bowtie2_calls.vcf.gz -Ob -o bowite2_calls.norm.bcf

bcftools norm -f reference.fa bowtie2_calls.vcf.gz -Ob -o bowite2_calls.norm.bcf

 

4、filter adjacent indels within 5bp

-Ozをつけて出力はvcf.gzとする

bcftools filter --IndelGap 5 minimap2_calls.norm.bcf -Oz -o minimap2_calls.norm.flt-indels.vcf.gz

bcftools filter --IndelGap 5 bwa_calls.norm.bcf -Oz -o bwa_calls.norm.flt-indels.vcf.gz

bcftools filter --IndelGap 5 smalt_calls.norm.bcf -Oz -o smalt_calls.norm.flt-indels.vcf.gz

bcftools filter --IndelGap 5 bowtie2_calls.vcf.gz -Oz -o bowtie2_calls.norm.flt-indels.vcf.gz

#indexing
bcftools index minimap2_calls.norm.flt-indels.vcf.gz
bcftools index bwa_calls.norm.flt-indels.vcf.gz
bcftools index smalt_calls.norm.flt-indels.vcf.gz
bcftools index bowtie2_calls.norm.flt-indels.vcf.gz

 

5、 consensus call

cat reference.fa | bcftools consensus minimap2_calls.norm.flt-indels.vcf.gz > minimap2_consensus.fa

cat reference.fa | bcftools consensus bwa_calls.norm.flt-indels.vcf.gz > bwa_consensus.fa

cat reference.fa | bcftools consensus smalt_calls.norm.flt-indels.vcf.gz > smalt_consensus.fa

cat reference.fa | bcftools consensus bowtie2_calls.norm.flt-indels.vcf.gz > bowtie2_consensus.fa

 

参考

http://samtools.sourceforge.net/cns0.shtml

 

samtools issue #899

How to use samtools and bcftools to get consensus sequence

https://github.com/samtools/samtools/issues/899

 

 

追記

パイルアップしてコンセンサスコール、fastq出力(注意。廃止されたコマンドもある)。

samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq

bcftools callのploidyはdefaultで2になっている。プリセットのploidyを調べるには

"bcftools call --ploidy print"を打つ。

関連


 

2021 6/2

こちらの査読前論文では、 bwa mem => samtools sort => Deepvariantでバリアントコールしてコンセンセス配列を出力することでエラーをポリッシュしたゲノムを作成しています。