macでインフォマティクス (Informatics on Mac)

A collection of notes on informatics for HTS (NGS).

The variant caller FreeBayes

2021 4/28: example added

2021 5/15: addendum


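 Haplotype-based variant detection methods offer a number of benefits over methods that operate on one position at a time. By evaluating all classes of alleles simultaneously in the same context, haplotype-based methods ensure semantic consistency among the described variants. Locally phased genotype data can reduce the computational burden of genotype imputation by shrinking the space of haplotypes that must be considered. Locally phased genotypes can also improve genotyping accuracy for rare variation, which is difficult to impute because linkage information is sparse. Similarly, they can assist in the design of genotyping assays, which can fail when undescribed variation is present at the assayed locus. Provided that sequencing errors are independent, using longer haplotypes in variant detection improves detection by raising the signal-to-noise ratio of the genotype likelihood space used in the analysis. This follows from the fact that the space of possible erroneous haplotypes expands dramatically with haplotype length, while the space of true variation remains constant: the number of true alleles at a given locus is at most the ploidy of the sample.
 Detecting haplotypes directly from alignment data poses several challenges for existing variant detection methods. As haplotype length increases, so does the number of possible alleles within the haplotype, so a method designed to detect genetic variation over haplotypes in a unified context must be able to model multiallelic loci. However, most variant detection methods estimate the probability of polymorphism at a given locus with statistical models that assume biallelic sites and uniform (typically diploid) copy number. Moreover, improper modeling of copy number impedes the accurate detection of small variants on sex chromosomes, in polyploid organisms, and at locations with known copy-number variation, where the called alleles, genotypes, and likelihoods should reflect local copy number and global ploidy.
 To make population-level inference methods applicable to haplotype detection, the authors generalize the Bayesian statistical method described by Marth et al. to allow multiallelic loci and non-uniform copy number across the samples under consideration. This model is implemented as FreeBayes.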
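
As a rough illustration of the signal-to-noise argument in the first paragraph (the space of possible erroneous haplotypes grows with length, while the number of true alleles stays bounded by the ploidy), the sketch below counts the distinct substitution-only sequences of a given length. The helper name hap_space and the chosen lengths are made up for this example, and indels would enlarge the space further.

```shell
# hap_space LENGTH: number of distinct A/C/G/T sequences of that length (4^LENGTH).
# Hypothetical helper for illustration only; it is not part of FreeBayes.
hap_space() {
    awk -v len="$1" 'BEGIN { printf "%d\n", 4 ^ len }'
}

ploidy=2   # diploid sample, which is also the FreeBayes default
for len in 1 3 10; do
    printf 'length %s: %s possible sequences, at most %s true alleles per sample\n' \
        "$len" "$(hap_space "$len")" "$ploidy"
done
```

The count on the left grows exponentially while the bound on the right does not change, which is the intuition behind preferring longer haplotypes when errors are independent.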

 

Installation

GitHub

#bioconda
mamba install -c bioconda freebayes -y

freebayes -h

$ freebayes -h
usage: freebayes [OPTION] ... [BAM FILE] ... 

Bayesian haplotype-based polymorphism discovery.

citation: Erik Garrison, Gabor Marth
          "Haplotype-based variant detection from short-read sequencing"
          arXiv:1207.3907 (http://arxiv.org/abs/1207.3907)

overview:

    To call variants from aligned short-read sequencing data, supply BAM files and
    a reference.  FreeBayes will provide VCF output on standard out describing SNPs,
    indels, and complex variants in samples in the input alignments.

    By default, FreeBayes will consider variants supported by at least 2
    observations in a single sample (-C) and also by at least 20% of the reads from
    a single sample (-F).  These settings are suitable to low to high depth
    sequencing in haploid and diploid samples, but users working with polyploid or
    pooled samples may wish to adjust them depending on the characteristics of
    their sequencing data.

    FreeBayes is capable of calling variant haplotypes shorter than a read length
    where multiple polymorphisms segregate on the same read.  The maximum distance
    between polymorphisms phased in this way is determined by the
    --max-complex-gap, which defaults to 3bp.  In practice, this can comfortably be
    set to half the read length.

    Ploidy may be set to any level (-p), but by default all samples are assumed to
    be diploid.  FreeBayes can model per-sample and per-region variation in
    copy-number (-A) using a copy-number variation map.

    FreeBayes can act as a frequency-based pooled caller and describe variants
    and haplotypes in terms of observation frequency rather than called genotypes.
    To do so, use --pooled-continuous and set input filters to a suitable level.
    Allele observation counts will be described by AO and RO fields in the VCF output.

examples:

    # call variants assuming a diploid sample
    freebayes -f ref.fa aln.bam >var.vcf

    # call variants assuming a diploid sample, providing gVCF output
    freebayes -f ref.fa --gvcf aln.bam >var.gvcf

    # require at least 5 supporting observations to consider a variant
    freebayes -f ref.fa -C 5 aln.bam >var.vcf

    # discard alignments overlapping positions where total read depth is greater than 200
    freebayes -f ref.fa -g 200 aln.bam >var.vcf

    # use a different ploidy
    freebayes -f ref.fa -p 4 aln.bam >var.vcf

    # assume a pooled sample with a known number of genome copies
    freebayes -f ref.fa -p 20 --pooled-discrete aln.bam >var.vcf

    # generate frequency-based calls for all variants passing input thresholds
    freebayes -f ref.fa -F 0.01 -C 1 --pooled-continuous aln.bam >var.vcf

    # use an input VCF (bgzipped + tabix indexed) to force calls at particular alleles
    freebayes -f ref.fa -@ in.vcf.gz aln.bam >var.vcf

    # generate long haplotype calls over known variants
    freebayes -f ref.fa --haplotype-basis-alleles in.vcf.gz \
                        --haplotype-length 50 aln.bam

    # naive variant calling: simply annotate observation counts of SNPs and indels
    freebayes -f ref.fa --haplotype-length 0 --min-alternate-count 1 \
        --min-alternate-fraction 0 --pooled-continuous --report-monomorphic >var.vcf

parameters:

   -h --help       Prints this help dialog.
   --version       Prints the release number and the git commit id.

input:

   -b --bam FILE   Add FILE to the set of BAM files to be analyzed.
   -L --bam-list FILE
                   A file containing a list of BAM files to be analyzed.
   -c --stdin      Read BAM input on stdin.
   -f --fasta-reference FILE
                   Use FILE as the reference sequence for analysis.
                   An index file (FILE.fai) will be created if none exists.
                   If neither --targets nor --region are specified, FreeBayes
                   will analyze every position in this reference.
   -t --targets FILE
                   Limit analysis to targets listed in the BED-format FILE.
   -r --region <chrom>:<start_position>-<end_position>
                   Limit analysis to the specified region, 0-base coordinates,
                   end_position not included (same as BED format).
                   Either '-' or '..' maybe used as a separator.
   -s --samples FILE
                   Limit analysis to samples listed (one per line) in the FILE.
                   By default FreeBayes will analyze all samples in its input
                   BAM files.
   --populations FILE
                   Each line of FILE should list a sample and a population which
                   it is part of.  The population-based bayesian inference model
                   will then be partitioned on the basis of the populations.
   -A --cnv-map FILE
                   Read a copy number map from the BED file FILE, which has
                   either a sample-level ploidy:
                      sample_name copy_number
                   or a region-specific format:
                      seq_name start end sample_name copy_number
                   ... for each region in each sample which does not have the
                   default copy number as set by --ploidy. These fields can be delimited
                   by space or tab.

output:

   -v --vcf FILE   Output VCF-format results to FILE. (default: stdout)
   --gvcf
                   Write gVCF output, which indicates coverage in uncalled regions.
   --gvcf-chunk NUM
                   When writing gVCF output emit a record for every NUM bases.
   -& --gvcf-dont-use-chunk BOOL 
                   When writing the gVCF output emit a record for all bases if
                   set to "true" , will also route an int to --gvcf-chunk
                   similar to --output-mode EMIT_ALL_SITES from GATK
   -@ --variant-input VCF
                   Use variants reported in VCF file as input to the algorithm.
                   Variants in this file will included in the output even if
                   there is not enough support in the data to pass input filters.
   -l --only-use-input-alleles
                   Only provide variant calls and genotype likelihoods for sites
                   and alleles which are provided in the VCF input, and provide
                   output in the VCF for all input alleles, not just those which
                   have support in the data.
   --haplotype-basis-alleles VCF
                   When specified, only variant alleles provided in this input
                   VCF will be used for the construction of complex or haplotype
                   alleles.
   --report-all-haplotype-alleles
                   At sites where genotypes are made over haplotype alleles,
                   provide information about all alleles in output, not only
                   those which are called.
   --report-monomorphic
                   Report even loci which appear to be monomorphic, and report all
                   considered alleles, even those which are not in called genotypes.
                   Loci which do not have any potential alternates have '.' for ALT.
   -P --pvar N     Report sites if the probability that there is a polymorphism
                   at the site is greater than N.  default: 0.0.  Note that post-
                   filtering is generally recommended over the use of this parameter.
   --strict-vcf
                   Generate strict VCF format (FORMAT/GQ will be an int)

population model:

   -T --theta N    The expected mutation rate or pairwise nucleotide diversity
                   among the population under analysis.  This serves as the
                   single parameter to the Ewens Sampling Formula prior model
                   default: 0.001
   -p --ploidy N   Sets the default ploidy for the analysis to N.  default: 2
   -J --pooled-discrete
                   Assume that samples result from pooled sequencing.
                   Model pooled samples using discrete genotypes across pools.
                   When using this flag, set --ploidy to the number of
                   alleles in each sample or use the --cnv-map to define
                   per-sample ploidy.
   -K --pooled-continuous
                   Output all alleles which pass input filters, regardles of
                   genotyping outcome or model.

reference allele:

   -Z --use-reference-allele
                   This flag includes the reference allele in the analysis as
                   if it is another sample from the same population.
   --reference-quality MQ,BQ
                   Assign mapping quality of MQ to the reference allele at each
                   site and base quality of BQ.  default: 100,60

allele scope:

   -n --use-best-n-alleles N
                   Evaluate only the best N SNP alleles, ranked by sum of
                   supporting quality scores.  (Set to 0 to use all; default: all)
   -E --max-complex-gap N
      --haplotype-length N
                   Allow haplotype calls with contiguous embedded matches of up
                   to this length. Set N=-1 to disable clumping. (default: 3)
   --min-repeat-size N
                   When assembling observations across repeats, require the total repeat
                   length at least this many bp.  (default: 5)
   --min-repeat-entropy N
                   To detect interrupted repeats, build across sequence until it has
                   entropy > N bits per bp. Set to 0 to turn off. (default: 1)
   --no-partial-observations
                   Exclude observations which do not fully span the dynamically-determined
                   detection window.  (default, use all observations, dividing partial
                   support across matching haplotypes when generating haplotypes.)

  These flags are meant for testing.
  They are not meant for filtering the output.
  They actually filter the input to the algorithm by throwing away alignments.
  This hurts performance by hiding information from the Bayesian model.
  Do not use them unless you can validate that they improve results!

   -I --throw-away-snp-obs     Remove SNP observations from input.
   -i --throw-away-indels-obs  Remove indel observations from input.
   -X --throw-away-mnp-obs     Remove MNP observations from input.
   -u --throw-away-complex-obs Remove complex allele observations from input.

  If you need to break apart haplotype calls to obtain one class of alleles,
  run the call with default parameters, then normalize and subset the VCF:
    freebayes ... | vcfallelicprimitives -kg >calls.vcf
  For example, this would retain only biallelic SNPs.
    <calls.vcf vcfsnps | vcfbiallelic >biallelic_snp_calls.vcf

indel realignment:

   -O --dont-left-align-indels
                   Turn off left-alignment of indels, which is enabled by default.

input filters:

   -4 --use-duplicate-reads
                   Include duplicate-marked alignments in the analysis.
                   default: exclude duplicates marked as such in alignments
   -m --min-mapping-quality Q
                   Exclude alignments from analysis if they have a mapping
                   quality less than Q.  default: 1
   -q --min-base-quality Q
                   Exclude alleles from analysis if their supporting base
                   quality is less than Q.  default: 0
   -R --min-supporting-allele-qsum Q
                   Consider any allele in which the sum of qualities of supporting
                   observations is at least Q.  default: 0
   -Y --min-supporting-mapping-qsum Q
                   Consider any allele in which and the sum of mapping qualities of
                   supporting reads is at least Q.  default: 0
   -Q --mismatch-base-quality-threshold Q
                   Count mismatches toward --read-mismatch-limit if the base
                   quality of the mismatch is >= Q.  default: 10
   -U --read-mismatch-limit N
                   Exclude reads with more than N mismatches where each mismatch
                   has base quality >= mismatch-base-quality-threshold.
                   default: ~unbounded
   -z --read-max-mismatch-fraction N
                   Exclude reads with more than N [0,1] fraction of mismatches where
                   each mismatch has base quality >= mismatch-base-quality-threshold
                   default: 1.0
   -$ --read-snp-limit N
                   Exclude reads with more than N base mismatches, ignoring gaps
                   with quality >= mismatch-base-quality-threshold.
                   default: ~unbounded
   -e --read-indel-limit N
                   Exclude reads with more than N separate gaps.
                   default: ~unbounded
   -0 --standard-filters  Use stringent input base and mapping quality filters
                   Equivalent to -m 30 -q 20 -R 0 -S 0
   -F --min-alternate-fraction N
                   Require at least this fraction of observations supporting
                   an alternate allele within a single individual in the
                   in order to evaluate the position.  default: 0.05
   -C --min-alternate-count N
                   Require at least this count of observations supporting
                   an alternate allele within a single individual in order
                   to evaluate the position.  default: 2
   -3 --min-alternate-qsum N
                   Require at least this sum of quality of observations supporting
                   an alternate allele within a single individual in order
                   to evaluate the position.  default: 0
   -G --min-alternate-total N
                   Require at least this count of observations supporting
                   an alternate allele within the total population in order
                   to use the allele in analysis.  default: 1
   --min-coverage N
                   Require at least this coverage to process a site. default: 0
   --limit-coverage N
                   Downsample per-sample coverage to this level if greater than this coverage.
                   default: no limit
   -g --skip-coverage N
                   Skip processing of alignments overlapping positions with coverage >N.
                   This filters sites above this coverage, but will also reduce data nearby.
                   default: no limit

population priors:

   -k --no-population-priors
                   Equivalent to --pooled-discrete --hwe-priors-off and removal of
                   Ewens Sampling Formula component of priors.

mappability priors:

   -w --hwe-priors-off
                   Disable estimation of the probability of the combination
                   arising under HWE given the allele frequency as estimated
                   by observation frequency.
   -V --binomial-obs-priors-off
                   Disable incorporation of prior expectations about observations.
                   Uses read placement probability, strand balance probability,
                   and read position (5'-3') probability.
   -a --allele-balance-priors-off
                   Disable use of aggregate probability of observation balance between alleles
                   as a component of the priors.

genotype likelihoods:

   --observation-bias FILE
                   Read length-dependent allele observation biases from FILE.
                   The format is [length] [alignment efficiency relative to reference]
                   where the efficiency is 1 if there is no relative observation bias.
   --base-quality-cap Q
                   Limit estimated observation quality by capping base quality at Q.
   --prob-contamination F
                   An estimate of contamination to use for all samples.  default: 10e-9
   --legacy-gls    Use legacy (polybayes equivalent) genotype likelihood calculations
   --contamination-estimates FILE
                   A file containing per-sample estimates of contamination, such as
                   those generated by VerifyBamID.  The format should be:
                       sample p(read=R|genotype=AR) p(read=A|genotype=AA)
                   Sample '*' can be used to set default contamination estimates.

algorithmic features:

   --report-genotype-likelihood-max
                   Report genotypes using the maximum-likelihood estimate provided
                   from genotype likelihoods.
   -B --genotyping-max-iterations N
                   Iterate no more than N times during genotyping step. default: 1000.
   --genotyping-max-banddepth N
                   Integrate no deeper than the Nth best genotype by likelihood when
                   genotyping. default: 6.
   -W --posterior-integration-limits N,M
                   Integrate all genotype combinations in our posterior space
                   which include no more than N samples with their Mth best
                   data likelihood. default: 1,3.
   -N --exclude-unobserved-genotypes
                   Skip sample genotypings for which the sample has no supporting reads.
   -S --genotype-variant-threshold N
                   Limit posterior integration to samples where the second-best
                   genotype likelihood is no more than log(N) from the highest
                   genotype likelihood for the sample.  default: ~unbounded
   -j --use-mapping-quality
                   Use mapping quality of alleles when calculating data likelihoods.
   -H --harmonic-indel-quality
                   Use a weighted sum of base qualities around an indel, scaled by the
                   distance from the indel.  By default use a minimum BQ in flanking sequence.
   -D --read-dependence-factor N
                   Incorporate non-independence of reads by scaling successive
                   observations by this factor during data likelihood
                   calculations.  default: 0.9
   -= --genotype-qualities
                   Calculate the marginal probability of genotypes and report as GQ in
                   each sample field in the VCF output.

debugging:

   -d --debug      Print debugging output.
   -dd             Print more verbose debugging output (requires "make DEBUG")

author:   Erik Garrison <erik.garrison@gmail.com>
version:  v1.3.2-dirty

Usage

Specify the reference genome FASTA and the BAM file obtained by aligning reads to that reference.

freebayes -f ref.fa aln.bam > var.vcf

#Call variants on only chrQ, from position 1000 to 2000:
freebayes -f ref.fa -r chrQ:1000-2000 aln.bam >var.vcf

#Require at least 5 supporting observations to consider a variant:
freebayes -f ref.fa -C 5 aln.bam > var.vcf

#Run 20 jobs in parallel with freebayes-parallel
freebayes-parallel <(fasta_generate_regions.py ref.fa.fai 100000) 20 \
-f ref.fa aln.bam > var.vcf
  • -C   Require at least this count of observations supporting an alternate allele within a single individual in order to evaluate the position. default: 2 
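
The freebayes-parallel command above takes a list of regions, and the help text states that -r regions use 0-based coordinates with the end position excluded. The sketch below shows what such a region list can look like, built from a FASTA index with awk. It is only a rough stand-in for fasta_generate_regions.py (whose output may differ in detail); the function name make_regions and the toy index contents are made up for this example.

```shell
# Toy FASTA index (.fai) columns: name, length, offset, bases per line, bytes per line.
# The two sequences and their lengths are invented for illustration.
printf 'chrQ\t250\t6\t60\t61\nchrR\t90\t268\t60\t61\n' > toy.fa.fai

# make_regions FAI SIZE: print 0-based, end-exclusive windows of at most SIZE bp.
make_regions() {
    awk -v size="$2" '{
        for (start = 0; start < $2; start += size) {
            end = start + size
            if (end > $2) end = $2          # clip the last window to the sequence length
            printf "%s:%d-%d\n", $1, start, end
        }
    }' "$1"
}

make_regions toy.fa.fai 100
```

Because the end coordinate is excluded, consecutive windows share a boundary value (0-100, 100-200) without overlapping.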

 

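Worked example

1. Map the reads with minimap2 (short-read preset; bwa mem could be used instead), then use elprep to remove PCR duplicates (if PCR was used in library preparation) and MAPQ=0 alignments while writing a coordinate-sorted BAM.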

#sample1
minimap2 -ax sr -R "@RG\tID:sample1\tLB:library\tSM:sample1\tPL:ILLUMINA" -t 12 \
genome.fasta sample1_R1.fq.gz sample1_R2.fq.gz \
| elprep filter /dev/stdin sample1.bam --mark-duplicates --remove-duplicates --filter-mapping-quality 0 --clean-sam --nr-of-threads 12 --sorting-order coordinate --filter-unmapped-reads-strict
samtools index -@ 8 sample1.bam

#sample2
minimap2 -ax sr -R "@RG\tID:sample2\tLB:library\tSM:sample2\tPL:ILLUMINA" -t 12 \
genome.fasta sample2_R1.fq.gz sample2_R2.fq.gz \
| elprep filter /dev/stdin sample2.bam --mark-duplicates --remove-duplicates --filter-mapping-quality 0 --clean-sam --nr-of-threads 12 --sorting-order coordinate --filter-unmapped-reads-strict
samtools index -@ 8 sample2.bam

#sample3
minimap2 -ax sr -R "@RG\tID:sample3\tLB:library\tSM:sample3\tPL:ILLUMINA" -t 12 \
genome.fasta sample3_R1.fq.gz sample3_R2.fq.gz \
| elprep filter /dev/stdin sample3.bam --mark-duplicates --remove-duplicates --filter-mapping-quality 0 --clean-sam --nr-of-threads 12 --sorting-order coordinate --filter-unmapped-reads-strict
samtools index -@ 8 sample3.bam

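*The sample name written to the VCF comes from the SM:sample1 field. The ID: field must also be unique for each read group, otherwise the variant caller stops with an error.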
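
Since the three mapping commands differ only in the sample name, the read-group string can be generated in a loop, which also makes it harder to reuse an ID by accident. This is a dry-run sketch: it only prints the minimap2 command for each sample. The sample names and the helper rg_string are placeholders for this example.

```shell
# Hypothetical sample names; replace with your own.
samples="sample1 sample2 sample3"

# rg_string SAMPLE: build the @RG string passed to minimap2 -R.
# ID and SM are both set to the sample name, so IDs stay unique
# as long as the sample names are unique.
rg_string() {
    printf '@RG\\tID:%s\\tLB:library\\tSM:%s\\tPL:ILLUMINA' "$1" "$1"
}

# Warn if a sample name appears twice (uniq -d prints only repeated lines).
dups=$(printf '%s\n' $samples | sort | uniq -d)
[ -z "$dups" ] || echo "duplicate sample names: $dups" >&2

for s in $samples; do
    # Dry run: print the command instead of executing it.
    printf 'minimap2 -ax sr -R "%s" -t 12 genome.fasta %s_R1.fq.gz %s_R2.fq.gz\n' \
        "$(rg_string "$s")" "$s" "$s"
done
```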

 

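2. Joint calling with freebayes. The target is chr1 only (-r chr1), ploidy is 2 (-p 2), an alternate allele needs at least 5 supporting observations (-C 5) and at least 10% of the observations in a single sample (-F 0.1), and regions of extreme coverage above 1000x are skipped (-g 1000).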

freebayes -f genome.fasta -p 2 -g 1000 -r chr1 -F 0.1 -C 5 --gvcf sample1.bam sample2.bam sample3.bam > variants.vcf

#Compress and index (bgzip replaces variants.vcf with variants.vcf.gz)
bgzip variants.vcf
tabix -p vcf variants.vcf.gz

#Apply further filtering. With vcflib (bioconda; install with "conda install -c bioconda vcflib"):
vcffilter -f "QUAL > 20" variants.vcf.gz > filtered.vcf

#VCF summary
rtg vcfstats variants.vcf.gz
rtg vcfstats filtered.vcf

#Using vt
vt peek variants.vcf.gz
vt peek filtered.vcf
  • --gvcf    Write gVCF output, which indicates coverage in uncalled regions.
  • -g    Skip processing of alignments overlapping positions with coverage >N. This filters sites above this coverage, but will also reduce data nearby. default: no limit
  • -p   N Sets the default ploidy for the analysis to N. default: 2
  • -F   Require at least this fraction of observations supporting an alternate allele within a single individual in the in order to evaluate the position. default: 0.05
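
According to the help text, -C and -F are both evaluated within a single individual: an alternate allele is considered when its supporting observation count reaches -C and its share of that sample's observations reaches -F. The sketch below is a simplified arithmetic check of that rule with the thresholds used above (-C 5, -F 0.1). It is not FreeBayes's actual implementation, and the function name passes_input_filter and the example counts are made up.

```shell
# passes_input_filter ALT_COUNT DEPTH MIN_COUNT MIN_FRACTION
# Prints "pass" when both the count and the fraction thresholds are met.
passes_input_filter() {
    awk -v ao="$1" -v dp="$2" -v c="$3" -v f="$4" 'BEGIN {
        if (dp > 0 && ao >= c && ao / dp >= f) print "pass"; else print "fail"
    }'
}

passes_input_filter 5 40 5 0.1    # 5 observations, 12.5% of 40 reads
passes_input_filter 4 20 5 0.1    # 20% of reads, but only 4 observations
passes_input_filter 6 100 5 0.1   # 6 observations, but only 6% of reads
```

Note that -C counts observations of the alternate allele, not total coverage, so a deeply covered site can still be skipped when the alternate allele is rare.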

 

2021 4/28

joint calling

freebayes -C 2 -u -p 4 -f assembly.fasta \
mapping/SRRXXXX0679.bam \
mapping/SRRXXXX0680.bam \
mapping/SRRXXXX0681.bam \
mapping/SRRXXXX0682.bam \
mapping/SRRXXXX0683.bam \
mapping/SRRXXXX0684.bam \
mapping/SRRXXXX0685.bam \
mapping/SRRXXXX0686.bam \
mapping/SRRXXXX0687.bam \
mapping/SRRXXXX0688.bam \
mapping/SRRXXXX0689.bam \
mapping/SRRXXXX0690.bam \
mapping/SRRXXXX0691.bam \
mapping/SRRXXXX0692.bam \
mapping/SRRXXXX0693.bam \
mapping/SRRXXXX0694.bam \
mapping/SRRXXXX0695.bam \
mapping/SRRXXXX0696.bam \
mapping/SRRXXXX0697.bam \
mapping/SRRXXXX0698.bam \
mapping/SRRXXXX0699.bam > joint-call.vcf

bgzip joint-call.vcf
tabix -p vcf joint-call.vcf.gz
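
With this many BAM files, the -L (--bam-list) option listed in the help output may be easier to maintain than a long command line. The sketch below only builds the list file; the file name bams.txt is an assumption, and the paths mirror the placeholder accessions used in the joint-calling command. In a real directory, something like "ls mapping/*.bam > bams.txt" would do the same job.

```shell
# Write one BAM path per line (21 files, 0679 through 0699).
: > bams.txt
for i in $(seq 679 699); do
    printf 'mapping/SRRXXXX0%s.bam\n' "$i" >> bams.txt
done
wc -l < bams.txt

# Then, not run here:
#   freebayes -C 2 -u -p 4 -f assembly.fasta -L bams.txt > joint-call.vcf
```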

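#filtering (an earlier version of this note claimed that filtering on QUAL keeps only variants shared by all samples and drops variants found in only a few samples; *that claim was wrong)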
vcffilter -f 'QUAL > 10' joint-call.vcf.gz > filterd_joint-call.vcf
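
For reference, QUAL is the sixth column of each VCF record and is a single value per site, not per sample. The sketch below applies the same kind of threshold with awk to a toy uncompressed VCF; all records are invented, and the function name qual_filter is hypothetical. It is meant to show what the filter looks at, not to replace vcffilter.

```shell
# Build a toy VCF with made-up records (tab-separated, site columns only).
printf '%s\n' '##fileformat=VCFv4.2' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf 'chr1\t100\t.\tA\tG\t35.2\t.\t.\n' >> toy.vcf
printf 'chr1\t200\t.\tC\tT\t3.1\t.\t.\n' >> toy.vcf
printf 'chr1\t300\t.\tG\tA\t.\t.\t.\n' >> toy.vcf

# qual_filter VCF MIN: keep header lines and records whose QUAL exceeds MIN.
# Records with a missing QUAL (".") are dropped.
qual_filter() {
    awk -F '\t' -v min="$2" '/^#/ { print; next } $6 != "." && $6 + 0 > min + 0' "$1"
}

qual_filter toy.vcf 10
```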

#IGV

igv -g assembly.fasta filterd_joint-call.vcf



In IGV, variants are colored by type. The colors can be changed in the preferences.


Citation

Haplotype-based variant detection from short-read sequencing
Erik Garrison and Gabor Marth

arXiv, Submitted on 17 Jul 2012 (v1), last revised 20 Jul 2012 (this version, v2)

 

References

https://www.biostars.org/p/85400/

 

Related