2019 4/16 誤字修正
2020 4/15 インストール追記
2021 4/8 論文引用
ハプロタイプベースのアプローチは、生殖系列のバリアントをコールするための選択方法として浮かび上がってきた。なぜなら、これらの方法は、リードマッパーからのアライメントエラーに対して強く、ポジショナルアプローチ1-7(ref.1 Platypus paper)よりも優れたシグナル - ノイズ特性(S/N)を有するからである。しかしながら、既存のハプロタイプベースのバリアントコーラーにはいくつかの限界がある。第1に、既存のツールは、二倍体(ref.1-3)または一定のコピーナンバー(ref.4-6)のいずれかを仮定するモデルを実装する、ほとんどの場合は多くの問題に対して準最適であり、理想的な無関係な個体集団から標本が選択されると仮定する。このようなモデルは、スモールコホートの生殖系列バリアントをコールするのに適しているが、他の実験デザインで生成されたデータにはpoorにしかフィッティングしない。これには、paired tumours(tumours とnormalのペア)、シングルセル、parent-offspring trios(家族トリオ) などの既知の関連性を有する試料を含む研究、pooled tumour および試料がしばしばheterogeneous であるバクテリアのシーケンシングが含まれる。これらの制限により、研究者はさまざまなコーラーを使い、ポストhoc フィルタリングと統合を伴うカスタムパイプラインを実装して対応している(ref.8-16)。
第2に、既存のハプロタイプベースの方法は、バリアントが非重複領域内で評価される際に、ウィンドウアーチファクトを被る。これは、複雑な領域で、評価対象の領域外にあるバリアントでfalseコールを引き起こす可能性がある。第3に、既存の方法は、リードデータによって支持されたハプロタイプ配列とそれを生じさせた突然変異事象とを明確に区別しない。これは、これらのハプロタイプ配列に適切な事前確率を割り当てることを困難にする。なぜなら、同じハプロタイプ配列を生じさせていても、異なるセットの変異は非常に異なる生物学的尤度を有し得るためである、第4に、ハプロタイプベースの方法は本質的にはバリアントを物理的にphasingすることができるが、既存のツールは2倍体の遺伝子型のphasingに限定され、生殖系列バリアントまたは他の体細胞de novoバリアントに関してpoorにしかフィットしない。
増大するさまざまな実験デザインでのバリアントコールの要求を満たすために、著者らは、統一されたハプロタイプawareフレームワークで異なる遺伝子型モデルを適応させるアルゴリズムを設計した。particle filteringからインスピレーションを得て、通常、他の方法よりも長いハプロタイプを生成する新規なハプロタイプ推論手続きを開発し、アーチファクトをwindowingしてS/N比を改善し、より正確なバリアントコールを実現した。さらに、本発明者らの方法は、同一の配列になるにもかかわらず別個の突然変異イベントからなる突然変異の生物学的妥当性を考慮し、比較することができる。著者らは、事前情報とリード情報の両方を活用し、体細胞突然変異を含む任意のploidyの遺伝型phaseを可能にする確率的なphasingアルゴリズムを提案する。
我々(著者ら)はC ++で記述されたアルゴリズムOctopusを実装する。著者らは、Octopusが、いくつかの一般的な実験デザイン、すなわち個人の生殖細胞バリアントコール、親子トリオのde novoバリアントコール、tumoursの体細胞バリアントコール(paired tumoursの有り無しに関わらず)を専門とする最先端のツールよりも正確であることを示す。 Opsusはhttps://github.com/luntergroup/octopusのMITライセンス下で自由に利用できる。
Overview of the unified haplotype-based algorithm. Preprintより転載。
2018年11月現在、preprintです。
User Document
インストール
依存
- A C++14 compiler with SSE2 support
- A C++14 standard library implementation
- Git 2.5 or greater
- Boost 1.65 or greater
- htslib 1.4 or greater
- CMake 3.9 or greater
- Optional:
- Python3 or greater
本体 Github
#linuxのanaconda環境ならcondaで導入できる。(link)
conda install -y -c conda-forge -c bioconda octopus
#macの場合、Githubを参照。
#またはdocker イメージをpullするかbuildする。buildするなら
git clone https://github.com/luntergroup/octopus.git
cd octopus/
#ビルド(*1)
docker build -t octopus .
#run
docker run --rm -it octopus /tmp/octopus/bin/octopus -h
> octopus -h
$ octopus -h
octopus population calling options:
General:
-h [ --help ] Produce help message
--version Output the version number
--config arg A config file, used to populate command
line options
--debug [=arg(="octopus_debug.log")] Writes verbose debug information to
debug.log in the working directory
--trace [=arg(="octopus_trace.log")] Writes very verbose debug information
to trace.log in the working directory
--fast Turns off some features to improve
runtime, at the cost of decreased
calling accuracy. Equivalent to '-a off
-l minimal -x 50`
--very-fast The same as fast but also disables
inactive flank scoring
Backend:
-w [ --working-directory ] arg Sets the working directory
--threads [=arg(=0)] Maximum number of threads to be used,
enabling this option with no argument
lets the application decide the number
of threads ands enables specific
algorithm parallelisation
-X [ --max-reference-cache-footprint ] arg (=500MB)
Maximum memory footprint for cached
reference sequence
-B [ --target-read-buffer-footprint ] arg (=6GB)
None binding request to limit the
memory footprint of buffered read data
--max-open-read-files arg (=250) Limits the number of read files that
can be open simultaneously
--target-working-memory arg Target working memory footprint for
analysis not including read or
reference footprint
I/O:
-R [ --reference ] arg FASTA format reference genome file to
be analysed. Target regions will be
extracted from the reference index if
not provded explicitly
-I [ --reads ] arg Space-separated list of BAM/CRAM files
to be analysed. May be specified
multiple times
-i [ --reads-file ] arg Files containing lists of BAM/CRAM
files, one per line, to be analysed
--one-based-indexing Notifies that input regions are given
using one based indexing rather than
zero based
-T [ --regions ] arg Space-separated list of regions
(chrom:begin-end) to be analysed. May
be specified multiple times
-t [ --regions-file ] arg File containing a list of regions
(chrom:begin-end), one per line, to be
analysed
-K [ --skip-regions ] arg Space-separated list of regions
(chrom:begin-end) to skip May be
specified multiple times
-k [ --skip-regions-file ] arg File of regions (chrom:begin-end), one
per line, to skip
-S [ --samples ] arg Space-separated list of sample names to
analyse
-s [ --samples-file ] arg File of sample names to analyse, one
per line, which must be a subset of the
samples that appear in the read files
--ignore-unmapped-contigs Ignore any contigs that are not present
in the read files
--pedigree arg PED file containing sample pedigree
-o [ --output ] arg File to where output is written. If
unspecified, calls are written to
stdout
--contig-output-order arg (=asInReferenceIndex)
The order contigs should be written to
the output
--sites-only Only reports call sites (i.e. without
sample genotype information)
--legacy Outputs a legacy version of the final
callset in addition to the native
version
--regenotype arg VCF file specifying calls to
regenotype, only sites in this files
will appear in the final output
--bamout arg Output realigned BAM files
--split-bamout arg Output split realigned BAM files
--data-profile arg Output a profile of polymorphisms and
errors found in the data
Read transformations:
--read-transforms arg (=1) Enable all read transformations
--mask-low-quality-tails [=arg(=3)] Masks read tail bases with base quality
less than this
--mask-tails [=arg(=1)] Unconditionally mask this many read
tail sbases
--soft-clip-masking arg (=1) Turn on or off soft clip base
recalibration
--soft-clip-mask-threshold [=arg(=3)] Only soft clipped bases with quality
less than this will be recalibrated,
rather than all bases
--mask-soft-clipped-boundary-bases arg (=2)
Masks this number of adjacent non soft
clipped bases when soft clipped bases
are present
--adapter-masking arg (=1) Enable adapter detection and masking
--overlap-masking arg (=1) Enable read segment overlap masking
Read filtering:
--read-filtering arg (=1) Enable all read filters
--consider-unmapped-reads Allows reads marked as unmapped to be
used for calling
--min-mapping-quality arg (=20) Minimum read mapping quality required
to consider a read for calling
--good-base-quality arg (=20) Base quality threshold used by
min-good-bases and min-good-base-fracti
on filters
--min-good-base-fraction [=arg(=0.5)] Base quality threshold used by
min-good-bases filter
--min-good-bases arg (=20) Minimum number of bases with quality
min-base-quality before read is
considered
--allow-qc-fails Filters reads marked as QC failed
--min-read-length arg Filters reads shorter than this
--max-read-length arg Filter reads longer than this
--allow-marked-duplicates Allows reads marked as duplicate in
alignment record
--allow-octopus-duplicates Allows reads considered duplicates by
octopus
--allow-secondary-alignments Allows reads marked as secondary
alignments
--allow-supplementary-alignments Allows reads marked as supplementary
alignments
--no-reads-with-unmapped-segments Filter reads with unmapped template
segments to be used for calling
--no-reads-with-distant-segments Filter reads with template segments
that are on different contigs
--no-adapter-contaminated-reads Filter reads with possible adapter
contamination
--disable-downsampling Disables downsampling
--downsample-above arg (=1000) Downsample reads in regions where
coverage is over this
--downsample-target arg (=500) The target coverage for the downsampler
Candidate variant generation:
-g [ --raw-cigar-candidate-generator ] arg (=1)
Enable candidate generation from raw
read alignments (CIGAR strings)
--repeat-candidate-generator arg (=1) Enable candidate generation from
adjusted read alignments (CIGAR
strings) around tandem repeats
-a [ --assembly-candidate-generator ] arg (=1)
Enable candidate generation using local
re-assembly
-c [ --source-candidates ] arg Variant file paths containing known
variants. These variants will
automatically become candidates
--source-candidates-file arg Files containing lists of source
candidate variant files
--min-source-quality [=arg(=2)] Only variants with quality above this
value are considered for candidate
generation
--extract-filtered-source-candidates arg (=0)
Extract variants from source VCF
records that have been filtered
--min-base-quality arg (=20) Only bases with quality above this
value are considered for candidate
generation
--min-supporting-reads [=arg(=2)] Minimum number of reads that must
support a variant if it is to be
considered a candidate. By default
octopus will automatically determine
this value
--max-variant-size arg (=2000) Maximum candidate variant size to
consider (in region space)
--kmer-sizes arg (=10 15 20) Kmer sizes to use for local assembly
--num-fallback-kmers arg (=10) How many local assembly fallback kmer
sizes to use if the default sizes fail
--fallback-kmer-gap arg (=10) The gap size used to generate local
assembly fallback kmers
--max-region-to-assemble arg (=400) The maximum region size that can be
used for local assembly
--max-assemble-region-overlap arg (=200)
The maximum number of bases allowed to
overlap assembly regions
--assemble-all Forces all regions to be assembled
--assembler-mask-base-quality arg (=10)
Aligned bases with quality less than
this will be converted to reference
before being inserted into the De
Bruijn graph
--min-kmer-prune arg (=2) Minimum number of read observations to
keep a kmer in the assembly graph
before bubble extraction
--max-bubbles arg (=30) Maximum number of bubbles to extract
from the assembly graph
--min-bubble-score arg (=2) Minimum bubble score that will be
extracted from the assembly graph
Haplotype generation:
-x [ --max-haplotypes ] arg (=200) Maximum number of candidate haplotypes
the caller may consider. If a region
contains more candidate haplotypes than
this then filtering is applied
--haplotype-holdout-threshold arg (=2500)
Forces the haplotype generator to
temporarily hold out some alleles if
the number of haplotypes in a region
exceeds this threshold
--haplotype-overflow arg (=200000) Regions with more haplotypes than this
will be skipped
--max-holdout-depth arg (=20) Maximum number of holdout attempts the
haplotype generator can make before the
region is skipped
--extension-level arg (=normal) Level of haplotype extension. Possible
values are: conservative, normal,
optimistic, aggressive
-e [ --haplotype-extension-threshold ] arg (=100)
Haplotypes with posterior probability
less than this can be filtered before
extension
--dedup-haplotypes-with-prior-model arg (=1)
Remove duplicate haplotypes using
mutation prior model
--protect-reference-haplotype arg (=1)
Protect the reference haplotype from
filtering
Calling (general):
-C [ --caller ] arg (=population) Which of the octopus callers to use
-P [ --organism-ploidy ] arg (=2) All contigs with unspecified ploidies
are assumed the organism ploidy
-p [ --contig-ploidies ] arg (=Y=1 chrY=1 MT=1 chrM=1)
Space-separated list of contig
(contig=ploidy) or sample contig
(sample:contig=ploidy) ploidies
--contig-ploidies-file arg File containing a list of contig
(contig=ploidy) or sample contig
(sample:contig=ploidy) ploidies, one
per line
--min-variant-posterior arg (=1) Report variant alleles with posterior
probability (phred scale) greater than
this
--refcall [=arg(=blocked)] Caller will report reference confidence
calls for each position (positional),
or in automatically sized blocks
(blocked)
--min-refcall-posterior arg (=2) Report reference alleles with posterior
probability (phred scale) greater than
this
-z [ --snp-heterozygosity ] arg (=0.001)
Germline SNP heterozygosity for the
given samples
--snp-heterozygosity-stdev arg (=0.01)
Standard deviation of the germline SNP
heterozygosity used for the given
samples
-y [ --indel-heterozygosity ] arg (=0.0001)
Germline indel heterozygosity for the
given samples
--use-uniform-genotype-priors Use a uniform prior model when
calculating genotype posteriors
--max-genotypes arg (=5000) The maximum number of genotypes to
evaluate
--max-joint-genotypes arg (=1000000) The maximum number of joint genotype
vectors to consider when computing
joint genotype posterior probabilities
--use-independent-genotype-priors Use independent genotype priors for
joint calling
--model-posterior arg Calculate model posteriors for every
call
--inactive-flank-scoring arg (=1) Disables additional calculation to
adjust alignment score when there are
inactive candidates in haplotype
flanking regions
--model-mapping-quality arg (=1) Include the read mapping quality in the
haplotype likelihood calculation
--sequence-error-model arg (=HiSeq) The sequencer error model to use (HiSeq
or xTen)
--max-vb-seeds arg (=12) Maximum number of seeds to use for
Variational Bayes algorithms
Phasing:
-l [ --phasing-level ] arg (=normal) Level of phasing - longer range phasing
can improve calling accuracy at the
cost of runtime speed. Possible values
are: minimal, conservative, moderate,
normal, aggressive
--min-phase-score arg (=10) Minimum phase score (phred scale)
required to report sites as phased
Variant filtering:
-f [ --call-filtering ] arg (=1) Enable all variant call filtering
--filter-expression arg (=QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1)
Boolean expression to use to filter
variant calls
--somatic-filter-expression arg (=QUAL < 2 | GQ < 20 | MQ < 30 | SMQ < 40 | SB > 0.9 | SD > 0.9 | BQ < 20 | DP < 3 | MF > 0.2 | NC > 1 | FRF > 0.5)
Boolean expression to use to filter
somatic variant calls
--denovo-filter-expression arg (=QUAL < 50 | PP < 40 | GQ < 20 | MQ < 30 | AF < 0.1 | SB > 0.95 | BQ < 20 | DP < 10 | DC > 1 | MF > 0.2 | FRF > 0.5 | MP < 30 | MQ0 > 2)
Boolean expression to use to filter
somatic variant calls
--refcall-filter-expression arg (=QUAL < 2 | GQ < 20 | MQ < 10 | DP < 10 | MF > 0.2)
Boolean expression to use to filter
homozygous reference calls
--use-calling-reads-for-filtering arg (=0)
Use the original reads used for variant
calling for filtering
--keep-unfiltered-calls Keep a copy of unfiltered calls
--training-annotations arg Outputs all calls as PASS and annotates
output VCF with the given measures or
measure set
--filter-vcf arg Filter the given Octopus VCF without
calling
--forest-file arg Trained Ranger random forest file
--somatic-forest-file arg Trained Ranger random forest file for
somatic variants
実行方法
リファレンスとbamを指定して実行する。
octopus --reference ref.fasta --reads input.bam --threads 4 > octopus.vcf
#dockerなら
docker run --rm -itv $PWD:/data octopus /tmp/octopus/bin/octopus \
--reference /data/ref.fasta --reads /data/input.bam --threads 4\
> /data/output.vcf
領域指定。"-T chr:start-end"で指定するか、bedファイルを読み込む。
octopus --referencehs37d5.fa --reads NA12878.bam -t regions.bed > output.vcf
スキップする領域を指定するには"-K chr:start-end"、または"-t regions.bed"。
Family trios
octopus --reference hs37d5.fa --reads NA12878.bam NA12891.bam NA12892.bam -M NA12892 -F NA12891 -o output.vcf
- -M maternal sample
- -F paternal sample
Normal tumour pair
octopus --reference hs37d5.fa --reads normal.bam tumour.bam --normal-sample NORMAL > output.vcf
複数tumourサンプルがある場合
octopus --reference hs37d5.fa --reads normal.bam tumourA.bam tumourB.bam --normal-sample NORMAL
家系情報がない同一集団のJoint variant calling (現バージョンではexperimental)(*2)
octopus --reference hs37d5.fa --reads NA12878.bam NA12891.bam NA12892.bam
他にも様々な解析例がまとめられています。Githubで確認して下さい。
https://github.com/luntergroup/octopus
引用
A unified haplotype-based method for accurate and comprehensive variant calling
Daniel P Cooke, David C Wedge, Gerton Lunter
bioRxiv preprint first posted online Oct. 29, 2018
doi: https://doi.org/10.1101/456103
2021/04/05
A unified haplotype-based method for accurate and comprehensive variant calling
Daniel P. Cooke, David C. Wedge & Gerton Lunter
Nature Biotechnology (2021), Published: 29 March 2021
*1
Dockerfile61行目の
RUN ./scripts/install.py --root --threads=2
↓
RUN ./scripts/install.py --threads=2
に変えてビルドした。
*2
Joint calling
Genotype Calling From Ngs Data - Per Sample Or Multi Sample?
Variation & Genotype Calling From Ngs Data - Per Sample Or Multi Sample?