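2019 5/30: added installation notes
Strelka calls germline and somatic variants from mapped BAM files. For somatic mutation detection, it is reported to give good results down to a tumor purity of about 5-10%. By default it also detects indels of size 49 or less. It builds a haplotype model that takes into account parameters estimated from the input sample's sequencing data, such as the indel error rate (wiki).
Manual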
https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md
Installation
https://github.com/Illumina/strelka/blob/master/docs/userGuide/installation.md
The pre-built strelka-2.9.0.centos6_x86_64.tar.bz2 can be downloaded from the releases page.
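As a rough sketch, the download and unpacking step could look like the following. The URL is an assumption based on the usual GitHub release-asset layout (releases/download/v<version>/<file>), and fetch_strelka is a hypothetical helper name, so check the releases page for the actual link.

```shell
# Assumption: the asset URL follows GitHub's standard release layout.
STRELKA_VERSION=2.9.0
STRELKA_TARBALL="strelka-${STRELKA_VERSION}.centos6_x86_64.tar.bz2"
STRELKA_URL="https://github.com/Illumina/strelka/releases/download/v${STRELKA_VERSION}/${STRELKA_TARBALL}"

# Hypothetical helper: download the tarball unless it is already present,
# then unpack it into the current directory.
fetch_strelka() {
  [ -f "$STRELKA_TARBALL" ] || wget "$STRELKA_URL" || return 1
  tar -xjf "$STRELKA_TARBALL"
}
```

Calling fetch_strelka should leave a strelka-2.9.0.centos6_x86_64/ directory in the current directory, which the cd commands that follow assume.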
cd strelka-2.9.0.centos6_x86_64/libexec/
cd ../bin/
# check that it runs
# bioconda (link)
conda install -c bioconda -y strelka
> python configureStrelkaSomaticWorkflow.py
Usage: configureStrelkaSomaticWorkflow.py [options]
Version: 2.9.0
This script configures Strelka somatic small variant calling.
You must specify an alignment file (BAM or CRAM) for each sample of a matched tumor-normal pair.
Configuration will produce a workflow run script which
can execute the workflow on a single node or through
sge and resume any interrupted execution.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--config=FILE provide a configuration file to override defaults in
global config file (/home/uesaka/strelka-2.9.0.centos6
_x86_64/bin/configureStrelkaSomaticWorkflow.py.ini)
--allHelp show all extended/hidden options
Workflow options:
--normalBam=FILE Normal sample BAM or CRAM file. (no default)
--tumorBam=FILE, --tumourBam=FILE
Tumor sample BAM or CRAM file. [required] (no default)
--outputCallableRegions
Output a bed file describing somatic callable regions
of the genome
--referenceFasta=FILE
samtools-indexed reference fasta file [required]
--indelCandidates=FILE
Specify a VCF of candidate indel alleles. These
alleles are always evaluated but only reported in the
output when they are inferred to exist in the sample.
The VCF must be tabix indexed. All indel alleles must
be left-shifted/normalized, any unnormalized alleles
will be ignored. This option may be specified more
than once, multiple input VCFs will be merged.
(default: None)
--forcedGT=FILE Specify a VCF of candidate alleles. These alleles are
always evaluated and reported even if they are
unlikely to exist in the sample. The VCF must be tabix
indexed. All indel alleles must be left-
shifted/normalized, any unnormalized allele will
trigger a runtime error. This option may be specified
more than once, multiple input VCFs will be merged.
Note that for any SNVs provided in the VCF, the SNV
site will be reported (and for gVCF, excluded from
block compression), but the specific SNV alleles are
ignored. (default: None)
--exome, --targeted
Set options for exome or other targeted input: note in
particular that this flag turns off high-depth filters
--callRegions=FILE Optionally provide a bgzip-compressed/tabix-indexed
BED file containing the set of regions to call. No VCF
output will be provided outside of these regions. The
full genome will still be used to estimate statistics
from the input (such as expected depth per
chromosome). Only one BED file may be specified.
(default: call the entire genome)
--runDir=DIR Name of directory to be created where all workflow
scripts and output will be written. Each analysis
requires a separate directory. (default:
StrelkaSomaticWorkflow)
> python configureStrelkaGermlineWorkflow.py
Usage: configureStrelkaGermlineWorkflow.py [options]
Version: 2.9.0
This script configures Strelka germline small variant calling.
You must specify an alignment file (BAM or CRAM) for at least one sample.
Configuration will produce a workflow run script which
can execute the workflow on a single node or through
sge and resume any interrupted execution.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--config=FILE provide a configuration file to override defaults in
global config file (/home/uesaka/strelka-2.9.0.centos6
_x86_64/bin/configureStrelkaGermlineWorkflow.py.ini)
--allHelp show all extended/hidden options
Workflow options:
--bam=FILE Sample BAM or CRAM file. May be specified more than
once, multiple inputs will be treated as each BAM file
representing a different sample. [required] (no
default)
--ploidy=FILE Provide ploidy file in VCF. The VCF should include one
sample column per input sample labeled with the same
sample names found in the input BAM/CRAM RG header
sections. Ploidy should be provided in records using
the FORMAT/CN field, which are interpreted to span the
range [POS+1, INFO/END]. Any CN value besides 1 or 0
will be treated as 2. File must be tabix indexed. (no
default)
--noCompress=FILE Provide BED file of regions where gVCF block
compression is not allowed. File must be bgzip-
compressed/tabix-indexed. (no default)
--callContinuousVf=CHROM
Call variants on CHROM without a ploidy prior
assumption, issuing calls with continuous variant
frequencies (no default)
--rna Set options for RNA-Seq input.
--referenceFasta=FILE
samtools-indexed reference fasta file [required]
--indelCandidates=FILE
Specify a VCF of candidate indel alleles. These
alleles are always evaluated but only reported in the
output when they are inferred to exist in the sample.
The VCF must be tabix indexed. All indel alleles must
be left-shifted/normalized, any unnormalized alleles
will be ignored. This option may be specified more
than once, multiple input VCFs will be merged.
(default: None)
--forcedGT=FILE Specify a VCF of candidate alleles. These alleles are
always evaluated and reported even if they are
unlikely to exist in the sample. The VCF must be tabix
indexed. All indel alleles must be left-
shifted/normalized, any unnormalized allele will
trigger a runtime error. This option may be specified
more than once, multiple input VCFs will be merged.
Note that for any SNVs provided in the VCF, the SNV
site will be reported (and for gVCF, excluded from
block compression), but the specific SNV alleles are
ignored. (default: None)
--exome, --targeted
Set options for exome or other targeted input: note in
particular that this flag turns off high-depth filters
--callRegions=FILE Optionally provide a bgzip-compressed/tabix-indexed
BED file containing the set of regions to call. No VCF
output will be provided outside of these regions. The
full genome will still be used to estimate statistics
from the input (such as expected depth per
chromosome). Only one BED file may be specified.
(default: call the entire genome)
--runDir=DIR Name of directory to be created where all workflow
scripts and output will be written. Each analysis
requires a separate directory. (default:
StrelkaGermlineWorkflow)
Extended options (hidden):
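The help text above asks for a samtools-indexed reference FASTA and, for --callRegions, a bgzip-compressed and tabix-indexed BED file. A minimal sketch of preparing both is shown here. It assumes samtools, bgzip and tabix are on the PATH and that the BED file is already coordinate-sorted. prepare_inputs is a hypothetical helper name.

```shell
# Hypothetical helper: index the inputs the configure scripts expect.
# $1: reference FASTA, $2: coordinate-sorted BED file of regions to call
prepare_inputs() {
  # Writes "$1.fai" next to the FASTA.
  samtools faidx "$1" || return 1
  # Compress to "$2.gz" while keeping the original BED.
  bgzip -c "$2" > "$2.gz" || return 1
  # Writes "$2.gz.tbi".
  tabix -p bed "$2.gz"
}

# Example (hypothetical file names):
#   prepare_inputs hg19.fa regions.bed
```

The resulting regions.bed.gz is what would be passed to --callRegions.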
Run the somatic test run.
cd strelka-2.9.0.centos6_x86_64/bin/
bash runStrelkaSomaticWorkflowDemo.bash
If it finishes normally, strelkaSomaticDemoAnalysis/results/variants/ is created in the current directory, containing the vcf.gz files for SNVs and indels.
> gzip -dc somatic.snvs.vcf.gz
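For a quick sanity check of the demo output, the records that passed all filters can be counted. This is only a sketch: count_pass is a hypothetical helper, and it relies on the standard VCF layout in which the seventh tab-separated column is FILTER.

```shell
# Hypothetical helper: count records whose FILTER column is PASS
# in a gzip-compressed VCF. Header lines (starting with #) are skipped.
count_pass() {
  gzip -dc "$1" | awk -F '\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }'
}

# Example, run from strelkaSomaticDemoAnalysis/results/variants/
# (the indel file name is assumed to be somatic.indels.vcf.gz):
#   count_pass somatic.snvs.vcf.gz
#   count_pass somatic.indels.vcf.gz
```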
Running bash runStrelkaGermlineWorkflowDemo.bash executes the germline test run as well.
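Run
Somatic variant detection. If manta (an SV caller) has been run first, its candidateSmallIndels.vcf.gz can be passed with --indelCandidates so that those candidate indels are also evaluated.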
./configureStrelkaSomaticWorkflow.py --normalBam HCC1187BL.bam --tumorBam HCC1187C.bam --referenceFasta hg19.fa --indelCandidates ${MANTA_ANALYSIS_PATH}/results/variants/candidateSmallIndels.vcf.gz --runDir outout
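The configure script only writes a run script into the run directory. The calling itself starts when that script is executed. A minimal sketch of that second step follows, assuming a single local machine. run_strelka_workflow is a hypothetical wrapper and the job count is only an example value.

```shell
# Hypothetical wrapper: run a configured Strelka workflow on the local machine.
# $1: the directory that was given to --runDir
# $2: number of parallel jobs
run_strelka_workflow() {
  "$1/runWorkflow.py" -m local -j "$2"
}

# Example, using the run directory from the configure command above:
#   run_strelka_workflow outout 8
```

As the help text notes, the same run script can also submit through sge and resume an interrupted run.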
The germline run appears to work the same way. For details of the gVCF output, check the manual on GitHub.
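A germline configuration could be sketched as follows, using only the options listed in the help output (--bam, --referenceFasta, --runDir). configure_germline and the CONFIGURE_GERMLINE override are hypothetical names introduced here for illustration, and the file names in the example are placeholders.

```shell
# Hypothetical helper: configure a single-sample germline run.
# $1: sample BAM or CRAM, $2: samtools-indexed reference FASTA, $3: run directory
# CONFIGURE_GERMLINE may point at the script if it is not on the PATH.
configure_germline() {
  "${CONFIGURE_GERMLINE:-configureStrelkaGermlineWorkflow.py}" \
    --bam "$1" --referenceFasta "$2" --runDir "$3"
}

# Example (placeholder file names):
#   configure_germline sample.bam hg19.fa germline_run
#   germline_run/runWorkflow.py -m local -j 8
```

For several samples, --bam can be repeated once per BAM, as the help text describes.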
Citation
Strelka2: Fast and accurate variant calling for clinical sequencing applications
Sangtae Kim, Konrad Scheffler, Aaron L Halpern, Mitchell A Bekritsky, Eunho Noh, Morten Källberg, Xiaoyu Chen, Doruk Beyter, Peter Krusche, Christopher T Saunders
doi: https://doi.org/10.1101/192872