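Strelka2: detecting germline and somatic SNVs and small indels

2019 5/30: added installation notes

Strelka calls germline and somatic variants from mapped BAM files. For somatic mutation detection, it is reported to give good results down to about 5-10% tumor purity. By default it also detects indels of 49 bases or fewer. It builds a haplotype model that takes into account, among other things, the indel error rates estimated from the input sample's sequencing data (wiki).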

Manual

https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md

Installation

GitHub

https://github.com/Illumina/strelka/blob/master/docs/userGuide/installation.md

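The pre-built strelka-2.9.0.centos6_x86_64.tar.bz2 can be downloaded from the releases page.

tar xjf strelka-2.9.0.centos6_x86_64.tar.bz2
cd strelka-2.9.0.centos6_x86_64/bin/
 #check that it works (help output below)

#bioconda (link)
conda install -c bioconda -y strelka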

$ python configureStrelkaSomaticWorkflow.py
Usage: configureStrelkaSomaticWorkflow.py [options]

Version: 2.9.0

This script configures Strelka somatic small variant calling.
You must specify an alignment file (BAM or CRAM) for each sample of a matched tumor-normal pair.

Configuration will produce a workflow run script which
can execute the workflow on a single node or through
sge and resume any interrupted execution.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --config=FILE         provide a configuration file to override defaults in
                        global config file (/home/uesaka/strelka-2.9.0.centos6
                        _x86_64/bin/configureStrelkaSomaticWorkflow.py.ini)
  --allHelp             show all extended/hidden options

  Workflow options:
    --normalBam=FILE    Normal sample BAM or CRAM file. (no default)
    --tumorBam=FILE, --tumourBam=FILE
                        Tumor sample BAM or CRAM file. [required] (no default)
    --outputCallableRegions
                        Output a bed file describing somatic callable regions
                        of the genome
    --referenceFasta=FILE
                        samtools-indexed reference fasta file [required]
    --indelCandidates=FILE
                        Specify a VCF of candidate indel alleles. These
                        alleles are always evaluated but only reported in the
                        output when they are inferred to exist in the sample.
                        The VCF must be tabix indexed. All indel alleles must
                        be left-shifted/normalized, any unnormalized alleles
                        will be ignored. This option may be specified more
                        than once, multiple input VCFs will be merged.
                        (default: None)
    --forcedGT=FILE     Specify a VCF of candidate alleles. These alleles are
                        always evaluated and reported even if they are
                        unlikely to exist in the sample. The VCF must be tabix
                        indexed. All indel alleles must be left-
                        shifted/normalized, any unnormalized allele will
                        trigger a runtime error. This option may be specified
                        more than once, multiple input VCFs will be merged.
                        Note that for any SNVs provided in the VCF, the SNV
                        site will be reported (and for gVCF, excluded from
                        block compression), but the specific SNV alleles are
                        ignored. (default: None)
    --exome, --targeted
                        Set options for exome or other targeted input: note in
                        particular that this flag turns off high-depth filters
    --callRegions=FILE  Optionally provide a bgzip-compressed/tabix-indexed
                        BED file containing the set of regions to call. No VCF
                        output will be provided outside of these regions. The
                        full genome will still be used to estimate statistics
                        from the input (such as expected depth per
                        chromosome). Only one BED file may be specified.
                        (default: call the entire genome)
    --runDir=DIR        Name of directory to be created where all workflow
                        scripts and output will be written. Each analysis
                        requires a separate directory. (default:
                        StrelkaSomaticWorkflow)
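The help text says the BED file given to --callRegions must be bgzip-compressed and tabix-indexed. A minimal sketch of preparing one is shown here, assuming bgzip and tabix (from htslib) are on the PATH. The function names and the file names regions.bed / regions.bed.gz are placeholders made up for this example, not part of Strelka.

```shell
# Sort a BED file by chromosome name, then numerically by start position.
# tabix expects position-sorted input.
sort_bed() {
    sort -k1,1 -k2,2n "$1"
}

# Hypothetical helper: sort, bgzip-compress, and tabix-index a BED file.
# Assumes bgzip and tabix are installed; $1 = input BED, $2 = output .bed.gz
prepare_call_regions() {
    sort_bed "$1" | bgzip -c > "$2" || return 1
    tabix -p bed "$2"
}

# Usage (placeholder file names):
#   prepare_call_regions regions.bed regions.bed.gz
#   configureStrelkaSomaticWorkflow.py ... --callRegions regions.bed.gz
```

The sorting is split into its own function so it can be checked without htslib installed.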

$ python configureStrelkaGermlineWorkflow.py
Usage: configureStrelkaGermlineWorkflow.py [options]

Version: 2.9.0

This script configures Strelka germline small variant calling.
You must specify an alignment file (BAM or CRAM) for at least one sample.

Configuration will produce a workflow run script which
can execute the workflow on a single node or through
sge and resume any interrupted execution.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --config=FILE         provide a configuration file to override defaults in
                        global config file (/home/uesaka/strelka-2.9.0.centos6
                        _x86_64/bin/configureStrelkaGermlineWorkflow.py.ini)
  --allHelp             show all extended/hidden options

  Workflow options:
    --bam=FILE          Sample BAM or CRAM file. May be specified more than
                        once, multiple inputs will be treated as each BAM file
                        representing a different sample. [required] (no
                        default)
    --ploidy=FILE       Provide ploidy file in VCF. The VCF should include one
                        sample column per input sample labeled with the same
                        sample names found in the input BAM/CRAM RG header
                        sections. Ploidy should be provided in records using
                        the FORMAT/CN field, which are interpreted to span the
                        range [POS+1, INFO/END]. Any CN value besides 1 or 0
                        will be treated as 2. File must be tabix indexed. (no
                        default)
    --noCompress=FILE   Provide BED file of regions where gVCF block
                        compression is not allowed. File must be bgzip-
                        compressed/tabix-indexed. (no default)
    --callContinuousVf=CHROM
                        Call variants on CHROM without a ploidy prior
                        assumption, issuing calls with continuous variant
                        frequencies (no default)
    --rna               Set options for RNA-Seq input.
    --referenceFasta=FILE
                        samtools-indexed reference fasta file [required]
    --indelCandidates=FILE
                        Specify a VCF of candidate indel alleles. These
                        alleles are always evaluated but only reported in the
                        output when they are inferred to exist in the sample.
                        The VCF must be tabix indexed. All indel alleles must
                        be left-shifted/normalized, any unnormalized alleles
                        will be ignored. This option may be specified more
                        than once, multiple input VCFs will be merged.
                        (default: None)
    --forcedGT=FILE     Specify a VCF of candidate alleles. These alleles are
                        always evaluated and reported even if they are
                        unlikely to exist in the sample. The VCF must be tabix
                        indexed. All indel alleles must be left-
                        shifted/normalized, any unnormalized allele will
                        trigger a runtime error. This option may be specified
                        more than once, multiple input VCFs will be merged.
                        Note that for any SNVs provided in the VCF, the SNV
                        site will be reported (and for gVCF, excluded from
                        block compression), but the specific SNV alleles are
                        ignored. (default: None)
    --exome, --targeted
                        Set options for exome or other targeted input: note in
                        particular that this flag turns off high-depth filters
    --callRegions=FILE  Optionally provide a bgzip-compressed/tabix-indexed
                        BED file containing the set of regions to call. No VCF
                        output will be provided outside of these regions. The
                        full genome will still be used to estimate statistics
                        from the input (such as expected depth per
                        chromosome). Only one BED file may be specified.
                        (default: call the entire genome)
    --runDir=DIR        Name of directory to be created where all workflow
                        scripts and output will be written. Each analysis
                        requires a separate directory. (default:
                        StrelkaGermlineWorkflow)

  Extended options (hidden):

Run the somatic demo as a test.

cd strelka-2.9.0.centos6_x86_64/bin/
bash runStrelkaSomaticWorkflowDemo.bash

If it finishes successfully, strelkaSomaticDemoAnalysis/results/variants/ is created in the current directory, containing vcf.gz files for the SNVs and the indels.

> gzip -dc somatic.snvs.vcf.gz | less

(Screenshot: contents of the somatic SNV VCF.)
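As a quick sanity check of the demo output, you can count how many records passed the filters. The sketch below assumes only gzip and awk. The function name count_pass is made up for this example, and it relies on the standard VCF layout in which column 7 is FILTER.

```shell
# Count VCF records whose FILTER column (column 7) is exactly PASS.
# Header lines starting with '#' are skipped. $1 = path to a .vcf.gz file
count_pass() {
    gzip -dc "$1" | awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }'
}

# Usage (path taken from the demo run above):
#   count_pass strelkaSomaticDemoAnalysis/results/variants/somatic.snvs.vcf.gz
```

bgzip-compressed files are gzip-compatible, so plain gzip -dc is enough to read them.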

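Running bash runStrelkaGermlineWorkflowDemo.bash performs the germline test run in the same way.

Running

Somatic variant detection. If you have also run manta, which detects SVs, you can pass its candidateSmallIndels.vcf.gz with --indelCandidates so that Strelka evaluates those indel candidates as well.

./configureStrelkaSomaticWorkflow.py --normalBam HCC1187BL.bam --tumorBam HCC1187C.bam --referenceFasta hg19.fa --indelCandidates ${MANTA_ANALYSIS_PATH}/results/variants/candidateSmallIndels.vcf.gz --runDir output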
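The configure command only writes a workflow script into the run directory. Variant calling starts when that script is executed. The sketch below prints both steps so they can be reviewed before running. The runWorkflow.py -m local -j N form follows the upstream user guide as I understand it, and the function name and the job count of 8 are placeholders chosen for this example.

```shell
# Print the two commands of a somatic run without executing them.
# $1 = normal BAM, $2 = tumor BAM, $3 = reference FASTA,
# $4 = run directory, $5 = number of parallel jobs (placeholder default: 8)
strelka_somatic_commands() {
    normal_bam="$1"
    tumor_bam="$2"
    ref_fasta="$3"
    run_dir="$4"
    jobs="${5:-8}"
    # Step 1: configuration creates the run directory and its workflow script.
    printf '%s\n' "configureStrelkaSomaticWorkflow.py --normalBam $normal_bam --tumorBam $tumor_bam --referenceFasta $ref_fasta --runDir $run_dir"
    # Step 2: run the generated workflow on the local machine.
    printf '%s\n' "$run_dir/runWorkflow.py -m local -j $jobs"
}

strelka_somatic_commands HCC1187BL.bam HCC1187C.bam hg19.fa output 8
```

Once the printed commands look right, they can be pasted into the terminal or piped to bash. Paths containing spaces would need quoting, which this sketch does not handle.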

The germline run appears to work the same way. For details of the gVCF output, see the manual on GitHub.

Citation
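For reference, a germline run could look like the sketch below, built only from the options in the help text above (--bam may be repeated, one per sample). As with the somatic case, it prints the commands instead of running them. The function name, sample file names, and run directory are placeholders for this example, and the runWorkflow.py -m local -j N form is taken from the upstream user guide as I understand it.

```shell
# Print the commands of a germline run without executing them.
# $1 = reference FASTA, $2 = run directory, $3 = number of parallel jobs,
# remaining arguments = one BAM/CRAM file per sample
strelka_germline_commands() {
    ref_fasta="$1"
    run_dir="$2"
    jobs="$3"
    shift 3
    cmd="configureStrelkaGermlineWorkflow.py"
    # Add one --bam option per sample.
    for bam in "$@"; do
        cmd="$cmd --bam $bam"
    done
    cmd="$cmd --referenceFasta $ref_fasta --runDir $run_dir"
    printf '%s\n' "$cmd"
    # The generated workflow script is then run on the local machine.
    printf '%s\n' "$run_dir/runWorkflow.py -m local -j $jobs"
}

strelka_germline_commands hg19.fa germline_output 8 sample1.bam sample2.bam
```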

Strelka2: Fast and accurate variant calling for clinical sequencing applications

Sangtae Kim, Konrad Scheffler, Aaron L Halpern, Mitchell A Bekritsky, Eunho Noh, Morten Källberg, Xiaoyu Chen, Doruk Beyter, Peter Krusche, Christopher T Saunders

doi: https://doi.org/10.1101/192872