macでインフォマティクス (Informatics on Mac)

A collection of notes on informatics for NGS (next-generation sequencing).

Strelka2: detecting germline and somatic SNVs and small indels

 

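Strelka calls germline and somatic variants from mapped BAM files. For somatic mutation detection, it is reported to give good results down to a tumor purity of roughly 5-10%. By default it also detects indels of size 49 or less. It builds a haplotype model that takes into account parameters such as the indel error rate estimated from the sequencing data of the input sample (wiki).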

 

Manual

https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md

 

Installation

GitHub

https://github.com/Illumina/strelka/blob/master/docs/userGuide/installation.md

The prebuilt binary strelka-2.9.0.centos6_x86_64.tar.bz2 can be downloaded from the releases page.

 

tar xjf strelka-2.9.0.centos6_x86_64.tar.bz2
cd strelka-2.9.0.centos6_x86_64/bin/
# check that it runs

 

python configureStrelkaSomaticWorkflow.py

$ python configureStrelkaSomaticWorkflow.py
Usage: configureStrelkaSomaticWorkflow.py [options]

Version: 2.9.0

This script configures Strelka somatic small variant calling.
You must specify an alignment file (BAM or CRAM) for each sample of a matched tumor-normal pair.

Configuration will produce a workflow run script which
can execute the workflow on a single node or through
sge and resume any interrupted execution.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --config=FILE         provide a configuration file to override defaults in
                        global config file (/home/uesaka/strelka-2.9.0.centos6
                        _x86_64/bin/configureStrelkaSomaticWorkflow.py.ini)
  --allHelp             show all extended/hidden options

  Workflow options:
    --normalBam=FILE    Normal sample BAM or CRAM file. (no default)
    --tumorBam=FILE, --tumourBam=FILE
                        Tumor sample BAM or CRAM file. [required] (no default)
    --outputCallableRegions
                        Output a bed file describing somatic callable regions
                        of the genome
    --referenceFasta=FILE
                        samtools-indexed reference fasta file [required]
    --indelCandidates=FILE
                        Specify a VCF of candidate indel alleles. These
                        alleles are always evaluated but only reported in the
                        output when they are inferred to exist in the sample.
                        The VCF must be tabix indexed. All indel alleles must
                        be left-shifted/normalized, any unnormalized alleles
                        will be ignored. This option may be specified more
                        than once, multiple input VCFs will be merged.
                        (default: None)
    --forcedGT=FILE     Specify a VCF of candidate alleles. These alleles are
                        always evaluated and reported even if they are
                        unlikely to exist in the sample. The VCF must be tabix
                        indexed. All indel alleles must be left-
                        shifted/normalized, any unnormalized allele will
                        trigger a runtime error. This option may be specified
                        more than once, multiple input VCFs will be merged.
                        Note that for any SNVs provided in the VCF, the SNV
                        site will be reported (and for gVCF, excluded from
                        block compression), but the specific SNV alleles are
                        ignored. (default: None)
    --exome, --targeted
                        Set options for exome or other targeted input: note in
                        particular that this flag turns off high-depth filters
    --callRegions=FILE  Optionally provide a bgzip-compressed/tabix-indexed
                        BED file containing the set of regions to call. No VCF
                        output will be provided outside of these regions. The
                        full genome will still be used to estimate statistics
                        from the input (such as expected depth per
                        chromosome). Only one BED file may be specified.
                        (default: call the entire genome)
    --runDir=DIR        Name of directory to be created where all workflow
                        scripts and output will be written. Each analysis
                        requires a separate directory. (default:
                        StrelkaSomaticWorkflow)
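
The help text above asks for a samtools-indexed reference, and the alignment files are normally expected to be sorted and indexed as well. The following is a minimal pre-flight sketch in shell that checks for those index files before configuring a run. The function name check_strelka_inputs is hypothetical, and the index file names it looks for (ref.fa.fai, sample.bam.bai or sample.bai, sample.cram.crai) are assumed from common samtools conventions, not taken from Strelka itself.

```shell
# Sketch: verify that index files exist next to the reference and alignments.
# check_strelka_inputs is a hypothetical helper, not part of Strelka.
check_strelka_inputs() {
  # Usage: check_strelka_inputs REF.fa ALIGNMENT [ALIGNMENT...]
  local ref="$1" missing=0 aln
  shift
  # samtools faidx writes REF.fa.fai next to the FASTA file.
  if [ ! -f "$ref.fai" ]; then
    echo "missing: $ref.fai (try: samtools faidx $ref)"
    missing=1
  fi
  for aln in "$@"; do
    case "$aln" in
      *.bam)
        # Accept either sample.bam.bai or sample.bai (assumed naming).
        if [ ! -f "$aln.bai" ] && [ ! -f "${aln%.bam}.bai" ]; then
          echo "missing index for $aln (try: samtools index $aln)"
          missing=1
        fi
        ;;
      *.cram)
        if [ ! -f "$aln.crai" ]; then
          echo "missing index for $aln (try: samtools index $aln)"
          missing=1
        fi
        ;;
      *)
        echo "unrecognized alignment file: $aln"
        missing=1
        ;;
    esac
  done
  # Return 0 only when every expected index file was found.
  return "$missing"
}
```

For example, check_strelka_inputs hg19.fa HCC1187BL.bam HCC1187C.bam prints one line per missing index and returns a non-zero status if anything is missing.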

 

 

> python configureStrelkaGermlineWorkflow.py

$ python configureStrelkaGermlineWorkflow.py
Usage: configureStrelkaGermlineWorkflow.py [options]

Version: 2.9.0

This script configures Strelka germline small variant calling.
You must specify an alignment file (BAM or CRAM) for at least one sample.

Configuration will produce a workflow run script which
can execute the workflow on a single node or through
sge and resume any interrupted execution.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --config=FILE         provide a configuration file to override defaults in
                        global config file (/home/uesaka/strelka-2.9.0.centos6
                        _x86_64/bin/configureStrelkaGermlineWorkflow.py.ini)
  --allHelp             show all extended/hidden options

  Workflow options:
    --bam=FILE          Sample BAM or CRAM file. May be specified more than
                        once, multiple inputs will be treated as each BAM file
                        representing a different sample. [required] (no
                        default)
    --ploidy=FILE       Provide ploidy file in VCF. The VCF should include one
                        sample column per input sample labeled with the same
                        sample names found in the input BAM/CRAM RG header
                        sections. Ploidy should be provided in records using
                        the FORMAT/CN field, which are interpreted to span the
                        range [POS+1, INFO/END]. Any CN value besides 1 or 0
                        will be treated as 2. File must be tabix indexed. (no
                        default)
    --noCompress=FILE   Provide BED file of regions where gVCF block
                        compression is not allowed. File must be bgzip-
                        compressed/tabix-indexed. (no default)
    --callContinuousVf=CHROM
                        Call variants on CHROM without a ploidy prior
                        assumption, issuing calls with continuous variant
                        frequencies (no default)
    --rna               Set options for RNA-Seq input.
    --referenceFasta=FILE
                        samtools-indexed reference fasta file [required]
    --indelCandidates=FILE
                        Specify a VCF of candidate indel alleles. These
                        alleles are always evaluated but only reported in the
                        output when they are inferred to exist in the sample.
                        The VCF must be tabix indexed. All indel alleles must
                        be left-shifted/normalized, any unnormalized alleles
                        will be ignored. This option may be specified more
                        than once, multiple input VCFs will be merged.
                        (default: None)
    --forcedGT=FILE     Specify a VCF of candidate alleles. These alleles are
                        always evaluated and reported even if they are
                        unlikely to exist in the sample. The VCF must be tabix
                        indexed. All indel alleles must be left-
                        shifted/normalized, any unnormalized allele will
                        trigger a runtime error. This option may be specified
                        more than once, multiple input VCFs will be merged.
                        Note that for any SNVs provided in the VCF, the SNV
                        site will be reported (and for gVCF, excluded from
                        block compression), but the specific SNV alleles are
                        ignored. (default: None)
    --exome, --targeted
                        Set options for exome or other targeted input: note in
                        particular that this flag turns off high-depth filters
    --callRegions=FILE  Optionally provide a bgzip-compressed/tabix-indexed
                        BED file containing the set of regions to call. No VCF
                        output will be provided outside of these regions. The
                        full genome will still be used to estimate statistics
                        from the input (such as expected depth per
                        chromosome). Only one BED file may be specified.
                        (default: call the entire genome)
    --runDir=DIR        Name of directory to be created where all workflow
                        scripts and output will be written. Each analysis
                        requires a separate directory. (default:
                        StrelkaGermlineWorkflow)

  Extended options (hidden):
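
As an illustration of the --ploidy input described in the help text above, the following sketch writes a minimal ploidy VCF for a single male sample, marking chrX and chrY as haploid (CN=1). It is an assumption-laden example: the function name make_male_ploidy_vcf is hypothetical, the chromosome names are assumed to be chrX/chrY, chromosome lengths are read from the samtools .fai index, pseudoautosomal regions are ignored for simplicity, and the header lines are a minimal guess based on the help text. Check the user guide before relying on it.

```shell
# Sketch: emit a ploidy VCF with haploid chrX/chrY for one male sample.
# make_male_ploidy_vcf is a hypothetical helper, not part of Strelka.
make_male_ploidy_vcf() {
  # Usage: make_male_ploidy_vcf SAMPLE_NAME REF.fa.fai
  local sample="$1" fai="$2"
  printf '##fileformat=VCFv4.1\n'
  printf '##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the record">\n'
  printf '##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number">\n'
  # The sample column name must match the sample name in the BAM/CRAM RG header.
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t%s\n' "$sample"
  # Records span [POS+1, END], so POS=0 with END=<length> covers the whole chromosome.
  awk -v OFS='\t' '$1 == "chrX" || $1 == "chrY" {
    print $1, 0, ".", "N", "<CNV>", ".", "PASS", "END=" $2, "CN", 1
  }' "$fai"
}
```

Since the help text requires the file to be tabix indexed, the output would typically be compressed and indexed afterwards, for example with make_male_ploidy_vcf NA12877 hg19.fa.fai | bgzip -c > ploidy.vcf.gz followed by tabix -p vcf ploidy.vcf.gz (the sample name NA12877 is only a placeholder).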

 

 

Run the somatic demo as a test.

cd strelka-2.9.0.centos6_x86_64/bin/
bash runStrelkaSomaticWorkflowDemo.bash

If the run finishes normally, strelkaSomaticDemoAnalysis/results/variants/ is created in the current directory, containing vcf.gz files for the SNVs and the indels.
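
For a quick summary of such a result file, the sketch below counts the records whose FILTER column is PASS. The function name count_pass_variants is hypothetical; gzip -dc is used because the output files are compressed.

```shell
# Sketch: count PASS records in a compressed VCF.
# count_pass_variants is a hypothetical helper, not part of Strelka.
count_pass_variants() {
  # Usage: count_pass_variants FILE.vcf.gz
  # Decompress, skip header lines (starting with '#'), and count rows
  # whose 7th column (FILTER) is exactly PASS.
  gzip -dc "$1" | awk -F'\t' '!/^#/ && $7 == "PASS" { n++ } END { print n + 0 }'
}
```

For the demo this could be called as count_pass_variants strelkaSomaticDemoAnalysis/results/variants/somatic.snvs.vcf.gz.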

> gzip -dc strelkaSomaticDemoAnalysis/results/variants/somatic.snvs.vcf.gz


The germline demo can be run in the same way with bash runStrelkaGermlineWorkflowDemo.bash.

 

Running

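Somatic variant detection. For better indel calling, the candidate indel file written by Manta (an SV caller), candidateSmallIndels.vcf.gz, can be passed with --indelCandidates; Strelka then evaluates those candidates in addition to its own. The command below only configures the workflow; the analysis itself starts when the runWorkflow.py script generated in the run directory is executed.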

./configureStrelkaSomaticWorkflow.py --normalBam HCC1187BL.bam --tumorBam HCC1187C.bam --referenceFasta hg19.fa --indelCandidates ${MANTA_ANALYSIS_PATH}/results/variants/candidateSmallIndels.vcf.gz --runDir outout
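
The configure command writes a run script into the run directory, and the analysis is then started by executing that script (runWorkflow.py, with -m local -j N for a single node, as described in the user guide). The two steps can be sketched as one shell function. STRELKA_BIN, DRY_RUN and JOBS are hypothetical variables introduced for this example, and run_cmd and strelka_somatic are hypothetical helper names; with DRY_RUN=1 the commands are only printed, so the sketch can be inspected without Strelka installed.

```shell
# Sketch: configure a somatic run, then execute the generated workflow script.
# STRELKA_BIN, DRY_RUN and JOBS are hypothetical variables for this example.
run_cmd() {
  # With DRY_RUN=1, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

strelka_somatic() {
  # Usage: strelka_somatic NORMAL.bam TUMOR.bam REF.fa RUN_DIR [MANTA_DIR]
  local normal="$1" tumor="$2" ref="$3" run_dir="$4" manta_dir="${5:-}"
  # Rebuild the positional parameters as the argument list for the configure script.
  set -- --normalBam "$normal" --tumorBam "$tumor" \
         --referenceFasta "$ref" --runDir "$run_dir"
  if [ -n "$manta_dir" ]; then
    # Optional: pass Manta's candidate indels to Strelka.
    set -- "$@" --indelCandidates "$manta_dir/results/variants/candidateSmallIndels.vcf.gz"
  fi
  # Step 1: configuration writes runWorkflow.py into the run directory.
  run_cmd "${STRELKA_BIN:-.}/configureStrelkaSomaticWorkflow.py" "$@" || return 1
  # Step 2: run the workflow on the local machine with JOBS parallel jobs.
  run_cmd "$run_dir/runWorkflow.py" -m local -j "${JOBS:-4}"
}
```

For example, DRY_RUN=1 followed by strelka_somatic HCC1187BL.bam HCC1187C.bam hg19.fa outout prints the two commands that would be executed.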

 

Germline runs appear to work in the same way. For details of the gVCF output, see the manual on GitHub.
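
A germline run can be sketched along the same lines, using only options listed in the configureStrelkaGermlineWorkflow.py help (--bam, which may be repeated for several samples, --referenceFasta and --runDir). STRELKA_BIN, DRY_RUN and JOBS are hypothetical variables for this example, and run_cmd and strelka_germline are hypothetical helper names; this is an illustration, not a tested pipeline.

```shell
# Sketch: configure and run a germline analysis for one or more samples.
# STRELKA_BIN, DRY_RUN and JOBS are hypothetical variables for this example.
run_cmd() {
  # With DRY_RUN=1, print the command instead of executing it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

strelka_germline() {
  # Usage: strelka_germline REF.fa RUN_DIR SAMPLE.bam [SAMPLE.bam...]
  if [ "$#" -lt 3 ]; then
    echo "usage: strelka_germline REF.fa RUN_DIR SAMPLE.bam [SAMPLE.bam...]" >&2
    return 2
  fi
  local ref="$1" run_dir="$2" n bam
  shift 2
  n=$#
  # Append "--bam FILE" for every input, then drop the original file names.
  for bam in "$@"; do
    set -- "$@" --bam "$bam"
  done
  shift "$n"
  # Step 1: configuration writes runWorkflow.py into the run directory.
  run_cmd "${STRELKA_BIN:-.}/configureStrelkaGermlineWorkflow.py" "$@" \
    --referenceFasta "$ref" --runDir "$run_dir" || return 1
  # Step 2: run the workflow on the local machine with JOBS parallel jobs.
  run_cmd "$run_dir/runWorkflow.py" -m local -j "${JOBS:-4}"
}
```

Each --bam input is treated as a separate sample, as stated in the help text.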

 

 

Reference

Strelka2: Fast and accurate variant calling for clinical sequencing applications

Sangtae Kim, Konrad Scheffler, Aaron L Halpern, Mitchell A Bekritsky, Eunho Noh, Morten Källberg, Xiaoyu Chen, Doruk Beyter, Peter Krusche, Christopher T Saunders

doi: https://doi.org/10.1101/192872