2018-07-07

Scalpel, a tool for detecting somatic and germline variants

Note: a link to the docker image is included below, but it threw an error when I tested it. Installing with conda on a Linux machine seems to be the safest option.

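Analysis of SNVs has become a standard technique for studying human genetics (from the paper) [ref.1], but insertions and deletions in DNA sequence (indels) cannot yet be detected reliably [ref.2,3]. Indels are the second most common type of variation in the human genome and the most common type of structural variant [ref.4]. Within microsatellites (simple sequence repeats, SSRs, with 1-6 bp motifs), indels change the length of the repeat motif and have been linked to more than 40 neurological diseases [ref.5]. Indels also play an important genetic role in autism: de novo indels that are likely to disrupt the encoded protein are nearly twice as abundant in affected children as in their unaffected siblings [ref.6].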

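Indel detection is difficult for several reasons: (1) reads that overlap the indel sequence are hard to align and may be aligned with multiple mismatches rather than a gap; (2) variable capture efficiency and non-uniform read distribution in exome sequencing increase the number of false positives; (3) increased error rates make detection within microsatellites very difficult; and, as shown in this study, (4) localized, near-identical repeats can produce a high rate of false positives. For these reasons, the indels detectable with the available software tools are relatively small, rarely more than a few dozen bases [ref.8].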
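Reason (1) can be made concrete with a toy example. The sketch below is only an illustration, not how any real aligner scores reads: the reference string and the 3 bp deletion are invented, and `hamming` is a helper defined here. It compares the number of mismatches when a read that carries a deletion is laid against the reference without a gap versus with the gap placed at the right position.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Toy reference and a read carrying a 3 bp deletion (both invented for illustration).
reference = "ACGTTGCAAGCTTAGGCTAACGTT"
del_start, del_len = 8, 3  # the read lacks reference[8:11] == "AGC"
read = reference[:del_start] + reference[del_start + del_len:]

# Ungapped placement: the read is compared to the reference base by base,
# so everything downstream of the deletion is shifted out of register.
ungapped_mismatches = hamming(read, reference[:len(read)])

# Gapped placement: the read is split at the deletion and its suffix is
# compared to the reference bases it actually came from.
gapped_mismatches = (
    hamming(read[:del_start], reference[:del_start])
    + hamming(read[del_start:], reference[del_start + del_len:])
)

print("ungapped mismatches:", ungapped_mismatches)
print("gapped mismatches:  ", gapped_mismatches)
```

With the gap in place the read matches perfectly, whereas the ungapped placement turns the bases after the deletion into a run of mismatches. A mapper that penalizes gap opening heavily can therefore end up preferring a mismatch-laden or clipped alignment near the ends of reads.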

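Two major paradigms are currently used for indel detection. The most common approach maps all reads to the reference genome with a read mapper (BWA, Bowtie, Novoalign, etc.), but the available algorithms are not effective at mapping across indels longer than a few bases. More advanced approaches use paired-end information and perform local realignment to detect longer variants (for example, GATK UnifiedGenotyper [ref.1] and Dindel [ref.9]), but in practice their sensitivity drops sharply for longer variants (≥20 bp). Split-read methods (for example, Pindel [ref.10] and Splitread [ref.11]) can in theory detect deletions of any size, but with the short read lengths of current sequencing technology (as of the paper's writing) their ability to detect insertions is limited. The second paradigm performs de novo whole-genome assembly and detects variants between the assembled contigs and the reference genome [ref.12,13]. Although it has the potential to detect larger mutations, in practice this paradigm requires a fine-grained, localized analysis to report homozygous and heterozygous mutations accurately. Recently, three approaches that use de novo assembly have been developed: GATK HaplotypeCaller, SOAPindel [ref.14], and Cortex [ref.15]. Another recent approach, TIGRA [ref.16], also uses local assembly, but it is tailored to detect breakpoints only and does not report the indel sequences.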

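The authors present Scalpel, a micro-assembly pipeline for detecting indels in exome-capture data (from the paper, Figure 1). By combining the power of mapping and assembly, Scalpel carefully searches the de Bruijn graph for sequence paths (contigs) that span each exon. The algorithm includes an on-the-fly repeat composition analysis of each exon together with a self-tuning k-mer strategy.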
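To give a feel for what "repeat analysis plus self-tuning k-mer" can mean, here is a minimal sketch. It is not Scalpel's implementation: the function names, the toy sequence, the step size, and the policy of raising k until no k-mer repeats in the region are all assumptions made for illustration. Nodes are k-mers, and an edge joins two k-mers that are adjacent in a read.

```python
from collections import defaultdict


def kmers(seq: str, k: int):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def has_repeat(seq: str, k: int) -> bool:
    """True if any k-mer occurs more than once in seq (an exact repeat at size k)."""
    seen = set()
    for km in kmers(seq, k):
        if km in seen:
            return True
        seen.add(km)
    return False


def build_de_bruijn(reads, k: int):
    """Map each k-mer node to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        ks = list(kmers(read, k))
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


def choose_k(region: str, k_start: int, k_max: int, step: int = 2) -> int:
    """Raise k until the region has no exactly repeated k-mer (hypothetical policy)."""
    k = k_start
    while k <= k_max:
        if not has_repeat(region, k):
            return k
        k += step
    raise ValueError("no repeat-free k-mer size found up to k_max")


# Toy region containing the short repeat "ACGT" (invented for illustration).
region = "ACGTACGTTTGCA"
k = choose_k(region, k_start=3, k_max=11)
graph = build_de_bruijn([region], k)
print("chosen k:", k)
print("branching nodes:", [n for n, nxt in graph.items() if len(nxt) > 1])
```

At a small k the repeat makes the graph branch, so more than one path is possible through the region. Once k is large enough to span the repeat, the same sequence gives a simple linear path. That is the intuition behind tuning k per region instead of fixing it globally.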

Official website

http://scalpel.sourceforge.net/manual.html

Manual 1

http://scalpel.sourceforge.net/manual.html

Manual 2

https://sourceforge.net/p/scalpel/wiki/Manual/


Installation

Tested on Ubuntu 18.04 with Anaconda2.4.2.

Github

# In an Anaconda environment, use conda (Linux only)
conda install -c bioconda scalpel

# A docker image is also provided.
docker pull hanfang/scalpel:0.5.3

After looking up the image ID with docker images:

> scalpel-discovery -h

$ scalpel-discovery -h

Local date and time: Sat Jul 7 10:13:11 2018

Program: scalpel-discovery (micro-assembly variant detection)

Version: 0.5.3 (beta), January 25 2016

Contact: Giuseppe Narzisi <gnarzisi@nygenome.org>

usage: scalpel-discovery <COMMAND> [OPTIONS]

COMMAND:

--help : this (help) message

--verbose : verbose mode

--single : single exome study

--denovo : family study (mom,dad,affected,sibling)

--somatic : normal/tumor study

> scalpel-discovery --single

$ scalpel-discovery --single

Local date and time: Sat Jul 7 10:14:20 2018

Program: scalpel-discovery (micro-assembly variant detection)

Version: 0.5.3 (beta), January 25 2016

Contact: Giuseppe Narzisi <gnarzisi@nygenome.org>

usage: scalpel-discovery --single --bam <BAM file> --bed <BED file> --ref <FASTA file> [OPTIONS]

Detect indels in one single dataset (e.g., one individual).

OPTIONS:

--help : this (help) message

--verbose : verbose mode

Required:

--bam <BAM file> : BAM file with the reference-aligned reads

--bed <BED file> : file with list of regions (BED format) in sorted order or single region in format chr:start-end (example: 1:31656613-31656883)

--ref <FASTA file> : reference genome in FASTA format (same one that was used to create the BAM file)

Optional:

--kmer <int> : k-mer size [default 25]

--covthr <int> : threshold used to select source and sink [default 5]

--lowcov <int> : threshold used to remove low-coverage nodes [default 2]

--covratio <float> : minimum coverage ratio for sequencing errors (default: 0.01)

--radius <int> : left and right extension (in base-pairs) [default 100]

--window <int> : window-size of the region to assemble (in base-pairs) [default 400]

--maxregcov <int> : maximum average coverage allowed per region [default 10000]

--step <int> : delta shift for the sliding window (in base-pairs) [default 100]

--mapscore <int> : minimum mapping quality for selecting reads to assemble [default 1]

--pathlimit <int> : limit number of sequence paths to [default 1000000]

--mismatches <int> : max number of mismatches in near-perfect repeat detection [default 3]

--dir <directory> : output directory [default ./outdir]

--numprocs <int> : number of parallel jobs (1 for no parallelization) [default 1]

--sample <string> : only process reads/fragments in sample [default ALL]

--coords <file> : file with list of selected locations to examine [default null]

Output:

--format : export mutations in selected format (annovar | vcf) [default vcf]

--intarget : export mutations only inside the target regions from the BED file

--logs : keep log files

Note 1: the list of detected indels is saved in file: OUTDIR/variants.indel.*

where OUTDIR is the output directory selected with option "--dir" [default ./outdir]

Note 2: use the export tool (scalpel-export) to export mutations using different filtering criteria
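The usage line and options above can be assembled into a call programmatically. The following is a minimal sketch, assuming the flags behave as the help text describes. The helper name `build_single_command` and the file names `sample.bam` and `genome.fa` are placeholders, not part of Scalpel. It only builds and prints the command, so it runs without Scalpel installed.

```python
import shlex


def build_single_command(bam, bed, ref, outdir="./outdir", numprocs=1, kmer=None):
    """Return the argv list for a `scalpel-discovery --single` run.

    Only flags listed in the help output above are used.
    """
    argv = [
        "scalpel-discovery", "--single",
        "--bam", bam,
        "--bed", bed,  # BED file, or a single region such as 1:31656613-31656883
        "--ref", ref,  # same FASTA that was used to create the BAM file
        "--dir", outdir,
        "--numprocs", str(numprocs),
    ]
    if kmer is not None:
        argv += ["--kmer", str(kmer)]  # omit to keep the tool's default (25)
    return argv


# Placeholder inputs; replace with real paths.
cmd = build_single_command(
    "sample.bam", "1:31656613-31656883", "genome.fa", numprocs=4
)
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces. According to Note 1 in the help text, the detected indels are written to OUTDIR/variants.indel.*.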

> scalpel-discovery --somatic

$ scalpel-discovery --somatic

Local date and time: Sat Jul 7 10:14:53 2018

Program: scalpel-discovery (micro-assembly variant detection)

Version: 0.5.3 (beta), January 25 2016

Contact: Giuseppe Narzisi <gnarzisi@nygenome.org>

usage: scalpel-discovery --somatic --normal <BAM file> --tumor <BAM file> --bed <BED file> --ref <FASTA file> [OPTIONS]

Detect somatic indels in a tumor/normal pair

OPTIONS:

--help : this (help) message

--verbose : verbose mode

Required:

--normal <BAM file> : normal BAM file

--tumor <BAM file> : tumor BAM file