2018-04-24

ヒトゲノムの統合バリアント検出パイプライン speedseq

2018 8/7　ホストからジョブを投げるようにコマンド修正

2018 8/8　realignコマンド修正

2020 4/15 コマンド記載ミス修正

2022/09/17 タイトル修正

　第2世代のDNA配列決定技術の技術的進歩により、全ゲノム配列決定（WGS）データを生成するために必要なコストと時間が削減され、これまでにない深さと範囲でヒトゲノムを調査することができるようになった。しかし、計算処理やバリアント解釈のボトルネックは、時間に敏感で大規模なプロジェクトでの技術の採用を妨げていた。 BWA（論文より　ref.1）、GATK（ref.2）、SAMtools（ref.3）、およびPicardに基づく標準パイプラインを使用して、50xのヒトゲノムのrawシーケンスデータを32スレッドサーバーで解析しバリアントコールを行うには、60〜70時間が必要になっている（論文　 Supplementary Note 1）。さらに、病原性と良性の突然変異を区別するには、患者一人につき、手動のキュレーションで行う必要があり、数時間 ~ 数日かかる労働集約的なプロセスになっている（ref.4）。

　SpeedSeqは、迅速な全ゲノムバリアント検出と解釈のために設計されたオープンソースソフトウェアスイートである。そのモジュラーアーキテクチャーとユニバーサルフォーマットは、さまざまな実験デザインや標準的な業界ソフトウェアとの互換性を実現している（論文　図1a）。 SAMBLASTER（ref.5）（紹介）による迅速な重複マーキング、外部アプリケーションのバランスの取れた並列化、非依存パイプラインコンポーネントの同時実行により、優れた処理効率を実現する。SpeedSeqは、128GBのRAMを搭載した単一の16コアサーバーで、rawの50 × WGSデータを13時間以内に優先順位を付けたSNVコール、small indelコール、および構造変異（SV）コールを行うことができる（現在のコスト： $ 10,000）、著者らの加速化された実装は、元のソフトウェア（論文　Supplementary Note 1）と比較してほとんどまたは全く差異を示さない。これは、一般的なコンピューティングリソースを使用して、現在のプラクティスに比べて最小で数倍の速度向上を示している）。（一部略）

　SpeedSeqはまた、急速な新生児診断などの臨床応用に不可欠な家族内での共同マルチサンプル変異コール、および新規の生殖系列突然変異コール（supplementary Note 3）もサポートしている（ref.8）。癌ゲノム解析は、研究および臨床環境における一般的なWGSアプリケーションであり、時間に敏感なコンポーネントとなり得る。Heterogeneous tumor-normal pairからのWGSデータセットをエミュレートするために、著者らはNA12877を「normal」サンプルとして定義し、等しい割合で11人の子供からrawデータをプールして単一の50×「tumor」サンプルを生成した。母親（NA12878）に存在するが父親（NA12877）には存在しない875,206のSNVは、変異対立遺伝子頻度（VAF）が0.05~0.5の体細胞突然変異として定義された（supplementary fig.2a）。この評価パラダイムを使用して、SpeedSeqのパフォーマンスを他の3つの主要な体細胞変異ツールMuTect（ref.9）、SomaticSniper（ref.10）、およびVarScan2（ref.11）と比較した。SpeedSeqはFDRが3.3％の腫瘍における体細胞変異体の96.6％を報告した。MuTect、SomaticSniperとVarScan 2（論文　図1d、supplementary図2b、c）に対する競争力のあるパフォーマンスを提供する。（一部略）

　SpeedSeqの結果は、コマンドラインやグラフィカルなブラウザインタフェースを使って効率的にフィルタリングするために、dbSNP、ENCODE、ClinVar、CADD、ESP、ExACなどの外部データベースからの情報でコールに注釈を付けるGEMINI variant interpretation framework（リンク）にシームレスに統合される。ユーザーは、癌の体細胞変異のdatabaseであるCOSMIC（リンク）と、Drug-Gene Interaction databaseのDGIdb（リンク）を通して、体細胞変異にすばやく優先順位をつけることができる。遺伝子に影響を及ぼす推定体細胞遺伝子融合だけでなく、遺伝子量を変更するか、または転写物を中断する構造変異体を同定することができる。

f:id:kazumaxneo:20180424152357j:plain

Githubより転載。複数サンプルの解析、tumorの体細胞変異検出などが簡単に実行できる。

7/29 dockerイメージからRunするよう修正

インストール

ubuntu18.04に導入してテストした。

dockerを使えばmacでも簡単に動くが、データが大きいとalnなどのコマンドでエラーがでることがある。linuxマシンで使うのが賢明です。

依存

g++ and the standard C and C++ development libraries (https://gcc.gnu.org/)
CMake (http://www.cmake.org/)
GNU awk and core utils
Python 2.7 (https://www.python.org/)
- numpy
- pysam 0.8.0+
- scipy
ROOT (https://root.cern.ch/) (required if running CNVnator)
Variant Effect Predictor (http://www.ensembl.org/info/docs/tools/vep/index.html) (required if annotating VCF files)

Core component

BWA (http://bio-bwa.sourceforge.net/)
FreeBayes (https://github.com/ekg/freebayes)
LUMPY (https://github.com/arq5x/lumpy-sv)
Sambamba (http://lomereiter.github.io/sambamba/)（紹介）
SAMBLASTER (https://github.com/GregoryFaust/samblaster)（紹介）
Vawk (https://github.com/cc2qe/vawk)
GNU Parallel (http://www.gnu.org/software/parallel/)
mbuffer (http://www.maier-komor.de/mbuffer.html)

Optional

CNVnator (http://sv.gersteinlab.org/)

Rootのインストールはこのあたりを参考にしてください。

https://openbook4.me/projects/134/sections/753

エラーが出るならおそらく足りないものがあります。

https://root-forum.cern.ch/t/installing-root-on-ubuntu-14-04/19460

複数ツールを動かすため、依存がかなりあります。dockerのコンテナがいくつか公開されているので、それを利用してください。自分はdockerを利用してosxに導入しました。

https://hub.docker.com/search/?isAutomated=0&isOfficial=0&page=1&pullCount=0&q=speedseq&starCount=0

ここではsowmiuさんのコンテナを使う。

docker pull sowmiu/speedseq

docker images #でコンテナIDがあるか確認。
#ここ（リンク）を参考に共有ディレクトリを指定

#ランする。fastqのディレクトリ/Volumes/10TB/human/をコンテナのhomeと共有
docker run --rm -itv /Volumes/10TB/human:/home sowmiu/speedseq

#下記では、コンテナ内ではなくホストからジョブを投げるように修正している。

speedseqのコンパイルだけなら、makeだけでいける。

本体　Github

git clone --recursive https://github.com/hall-lab/speedseq 
cd speedseq
make

> docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq -h

$ ./speedseq -h

Program: speedseq

Version: 0.1.2

Author: Colby Chiang (cc2qe@virginia.edu)

usage: speedseq <command> [options]

command: align align FASTQ files with BWA-MEM

var call SNV and indel variants with FreeBayes

somatic call somatic SNV and indel variants in a tumor/normal pair with FreeBayes

sv call SVs with LUMPY

realign realign from a coordinate sorted BAM file

options: -h show this message

> docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq align -h

$ docker run --rm -itv /Volumes/4TB-gene_new3/test:/home sowmiu/speedseq speedseq align -h

usage: speedseq align [options] <reference.fa> <in1.fq> [in2.fq]

positional args:

reference.fa

fasta file (indexed with bwa)

in1.fq paired-end fastq file. if -p flag is used then expected to be

an interleaved paired-end fastq file, and in2.fq may be omitted.

(can be gzipped)

in2.fq paired-end fastq file. (can be gzipped)

alignment options:

-o STR output prefix [in1.fq]

-R STR read group header line such as "@RG\tID:id\tSM:samplename\tLB:lib" (required)

-p first fastq file consists of interleaved paired-end sequences

-t INT threads [1]

-T DIR temp directory [./output_prefix.XXXXXXXXXXXX]

-I FLOAT[,FLOAT[,INT[,INT]]]

specify the mean, standard deviation (10% of the mean if absent), max

(4 sigma from the mean if absent) and min of the insert size distribution.

FR orientation only. [inferred]

samblaster options:

-i include duplicates in splitters and discordants

-c INT maximum number of split alignments for a read to be included in splitter file [2]

-m INT minimum non-overlapping base pairs between two alignments for a read to be included in splitter file [20]

sambamba options:

-M amount of memory in GB to be used for sorting [20]

global options:

-K FILE path to speedseq.config file (default: same directory as speedseq)

-v verbose

-h show this message

——

> docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq var -h

usage: speedseq var [options] <reference.fa> <input1.bam> [input2.bam [...]]

positional args:

reference.fa

genome reference fasta file

input.bam

BAM file(s) to call variants on. Must have readgroup information,

and the SM readgroup tags will be the VCF column header

options:

-o STR output prefix [input1.bam]

-w FILE BED file of windowed genomic intervals

-q FLOAT minimum variant QUAL score to output [1]

-t INT threads [1]

-T DIR temp directory [./output_prefix.XXXXXXXXXXXX]

-A annotate the vcf with VEP

-a VEP assembly to use [GRCh37]

-K FILE path to speedseq.config file (default: same directory as speedseq)

-v verbose

-k keep temporary files

-h show this message

——

> docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq somatic -h

usage: speedseq somatic [options] <reference.fa> <normal.bam> <tumor.bam>

positional args:

reference.fa

genome reference fasta file

normal.bam

germline BAM file(s) (comma separated BAMs from multiple libraries).

Must have readgroup information, and the SM readgroup tag will

be the VCF column header

tumor.bam

tumor BAM file(s) (comma separated BAMs for multiple libraries).

Must have readgroup information, and the SM readgroup tag will

be the VCF column header

options:

-o STR output prefix [tumor.bam]

-w FILE BED file of windowed genomic intervals

-t INT threads [1]

-F FLOAT Require at least this fraction of observations supporting

an alternate allele within a single individual in order

to evaluate the position [0.05]

-C INT Require at least this count of observations supporting

an alternate allele within a single individual in order

to evaluate the position [2]

-S FLOAT minimum somatic score (SSC) for PASS [18]

-q FLOAT minimum QUAL score to output non-passing somatic variants [1e-5]

-T DIR temp directory [./output_prefix.XXXXXXXXXXXX]

-A annotate the vcf with VEP

-a VEP assembly to use [GRCh37]

-K FILE path to speedseq.config file (default: same directory as speedseq)

-v verbose

-k keep tempory files

-h show this message

——

> docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq sv -h

usage: speedseq sv [options]

sv options:

-B FILE full BAM file(s) (comma separated) (required)

-S FILE split reads BAM file(s) (comma separated) (required)

-D FILE discordant reads BAM files(s) (comma separated) (required)

-R FILE indexed reference genome fasta file (required)

-o STR output prefix [fullBam.bam]

-t INT threads [1]

-x FILE BED file to exclude

-g genotype SV breakends with svtyper

-d calculate read-depth with CNVnator

-w INT CNVnator window size [100]

-A annotate the vcf with VEP

-a VEP assembly to use [GRCh37]

-P output LUMPY probability curves in VCF

-m INT minimum sample weight for a call [4]

-r FLOAT trim threshold [0]

-T DIR temp directory [./output_prefix.XXXXXXXXXXXX]

-k keep temporary files

global options:

-K FILE path to speedseq.config file (default: same directory as speedseq)

-v verbose

-h show this message

——

> docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq realign -h

usage: speedseq realign [options] <reference.fa> <in1.bam> [in2.bam [...]]

positional args:

reference.fa

fasta file (indexed with bwa)

in.bam BAM file(s) (must contain read group tags)

alignment options:

-o STR output prefix [in.realign]

-I FLOAT[,FLOAT[,INT[,INT]]]

specify the mean, standard deviation (10% of the mean if absent), max

(4 sigma from the mean if absent) and min of the insert size distribution.

FR orientation only. [inferred]

-n rename reads for smaller file size

-t INT threads [1]

-T DIR temp directory [./output_prefix.XXXXXXXXXXXX]

-R STR read group header line such as "@RG\tID:id\tSM:samplename\tLB:lib"

WARNING: By default SpeedSeq will automatically parse readgroup info

from the input BAM headers. This option will supercede that info and

should therefore only be used when input BAM(s) lack readgroups or when

custom editing of readgroups is desired.

samblaster options:

-i include duplicates in splitters and discordants

-c INT maximum number of split alignments for a read to be included in splitter file [2]

-m INT minimum non-overlapping base pairs between two alignments for a read to be included in splitter file [20]

sambamba options:

-M amount of memory in GB to be used for sorting [20]

global options:

-K FILE path to speedseq.config file (default: same directory as speedseq)

-v verbose

-h show this message

——

テストラン。

cd example 
./run_speedseq

example.bam, example.discordants.bam, example.splitters.bam, example.vcf.gz, example.sv.vcf.gzができる。

ラン

全データをdockerコンテナの/dataに保存し、/dataに書き出すというルールで進めることにする。見にくかったらすみません。

NA12878のデータを使う（リンク）。

１、align

マッピングして、coordinate sortされたbamを作成する。duplicateにタグもつけられる。入力はペアエンドのfastq。

docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq \
speedseq align -o /data/NA12878 -R \
"@RG\tID:NA12878.S1\tSM:NA12878\tLB:lib1" -t 20 -v \
/data/human_g1k_v37.fasta /data/R1.fq /data/R2.fq

-o output prefix [default: in1.fq]
-R read group header line such as "@RG\tID:id\tSM:samplename\tLB:lib" (required)
-t threads [1]
-v verbose

interleaveのfastqを指定する時は-pをつける。

/data/に３つのbamが出力される。上ではprefixをNA12878にしている。以降もこのprefixに従って書いている。

NA12878.bam: dupllicateのタグがつき、sortされたアライメントファイル。このbamがspeedseq var、speedseq somatic、speedseq svの入力となる。

NA12878.splitters.bam; bwa memでsplitされたリードのbam。

NA12878.discordants.bam; splitter.bamと同じ手順でdiscordant read pair（strand orientation, intrachromosomal distance, or interchromosomal mapping）に定義されたリードのbam。

２、var

freebayesによるバリアントコール。SNVとsmall indelがコールされる。alignで作成したbamを使用する。例えば３サンプルのbamを解析する。

docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq \
speedseq var -t 20 \
 -o /data/variant \
 -w /data/ceph18.b37.include.2014-01-15.bed \
 /data/human_g1k_v37.fasta \
 /data/sample1.bam /data/sample2.bam /data/sample3.bam

-o output prefix [default: input1.bam]
-q minimum variant QUAL score to output [1]
-t number of threads to use [default: 1]
-A annotate the vcf with VEP
-v verbose
-w BED file of windowed genomic intervals. For human genomes, we recommend using the annotations/ceph18.b37.include.2014-01-15.bed (see Annotations)

/data/にvariant.vcf.gz が出力される。 VCFフォーマットに従い、複数サンプルの変異が1ファイルに全てまとめられる。

３、somatic（somatic/tumor calling）

freebayesによるsomaticとtumorの２サンプルを比較によるtumorバリアントのコール。5%を下限の検出閾値とする。

docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq \
speedseq somatic -o /data/somatic -F 0.05 -q 1 /data/human_g1k_v37.fasta \
/data/normal.bam tumor.bam

-t number of threads to use [default: 1]
-F require at least this fraction of observations supporting an alternate allele within a single individual in order to evaluate the position [0.05]
-q minimum QUAL score to output non-passing somatic variants [1e-5]

/data/にsomatic.vcf.gz が出力される。

４、sv

LUMPYを使ってSVを検出する。alignコマンドで作った３つのbamを指定する。

docker run --rm -itv /Volumes/10TB/human:/data sowmiu/speedseq speedseq \
speedseq sv -o /data/NA12878 -x /data/ceph18.b37.lumpy.exclude.2014-01-15.bed -R /data/human_g1k_v37.fasta \
-B /data/NA12878.bam -D /data/NA12878.discordants.bam -S /data/NA12878.splitters.bam

-t INT threads [1]
-x BED file to exclude
-B full BAM file(s) (comma separated) (required) example: -B in1.bam,in2.bam,in3.bam
-D discordant reads BAM file(s) (comma separated, order same as in -B)-D FILE discordant reads BAM file(s) (comma separated, order same as in -B) (auto-generated if absent) example: -D in1.discordants.bam,in2.discordants.bam,in3.discordants.bam
-S split reads BAM file(s) (comma separated, order same as in -B) (auto-generated if absent) example: -S in1.splitters.bam,in2.splitters.bam,in3.splitters.bam
-g genotype SV breakends with svtyper
-d calculate read-depth with CNVnator

-gと-dをつけると、svtyperとCNVnatorによるsv情報も統合される。

/data/にNA12878.sv.vcf.gz が出力される。

この他、speedseq以外のツールで作ったbamなどを入力とし、duplicates markを付けspeedseqで使えるように修正する speedseq realignコマンドがある。

speedseq realign  -o /data/NA12878_realign.bam \
/data/human_g1k_v37.fasta /data/NA12878.bam

Githubでは、複数検体のfastqの解析例、tumorのバリアント検出例、Triosの解析例、複数のライブラリで読んだfastqのマージ例なども記載されています。

https://github.com/hall-lab/speedseq#installation

追記

amazon EC2にも準備されており、このモジュールを呼び出し、自分の契約した条件ですぐに解析することができるようになっています。詳細はGithubにまとめられていますが、この方法が一番簡単だと思います。現実的なコンピュータ時間でヒトゲノムのバリアント解析を行う場合、EC2だと/humanでどれくらいの費用が発生するのか気になるところです。

引用

SpeedSeq: ultra-fast personal genome analysis and interpretation.

Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM

Nat Methods. 2015 Oct;12(10):966-8.