Vaquita: detecting SVs using four signals - macでインフォマティクス

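Next-generation sequencing (NGS) offers a remarkable opportunity to discover genetic variants directly associated with diseases such as cancer [from the paper, ref. 13] and rare genetic disorders [ref. 2], so identifying such variants is attracting growing attention. Variants range in size from a single base pair to megabases [ref. 15]. Among them, structural variations (SVs), usually defined as variants larger than 50 nucleotides, play a major role in many phenotypic differences. In contrast to single-nucleotide polymorphisms (SNPs) and small indels, SVs are far more diverse in type and size and are often difficult to call with confidence [ref. 1]. It is therefore not surprising that variant callers show little concordance with one another [ref. 1]. Current understanding of SVs remains substantially limited, and large-scale studies run multiple variant callers to obtain the most comprehensive list of SVs. However, merging multiple outputs is often cumbersome because it requires prior knowledge of the different algorithms and their parameters. Computational resources are also limited, so better methods that detect SVs more accurately and efficiently are urgently needed.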
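Algorithms for SV identification can be grouped into four types [ref. 1]. The first uses split-read evidence: a read that spans a breakpoint is split so that its parts can be mapped to multiple loci. For example, Pindel [ref. 24] tries to find breakpoints by splitting discordant reads and mapping the pieces to different positions. In repeat-rich regions of the genome, however, it is difficult to find SVs using read information alone.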
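The split-read idea can be illustrated with a toy sketch. It is not how Pindel or Vaquita work internally: it assumes exact string matching, an error-free read, and a single deletion, and the function name `find_deletion_by_split` and the `min_seg` parameter are made up for this illustration.

```python
def find_deletion_by_split(ref, read, min_seg=4):
    """Toy split-read search: return (start, end) of a deleted reference
    interval [start, end), or None if no split explains the read.

    Assumptions (illustrative only): exact matching, no sequencing errors,
    and only the first occurrence of the left segment is considered.
    """
    for cut in range(min_seg, len(read) - min_seg + 1):
        left, right = read[:cut], read[cut:]
        i = ref.find(left)
        if i < 0:
            continue  # left segment does not map anywhere
        left_end = i + len(left)
        # Look for the right segment downstream of the left segment.
        j = ref.find(right, left_end)
        if j > left_end:
            # A gap between the two segments suggests a deletion.
            return left_end, j
    return None


ref = "ACGTACGGTTCAGCTTAGGCATCGATTGCA"
read = ref[:10] + ref[18:28]  # read spanning a deletion of ref[10:18]
print(find_deletion_by_split(ref, read))
```

In repeat-rich sequence, `ref.find` would return many equally good positions, which is the difficulty the paragraph above describes.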
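The second type uses read-pair information. Read pairs with an abnormal orientation and/or insert size can be identified and used to detect SVs. Read-pair information alone, however, does not give base-pair resolution. Variant callers such as Delly [ref. 19] therefore consider both read-pair and split-read information.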
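As a rough sketch of the insert-size part of this idea, the snippet below flags pairs whose insert size falls outside median ± (MAD × k). That formula is taken from the `--abInsParam` line of Vaquita's help text (default 9.0); everything else here, including the function names and the sample values, is an assumption made for illustration, and orientation checks are left out.

```python
from statistics import median


def discordant_insert_bounds(insert_sizes, k=9.0):
    """Return (low, high) bounds: median -/+ (MAD * k)."""
    med = median(insert_sizes)
    # Median absolute deviation: robust to the outliers we want to find.
    mad = median(abs(x - med) for x in insert_sizes)
    return med - k * mad, med + k * mad


def is_discordant(insert_size, bounds):
    """True if the insert size lies outside the expected range."""
    low, high = bounds
    return insert_size < low or insert_size > high


# Hypothetical insert sizes; the last pair spans a large deletion.
sizes = [290, 300, 310, 295, 305, 300, 2500]
bounds = discordant_insert_bounds(sizes)
print(bounds, [s for s in sizes if is_discordant(s, bounds)])
```

Using the median and MAD instead of the mean and standard deviation keeps a handful of very large inserts from inflating the bounds.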
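Third, many recently developed short-read mappers [ref. 12, 10] provide local alignment. These mappers perform soft clipping, meaning that only part of a read is aligned to the reference genome. The unaligned (clipped) sequences are relatively short and error-prone, which makes it hard to map them to a unique position. To address this, CREST [ref. 23] assembles contigs around potential breakpoints and maps them to the reference genome with Blat [ref. 9].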
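Soft clips appear in SAM/BAM records as `S` operations at either end of the CIGAR string. The following sketch pulls out clipped sequences from a CIGAR string and read sequence. The helper name `soft_clips` is hypothetical; the default `min_len=20` simply mirrors the `--minClippedSeqSize` default shown in Vaquita's help text and does not reproduce Vaquita's own code.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar, seq, min_len=20):
    """Return a list of (side, clipped_sequence) for soft clips >= min_len."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Hard-clipped bases (H) are absent from seq, so ignore them here.
    core = [(n, op) for n, op in ops if op != "H"]
    clips = []
    if core and core[0][1] == "S" and core[0][0] >= min_len:
        clips.append(("left", seq[:core[0][0]]))
    if len(core) > 1 and core[-1][1] == "S" and core[-1][0] >= min_len:
        clips.append(("right", seq[len(seq) - core[-1][0]:]))
    return clips


print(soft_clips("25S75M", "A" * 25 + "C" * 75))
print(soft_clips("70M30S", "C" * 70 + "G" * 30))
```

The clipped pieces are what a tool would then try to place elsewhere in the genome, directly or after local assembly as CREST does.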
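Finally, read-depth information is also useful for finding copy number variations. However, coverage depth in sequencing data is usually uneven [ref. 14], so significance testing such as event-wise testing [ref. 25] is needed to distinguish real signal from background noise. Read-depth information alone also does not give base-pair resolution.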
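A much simpler stand-in for significance testing is an outlier rule of the form Q3 + (IQR × k), which is the form quoted in Vaquita's help text for `--depthOutlier` and `--reThreshold` (default 1.0). The sketch below applies that rule to made-up depth values. It is a generic outlier fence, not event-wise testing and not Vaquita's implementation, and the names are illustrative.

```python
from statistics import quantiles


def depth_outlier_threshold(depths, k=1.0):
    """Return Q3 + (IQR * k) for a sample of background read depths."""
    q1, _q2, q3 = quantiles(depths, n=4)
    return q3 + (q3 - q1) * k


# Hypothetical per-window depths; one window looks like a duplication.
background = [28, 30, 29, 31, 30, 32, 27, 30, 31, 29, 90]
threshold = depth_outlier_threshold(background)
flagged = [d for d in background if d > threshold]
print(threshold, flagged)
```

Because quartiles ignore the extreme tail, a single high-coverage window barely moves the threshold, which helps with the uneven coverage mentioned above.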
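Combining the results of several approaches often gives better accuracy. Along these lines, LUMPY [ref. 11] uses a probabilistic framework that combines split-read and read-pair information by default, while MetaSV [ref. 16] focuses on connecting multiple external tools.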
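The authors' method, Vaquita, integrates split-read, read-pair, soft-clip, and read-depth information in a single program and, according to the authors, achieves the highest accuracy while remaining fast. Vaquita uses all four types of evidence without relying on external tools. The overall Vaquita workflow is shown in Figure 1 of the paper.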

Reproduced from http://drops.dagstuhl.de/opus/volltexte/2017/7635/pdf/LIPIcs-WABI-2017-13.pdf.

Installation

Tested on macOS 10.12.

Source: GitHub

https://github.com/seqan/vaquita

git clone https://github.com/seqan/vaquita.git 
mkdir vaquita-build && cd vaquita-build 
cmake ../vaquita && make vaquita -j 4
cd bin/

> ./vaquita

Vaquita - Possible commands

===========================

SYNOPSIS

Vaquita [COMMAND] [ARGUMENTS]

DESCRIPTION

Vaquita: Identification of Structural Variations using Combined Evidence

Developed by Jongkyu Kim (MPI for Molecular Genetics & Free University of Berlin).

Please visit https://github.com/seqan/vaquita for more information.

REQUIRED ARGUMENTS

COMMAND STRING

One of call and merge.

OPTIONS

-h, --help

Display the help message.

--version-check BOOL

Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,

0, OFF, FALSE, F, and NO. Default: 1.

--version

Display version information.

COMMAND

call

Identify structural variations in a single .bam file.

merge

Merge multilple .vcf files into a single file for multisample genotyping.

VERSION

Last update: 2018-02-20 15:47:37 +0100

Vaquita version: 0.4.0

SeqAn version: 2.4.0

Running vaquita call --help shows the more detailed options for call.

> vaquita call --help

Vaquita - Identification mode

=============================

SYNOPSIS

Vaquita call [OPTIONS] -r [reference.fa] [alignment.bam] > [out.vcf]

DESCRIPTION

Vaquita: Identification of Structural Variations using Combined Evidence

Developed by Jongkyu Kim (MPI for Molecular Genetics & Free University of Berlin).

Please visit https://github.com/seqan/vaquita for more information.

REQUIRED ARGUMENTS

ALIGNMENT(.bam) INPUT_FILE

Valid filetype is: .bam.

OPTIONS

-h, --help

Display the help message.

--version-check BOOL

Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,

0, OFF, FALSE, F, and NO. Default: 1.

--version

Display version information.

General:

-r, --referenceGenome INPUT_FILE

Genome sequence file(.fa). Valid filetype is: .fa.

-c, --cutoff INTEGER

Minimum number of supporting read-pairs and split-reads. Default: 4.

-v, --minVote INTEGER

Minimum number of evidence types(=vote) that support SVs for rescue. -1: Supported by all evidence types.

Default: -1.

-q, --minMapQual INTEGER

Mapping quaility cutoff. Default: 20.

-m, --minSVSize INTEGER

Structural varation size cutoff. Default: 50.

-a, --adjTol INTEGER

Positional adjacency in nucleotide resolution. Default: 50.

--report-filtered

Report filtered result

--no-pe

Do not use read-pair evidence.

--no-ce

Do not use soft-clipped evidence.

--no-re

Do not use read-depth evidence.

--no-rank-aggregation

Do not use rank-aggregation for prioritization.

Split-read evidence:

-ss, --maxSplit INTEGER

Maximum number of segments in a single read. Default: 2.

-so, --maxOverlap INTEGER

Maximum allowed overlaps between segements. Default: 20.

-se, --minSplitReadSupport DOUBLE

SVs supported by >= se get a vote. Default: 1.

Read-pair evidence:

-ps, --pairedEndSearchSize INTEGER

Size of breakpoint candidate regions. Default: 500.

-pi, --abInsParam DOUBLE

Discordant insertion size: median +/- (MAD * pi) Default: 9.0.

-pd, --depthOutlier DOUBLE

Depth outlier: {Q3 + (IQR * pd)} Default: 1.0.

-pe, --minPairSupport DOUBLE

SVs supported by >= pe get a vote. Default: 1.

Soft-clipped evidence:

-cs, --minClippedSeqSize INTEGER

Minimum size of clipped sequence to be considered. Default: 20.

-ce, --clippedSeqErrorRate DOUBLE

Maximum edit distance: floor{length of clipped sequence * (1 - ce)}. Default: 0.1.

Read-depth evidence:

-rs, --samplingNum INTEGER

Number of random sample to estimate the background distribution(Q3, IQR, ..) of read-depth evidence.

Default: 100000.

-rw, --readDepthWindowSize INTEGER

Window size to caclulate average read-depth around breakpoints. Default: 20.

--use-re-for-bs

Use RE for balanced SVs(eg. inverison).

-re, --reThreshold DOUBLE

SVs satisfy read-depth evidence >= {Q3 + (IQR * re)} get a vote. Default: 1.0.

VERSION

Last update: 2018-02-20 15:47:37 +0100

Vaquita version: 0.4.0

SeqAn version: 2.4.0

Run

Run it by specifying the reference and the BAM file (the bam.bai and fa.fai index files are also required).

vaquita call -r ref.fa input.bam > output.vcf

-r Genome sequence file(.fa). Valid filetype is: .fa.
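Before launching a run it can help to check the inputs. The sketch below is a hypothetical helper, not part of Vaquita: it checks the `.fa` and `.bam` extensions stated in the help text, looks for index files named by appending `.fai` and `.bai` (an assumption based on the "bam.bai" and "fa.fai" naming used in this post), and returns the command line as a string without executing it.

```python
import shlex
from pathlib import Path


def build_vaquita_call(ref, bam, out_vcf):
    """Return (command_string, problems) for a 'vaquita call' run.

    Only checks file names and index presence; it does not run anything.
    """
    ref, bam = Path(ref), Path(bam)
    problems = []
    if ref.suffix != ".fa":
        problems.append(f"reference must end in .fa: {ref.name}")
    if bam.suffix != ".bam":
        problems.append(f"alignment must end in .bam: {bam.name}")
    # Assumed index naming: ref.fa.fai and input.bam.bai next to the inputs.
    for index in (Path(str(ref) + ".fai"), Path(str(bam) + ".bai")):
        if not index.exists():
            problems.append(f"missing index file: {index.name}")
    cmd = "vaquita call -r {} {} > {}".format(
        shlex.quote(str(ref)), shlex.quote(str(bam)), shlex.quote(str(out_vcf))
    )
    return cmd, problems


cmd, problems = build_vaquita_call("ref.fasta", "input.bam", "output.vcf")
print(cmd)
print(problems)
```

If the index files are missing, `samtools faidx` and `samtools index` are the usual way to create them.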

The reference FASTA file must have the ".fa" extension. When the run finishes, a VCF file is written. The SV types it targets include deletions, duplications, and inversions.
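To get a quick overview of the output, the calls can be tallied by SV type. This sketch assumes each record carries an `SVTYPE=` key in the INFO column, which is the usual VCF convention for structural variants; I have not confirmed the exact fields Vaquita writes, so treat the field name and the sample records as assumptions.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count VCF records per SVTYPE value found in the INFO column."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = dict(
            item.split("=", 1) for item in fields[7].split(";") if "=" in item
        )
        counts[info.get("SVTYPE", "unknown")] += 1
    return counts


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1500\n",
    "chr1\t9000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=9400\n",
    "chr2\t5000\t.\tN\t<INV>\t.\tPASS\tSVTYPE=INV;END=7000\n",
]
print(count_sv_types(example))
```

For a real file, pass an open file handle, for example `count_sv_types(open("output.vcf"))`.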


Scripts used in the benchmark

https://github.com/xenigmax/vaquita_WABI2017

Building on macOS is easy and the commands are simple. I think it is a user-friendly tool.

Citation

Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence

Kim, Jongkyu ; Reinert, Knut

Workshop on Algorithmic Bioinformatics (WABI) 2017

DOI: 10.4230/LIPIcs.WABI.2017.13