Manta, a fast SV detection tool for germline and somatic analysis - macでインフォマティクス

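Whole-genome and enrichment sequencing are increasingly used in the clinic to discover inherited and somatic mutations, but tools for rapid discovery of structural variants (SVs) and indels in this scenario are limited. The authors address this gap with Manta, a new method that accurately detects and scores SVs, medium-sized indels and large insertions in a single, fast, unified process. Manta detects variants from the paired-end and split-read mapping evidence of a sequencing assay in an efficient parallel workflow. Many advanced structural variant callers focused on research and population genomics are currently available (Layer et al., 2014; Rausch et al., 2012; Sindi et al., 2012; Ye et al., 2009). To the authors' knowledge, however, none of them combines many variant types in a rapid workflow focused on individual samples or small sets of related samples. With its emphasis on clinical pipelines, Manta provides a complete solution for discovery, assembly and scoring using only a reference genome and alignments (BAM) from any standard read mapper. It provides scoring models for germline analysis of diploid individuals and somatic analysis of tumor-normal sample pairs, and further applications for RNA-Seq, de novo variants and unmatched tumors are under development. (Partly omitted.)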

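Manta's workflow is designed for high parallelism on individual samples or small sample sets. It operates in two phases. (Partly omitted.) All workflow components are described in detail in the Methods of the supplementary material (direct link).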

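Manta builds a graph of the regions in the BAM where SVs are predicted, then predicts variants from that graph. When a variant cannot be resolved precisely, it is predicted from paired-end information alone and tagged IMPRECISE.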
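As a rough illustration of how the IMPRECISE tag can be used downstream, the sketch below splits VCF data lines into precise and imprecise calls by looking for the `IMPRECISE` flag in the INFO column. The example records are made up for this post and are not real Manta output; the function names are my own.

```python
def is_imprecise(vcf_line: str) -> bool:
    """Return True if a VCF data line carries the IMPRECISE INFO flag."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th column of a VCF record
    return "IMPRECISE" in info.split(";")


def split_by_precision(lines):
    """Split VCF data lines into (precise, imprecise), skipping headers."""
    precise, imprecise = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        (imprecise if is_imprecise(line) else precise).append(line)
    return precise, imprecise


# Hypothetical records, for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tEND=2000;SVTYPE=DEL;SVLEN=-1000",
    "chr1\t5000\tsv2\tN\t<DEL>\t.\tPASS\tIMPRECISE;SVTYPE=DEL;END=9000",
]

precise, imprecise = split_by_precision(EXAMPLE_VCF)
print(len(precise), "precise /", len(imprecise), "imprecise")
```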


Manta's workflow. Reproduced from the supplementary material of the paper (direct link).

In the authors' environment, Manta reportedly processes the BAM of 50x Illumina sequencing data for Platinum Genomes NA12878 (from the CEPH pedigree), a widely used gold-standard sample (Genome in a Bottle (GIAB) data is another such resource), in under 20 minutes (environment: 20 physical cores using a dual Xeon E5-2680 v2 server with the BAM accessed from a conventional local drive, peak total memory (RSS) for this run was 2.35 Gb).

Hardware requirements

Memory: Typical memory requirements are <1Gb/core for germline analysis and <2Gb/core for cancer/FFPE/highly rearranged samples. The exact requirement depends on many factors including sequencing depth, read length, fragment size and sample quality.
CPU: Manta does not require or benefit from any specific modern CPU feature (e.g. NUMA, AVX..), but in general faster clock and larger caches will improve performance.
I/O: I/O can be roughly approximated as 1.1 reads of the input alignment file per analysis, with no writes that are significant relative to the alignment file size.
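The figures above can be turned into a back-of-the-envelope estimate. The sketch below treats the quoted per-core values as upper bounds and derives a job count and memory budget for a given machine; the function names and the simple "min of cores and memory" rule are my own assumptions, not something Manta provides.

```python
import math

# Per-core upper bounds quoted above, in GB.
GB_PER_CORE = {"germline": 1.0, "cancer": 2.0}


def suggest_resources(cores: int, total_mem_gb: float, analysis: str = "germline"):
    """Return (jobs, mem_gb): the largest job count the memory budget allows."""
    per_core = GB_PER_CORE[analysis]
    jobs = max(1, min(cores, int(total_mem_gb // per_core)))
    mem_gb = math.ceil(jobs * per_core)
    return jobs, mem_gb


def approx_read_io_gb(alignment_file_gb: float) -> float:
    """Approximate data read per analysis: about 1.1 passes over the input."""
    return 1.1 * alignment_file_gb


print(suggest_resources(8, 16, "germline"))
print(suggest_resources(20, 16, "cancer"))
```

The two returned values could then be passed to `runWorkflow.py` as `-j` and `--memGb`.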

Manta is a tool for WGS and WES data and supports the following analyses.

Joint analysis of small sets of diploid individuals (where 'small' means family-scale -- roughly 10 or fewer samples)
Subtractive analysis of a matched tumor/normal sample pair
Analysis of an individual tumor sample

Manta can detect the following SV subclasses.

Deletions
Insertions
    Fully-assembled insertions
    Partially-assembled (ie. inferred) insertions
Inversions
Tandem Duplications
Interchromosomal Translocations
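To see which of these classes actually appear in a call set, one option is to tally the `SVTYPE` key of the INFO column. This is only a minimal sketch: the records are made up, and the exact `SVTYPE` values Manta writes for each class should be checked against the user guide for your version.

```python
from collections import Counter


def info_to_dict(info: str) -> dict:
    """Parse a VCF INFO string; flags without '=' map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def count_svtypes(lines) -> Counter:
    """Count SVTYPE values over VCF data lines, skipping header lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        info = info_to_dict(line.rstrip("\n").split("\t")[7])
        counts[info.get("SVTYPE", "UNKNOWN")] += 1
    return counts


# Hypothetical records, for illustration only.
SVTYPE_EXAMPLE = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\ta\tN\t<DEL>\t.\tPASS\tEND=2000;SVTYPE=DEL",
    "chr1\t7000\tb\tN\t<DEL>\t.\tPASS\tIMPRECISE;END=9000;SVTYPE=DEL",
    "chr2\t300\tc\tN\t<INS>\t.\tPASS\tEND=300;SVTYPE=INS",
    "chr3\t500\td\tN\tN[chr5:900[\t.\tPASS\tSVTYPE=BND",
]

svtype_counts = count_svtypes(SVTYPE_EXAMPLE)
print(dict(svtype_counts))
```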

Manta cannot detect the following SV subclasses.

Dispersed duplications
Most expansion/contraction variants of a reference tandem repeat
Small inversions
    The limiting size is not tested, but in theory detection falls off below ~200bases. So-called micro-inversions might be detected indirectly as combined insertion/deletion variants.
Fully-assembled large insertions
    The maximum fully-assembled insertion size should correspond to approximately twice the read-pair fragment size, but note that power to fully assemble the insertion should fall off to impractical levels before this size
    Note that manta does detect and report very large insertions when the breakend signature of such an event is found, even though the inserted sequence cannot be fully assembled.

Manual

manta/docs/userGuide at master · Illumina/manta · GitHub


8/3: Fixed the docker commands

Installation

Tested with Docker on macOS 10.13.

Main repository: GitHub

The releases page offers the source code as well as prebuilt binaries for CentOS.

https://github.com/Illumina/manta/releases

On Linux it can also be installed with conda (link).

Here I use a Docker container.

docker pull eitanbanks/manta-1.0.3

> docker run --rm -it eitanbanks/manta-1.0.3 /bin/manta/bin/configManta.py -h

$ docker run --rm -i -t -v /Users/user/data/:/data eitanbanks/manta-1.0.3 /bin/manta/bin/configManta.py -h

Usage: configManta.py [options]

Version: 1.0.3

This script configures the Manta SV analysis pipeline.
You must specify a BAM or CRAM file for at least one sample.

Configuration will produce a workflow run script which
can execute the workflow on a single node or through
sge and resume any interrupted execution.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --config=FILE         provide a configuration file to override defaults in
                        global config file
                        (/usr/bin/manta/bin/configManta.py.ini)
  --allHelp             show all extended/hidden options

  Workflow options:
    --bam=FILE, --normalBam=FILE
                        Normal sample BAM or CRAM file. May be specified more
                        than once, multiple inputs will be treated as each BAM
                        file representing a different sample. [optional] (no
                        default)
    --tumorBam=FILE, --tumourBam=FILE
                        Tumor sample BAM or CRAM file. Only up to one tumor
                        bam file accepted. [optional] (no default)
    --exome             Set options for WES input: turn off depth filters
    --rna               Set options for RNA-Seq input: turn off depth filters
                        and don't treat anomalous reads as SV evidence when
                        the proper-pair bit is set.
    --unstrandedRNA     Set if RNA-Seq input is unstranded: Allows splice-
                        junctions on either strand
    --referenceFasta=FILE
                        samtools-indexed reference fasta file [required]
    --runDir=DIR        Run script and run output will be written to this
                        directory [required] (default: MantaWorkflow)

  Extended options:
    These options are either unlikely to be reset after initial site
    configuration or only of interest for workflow development/debugging.
    They will not be printed here if a default exists unless --allHelp is
    specified

    --scanSizeMb=INT    Maximum sequence region size (in megabases) scanned by
                        each task during SV Locus graph generation. (default:
                        12)
    --region=REGION     Limit the analysis to a region of the genome for
                        debugging purposes. If this argument is provided
                        multiple times all specified regions will be analyzed
                        together. All regions must be non-overlapping to get a
                        meaningful result. Examples: '--region chr20' (whole
                        chromosome), '--region chr2:100-2000 --region
                        chr3:2500-3000' (two regions)'
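The help text says that regions passed with `--region` must not overlap. A small pre-check along these lines could catch mistakes before configuring a run. It is a sketch under my own assumptions: the helper names are hypothetical, and it assumes contig names contain no colon.

```python
import re
from itertools import combinations

# 'chr20' (whole chromosome) or 'chr2:100-2000' (interval).
_REGION_RE = re.compile(r"^([^:]+)(?::(\d+)-(\d+))?$")


def parse_region(region: str):
    """Parse a region string into (chrom, start, end); start/end are None for a whole chromosome."""
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"cannot parse region: {region!r}")
    chrom, start, end = match.groups()
    if start is None:
        return chrom, None, None
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"start is greater than end: {region!r}")
    return chrom, start, end


def regions_overlap(a, b) -> bool:
    """True if two parsed regions share at least one position."""
    if a[0] != b[0]:
        return False
    if a[1] is None or b[1] is None:
        return True  # a whole chromosome overlaps anything on that chromosome
    return a[1] <= b[2] and b[1] <= a[2]


def region_args(regions):
    """Return ['--region', r, ...] after checking that no two regions overlap."""
    parsed = [parse_region(r) for r in regions]
    for a, b in combinations(parsed, 2):
        if regions_overlap(a, b):
            raise ValueError(f"overlapping regions: {a} and {b}")
    args = []
    for region in regions:
        args += ["--region", region]
    return args


print(region_args(["chr2:100-2000", "chr3:2500-3000"]))
```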

——

To display the help from the Docker image:

docker run --rm -it eitanbanks/manta-1.0.3 /bin/manta/bin/configManta.py -h

> runWorkflow.py -h # Running configManta.py as above generates this pyflow script.

Usage: runWorkflow.py [options]

Version: 1.4.0

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -m MODE, --mode=MODE  select run mode (local|sge)
  -q QUEUE, --queue=QUEUE
                        specify scheduler queue name
  -j JOBS, --jobs=JOBS  number of jobs, must be an integer or 'unlimited'
                        (default: Estimate total cores on this node for local
                        mode, 128 for sge mode)
  -g MEMGB, --memGb=MEMGB
                        gigabytes of memory available to run workflow -- only
                        meaningful in local mode, must be an integer (default:
                        Estimate the total memory for this node for local
                        mode, 'unlimited' for sge mode)
  -d, --dryRun          dryRun workflow code without actually running command-
                        tasks
  --quiet               Don't write any log output to stderr (but still write
                        to workspace/pyflow.data/logs/pyflow_log.txt)

  development debug options:
    --rescore           Reset task list to re-run hypothesis generation and
                        scoring without resetting graph generation.

  extended portability options (should not be needed by most users):
    --maxTaskRuntime=hh:mm:ss
                        Specify scheduler max runtime per task, argument is
                        provided to the 'h_rt' resource limit if using SGE (no
                        default)

Note this script can be re-run to continue the workflow run in case of
interruption. Also note that dryRun option has limited utility when
task definition depends on upstream task results -- in this case the
dry run will not cover the full 'live' run task set.

Running

1. First, map the reads and create a BAM. The authors use BWA-MEM version 0.7.5a as the mapper.

2. Create the configuration with bin/configManta.py. The inputs are the BAM (or CRAM) obtained from mapping and its reference FASTA. Both the BAM (CRAM) and the FASTA also need index files.
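Since missing index files are an easy thing to overlook, a quick check before configuring can save a failed run. The sketch below looks for the index names that `samtools index` and `samtools faidx` conventionally produce (`.bai`, `.crai`, `.fai`); the helper names are hypothetical, and I have not verified which alternative index names Manta itself accepts.

```python
from pathlib import Path


def index_candidates(path):
    """Return conventional index file locations for a BAM, CRAM or FASTA file."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".bam":
        return [Path(str(p) + ".bai"), p.with_suffix(".bai")]
    if suffix == ".cram":
        return [Path(str(p) + ".crai"), p.with_suffix(".crai")]
    if suffix in {".fa", ".fasta", ".fna"}:
        return [Path(str(p) + ".fai")]
    raise ValueError(f"unrecognised input type: {path}")


def missing_indexes(paths):
    """Return the inputs for which none of the candidate index files exist."""
    return [p for p in paths if not any(c.exists() for c in index_candidates(p))]


# The paths follow the examples in this post and are placeholders.
inputs = ["/data/NA12878_S1.bam", "/data/hg19.fa"]
for path in missing_indexes(inputs):
    print("index not found for", path)
```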

Single Diploid Sample Analysis

# Share ~/Documents/input_file/ on the host with /data in the image. Without Docker, the first line is unnecessary, and so is the /data/ prefix in the file paths.
docker run --rm -itv /Users/uesaka/Documents/input_file:/data eitanbanks/manta-1.0.3 \
/bin/manta/bin/configManta.py \
--bam=/data/NA12878_S1.bam  \
--referenceFasta=/data/hg19.fa  \
--runDir=/data/output

Joint Diploid Sample Analysis (for example, a cohort from the CEPH family; here, a trio)

docker run --rm -itv /Users/uesaka/Documents/input_file:/data eitanbanks/manta-1.0.3 \
/bin/manta/bin/configManta.py \
--bam=/data/NA12878_S1.cram \
--bam=/data/NA12891_S1.cram \
--bam=/data/NA12892_S1.cram \
--referenceFasta /data/hg19.fa \
--runDir=/data/output

Tumor Normal Analysis (case-control)

docker run --rm -itv /Users/uesaka/Documents/input_file:/data eitanbanks/manta-1.0.3 \
/bin/manta/bin/configManta.py \
--normalBam=/data/HCC1187BL.cram \
--tumorBam=/data/HCC1187C.cram \
--referenceFasta=/data/hg19.fa \
--runDir=/data/output

Tumor-Only Analysis

docker run --rm -itv /Users/uesaka/Documents/input_file:/data eitanbanks/manta-1.0.3 \
/bin/manta/bin/configManta.py \
--tumorBam=/data/HCC1187C.cram \
--referenceFasta=/data/hg19.fa \
--runDir=/data/output
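The four configurations above differ only in which `--bam` / `--tumorBam` arguments are given, so they can be generated from one helper. This is a minimal sketch using only options that appear in the help output; the function name is my own, the script path is the one inside this particular Docker image, and the `docker run ...` prefix is left out.

```python
def config_manta_args(reference_fasta, run_dir, normal_bams=(), tumor_bam=None,
                      exome=False,
                      config_script="/bin/manta/bin/configManta.py"):
    """Build the configManta.py argument list for one analysis."""
    if not normal_bams and tumor_bam is None:
        raise ValueError("a BAM or CRAM file is required for at least one sample")
    args = [config_script]
    args += [f"--bam={bam}" for bam in normal_bams]  # --normalBam is an alias
    if tumor_bam is not None:
        args.append(f"--tumorBam={tumor_bam}")  # at most one tumor is accepted
    if exome:
        args.append("--exome")
    args += [f"--referenceFasta={reference_fasta}", f"--runDir={run_dir}"]
    return args


# Joint diploid (trio) and tumor-only examples, mirroring the commands above.
trio = config_manta_args(
    "/data/hg19.fa", "/data/output",
    normal_bams=["/data/NA12878_S1.cram", "/data/NA12891_S1.cram",
                 "/data/NA12892_S1.cram"])
tumor_only = config_manta_args("/data/hg19.fa", "/data/output",
                               tumor_bam="/data/HCC1187C.cram")
print(" ".join(trio))
print(" ".join(tumor_only))
```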

3. Run the workflow script generated by the configuration step (single-node run here; see GitHub for running on an SGE cluster) (*1).

# Without Docker, the first line is unnecessary
docker run --rm -itv /Users/uesaka/Documents/input_file:/data eitanbanks/manta-1.0.3 \
/data/output/runWorkflow.py -m local -j 8 --memGb=20

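When the job finishes, the VCF files are written to results/variants. A germline analysis produces three VCFs (diploidSV, candidateSV and candidateSmallIndels). According to the Manta user guide, a tumor/normal analysis additionally produces somaticSV.vcf.gz, while tumorSV.vcf.gz is produced by a tumor-only analysis. The VCFs follow VCF version 4.1. The reported position is the left-shifted breakend coordinate of the SV.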

1. diploidSV.vcf.gz

2. somaticSV.vcf.gz

3. candidateSV.vcf.gz and its subset candidateSmallIndels.vcf.gz

4. tumorSV.vcf.gz
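Which of these files exist depends on the analysis type, so after a run it can be handy to list what is actually there and count the records. The sketch below assumes the `results/variants` layout described above and reads the compressed VCFs with Python's `gzip` module; the helper names are hypothetical.

```python
import gzip
from pathlib import Path

KNOWN_VCFS = (
    "diploidSV.vcf.gz",
    "somaticSV.vcf.gz",
    "candidateSV.vcf.gz",
    "candidateSmallIndels.vcf.gz",
    "tumorSV.vcf.gz",
)


def present_vcfs(run_dir):
    """Return the known Manta VCF names found under <run_dir>/results/variants."""
    variants_dir = Path(run_dir) / "results" / "variants"
    return [name for name in KNOWN_VCFS if (variants_dir / name).is_file()]


def count_records(vcf_gz_path):
    """Count non-header lines in a gzip-compressed VCF."""
    n_records = 0
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                n_records += 1
    return n_records


# "/data/output" is the run directory used in this post.
run_dir = Path("/data/output")
for name in present_vcfs(run_dir):
    print(name, count_records(run_dir / "results" / "variants" / name))
```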

Statistics files are also written.

1. alignmentStatsSummary.txt

2. svLocusGraphStats.tsv

3. svCandidateGenerationStats.tsv

4. svCandidateGenerationStats.xml

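When you have multiple samples, build each BAM with a distinct sample name in @RG (the SM: field of the @RG line). To fix this after the BAM has been created, see the samtools view part of the summary post.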
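One way to check this before a joint run is to pull the `SM:` values out of each BAM header and look for names shared between files. The sketch below works on header text (for example, the output of `samtools view -H`); the header strings and file names shown are made up, and the helper names are my own.

```python
def sample_names_from_header(header_text: str):
    """Return the SM values of all @RG lines in a SAM header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                names.append(field[3:])
    return names


def shared_sample_names(headers: dict) -> dict:
    """Map each sample name seen in more than one file to the sorted files using it."""
    files_by_sample = {}
    for path, header_text in headers.items():
        for name in set(sample_names_from_header(header_text)):
            files_by_sample.setdefault(name, []).append(path)
    return {name: sorted(paths) for name, paths in files_by_sample.items()
            if len(paths) > 1}


# Hypothetical headers, for illustration only.
HEADERS = {
    "child.bam": "@HD\tVN:1.5\n@RG\tID:rg1\tSM:sample\tPL:ILLUMINA\n",
    "father.bam": "@HD\tVN:1.5\n@RG\tID:rg1\tSM:sample\tPL:ILLUMINA\n",
    "mother.bam": "@HD\tVN:1.5\n@RG\tID:rg1\tSM:NA12892\tPL:ILLUMINA\n",
}

print(shared_sample_names(HEADERS))
```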

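The GitHub repository alone explains things quite thoroughly. Read the Markdown manual on GitHub (link) first, then see the supplementary material (direct link) of the paper, published in the "SEQUENCE ANALYSIS" section of Bioinformatics, for the methodological details.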

Citation

Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications
Chen X, Schulz-Trieglaff O, Shaw R, Barnes B, Schlesinger F, Källberg M, Cox AJ, Kruglyak S, Saunders CT

Bioinformatics. 2016 Apr 15;32(8):1220-2

When using "docker run", add a memory limit option as needed.

For example, to have the container killed once it exceeds 60 GB, pass -m "60g".
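If the container has a hard memory limit, it seems sensible to give the workflow a slightly smaller `--memGb` so that it stays under the limit. The sketch below builds such a command line; the 4 GB headroom is an arbitrary assumption of mine, the function name is hypothetical, and the image name and paths are the ones used in this post.

```python
def docker_run_workflow_args(host_dir, image, container_mem_gb, jobs,
                             headroom_gb=4,
                             run_script="/data/output/runWorkflow.py"):
    """Build a 'docker run' command with a memory limit and a matching --memGb."""
    if container_mem_gb <= headroom_gb:
        raise ValueError("container memory must exceed the headroom")
    workflow_mem_gb = container_mem_gb - headroom_gb  # assumption: leave some slack
    return [
        "docker", "run", "--rm", "-it",
        "-m", f"{container_mem_gb}g",   # docker memory limit
        "-v", f"{host_dir}:/data",      # share the input directory as /data
        image,
        run_script, "-m", "local", "-j", str(jobs),
        f"--memGb={workflow_mem_gb}",
    ]


cmd = docker_run_workflow_args("/Users/user/data", "eitanbanks/manta-1.0.3",
                               container_mem_gb=60, jobs=8)
print(" ".join(cmd))
```

Note that the first `-m` belongs to docker (memory limit) and the second to `runWorkflow.py` (run mode).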