Canary: an automated amplicon sequencing analysis pipeline for clinical use

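Clinical diagnostics is being transformed by technologies that can analyse patient DNA at the nucleotide level. The accuracy, turnaround time and reproducibility of clinical sequencing depend heavily on the bioinformatics pipelines that convert raw sequencing data into meaningful variants. These pipelines often require multiple software dependencies, lack portability, need complex parameter tuning, and rely on a cluster computing environment for parallel execution [from the paper, ref.1]. These attributes make pipelines difficult to deploy on site.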

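Here we introduce Canary, a standalone Java utility that performs the functions of a multi-tool pipeline. Canary can generate an annotated VCF file of variants directly from the compressed FASTQ files produced by an amplicon assay. Because Canary requires only a Java runtime, it can be deployed on any computer with Java installed, in contrast to the many dependencies of current pipelines. It is also available from a public repository as a Docker [ref.2] image, and can be installed and run with a single command on any platform that supports Docker (docker run -v /tmp/data:/canary.data dockercanary/canary).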

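Processing amplicon sequencing data with shotgun sequencing pipelines gives suboptimal results in both speed and quality [ref.3,4]. There are relatively few options for processing amplicon data outside commercial platforms such as Illumina BaseSpace [ref.5]. That platform supports only proprietary Illumina assays such as TruSeq Amplicon and TruSight; it cannot handle custom panels targeting other genes, and its amplicon analysis cannot be integrated into an in-house pipeline. Non-commercial amplicon software includes Mutascope [ref.4], which does not perform alignment; AmpliVar [ref.6], which does not call variants; and UNDR ROVER [ref.3], which does not normalise. Canary condenses the required pipeline steps into a single command that goes from compressed FASTQ files to an annotated VCF file suitable for clinical curation. The FASTQ files are assumed to have been quality-checked beforehand with a program such as FastQC [ref.7].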

To the authors' knowledge, Canary is the only tool that performs all the required pipeline steps (alignment, variant calling, normalisation, transcript selection and annotation) in a single executable.

For details of the processing steps Canary can perform, see the paper cited at the end of this post.

Installation

Tested in a Docker environment on macOS 10.12.

Dependencies

Java JDK 1.7 from Oracle
Groovy 2.1.9 from here
Gradle for building Canary (we use 1.10 at time of writing)
Genome Analysis Toolkit (Currently GATK 3.3) and the Sting utility JAR (currently 2.1.8) available from here
The PathOS Core library available from PathosCore-all-1.3.jar and maintained here.
JNI wrapper to the striped Smith-Waterman alignment library SSW see here

Source code (GitHub)

https://github.com/PapenfussLab/Canary

Because there are many dependencies, the Docker container is used for testing here.

docker run -v /tmp/data:/canary.data dockercanary/canary

# Here we open a shell inside the container and work there. Running it as in the official instructions above is also fine.
docker run -it --entrypoint 'bash' dockercanary/canary

> Canary
error: Missing required options: a, p
usage: Canary [options] read1.fastq.gz read2.fastq.gz
Available options (use -h for help):
-a,--amplicon <arg>       Amplicon FASTA file [required]
-ano,--annotation <arg>   File of MyVariant annotation fields
-b,--bam <arg>            Optional BAM file of alignment
-c,--complex              Coalesce complex events aka MNPs
-cols,--columns <arg>     File of VCF field names to output to TSV (one per line with optional alias after comma)
-d,--debug                Turn on debugging (Note: will generate large file of alignments [debug.out])
-f,--flank <arg>          Size of flanking region (bp) [5]
-fastq,--fastq <arg>      Optional FASTQ output files prefix
-filt,--filter <arg>      List of comma separated amplicon names to use
-h,--help                 This help message
-maxmut,--maxmut <arg>    Maximum number of mutations allowed per read pair [10]
-minpair,--minpair <arg>  Min read pairs for variants [10]
-mnpgap,--mnpgap <arg>    Maximum size of inter mutation gap for complex mutations [15]
-mnpmax,--mnpmax <arg>    Maximum size of complex mutations [30]
-mut,--mutalyzer <arg>    Mutalyzer annotation server host https://mutalyzer.nl
-n,--nocache              Dont use read cache
-norm,--normalise <arg>   Generates annotated VCF file from VCF output
-o,--output <arg>         Output report file
-p,--primers <arg>        Amplicon Primers file [required]
-r,--reads <arg>          Percent of reads to process [100]
-t,--tsv <arg>            TSV (Tab separated variable) file of VCF output
-ts,--transcript <arg>    File of transcripts mapping genes -> refseq (without version)
-v,--vcf <arg>            Found variants VCF file [canary.vcf]
-vaf,--vaf <arg>          Minimum VAF for variants [3.0%]
-ver,--version            Display Canary version and exit
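
Going by the usage message above, only --amplicon and --primers are required in addition to the two FASTQ files. A minimal invocation might therefore look like the sketch below. The wrapper name run_canary_minimal and the readability check are illustrative additions and not part of Canary; only the option names come from the help text, and the output is assumed to go to the default canary.vcf.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of Canary): check that the inputs are readable,
# then call Canary with only the two options the help text marks as required.
run_canary_minimal() {
    local amplicon="$1" primers="$2" read1="$3" read2="$4"
    local f
    for f in "$amplicon" "$primers" "$read1" "$read2"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable input: $f" >&2
            return 1
        fi
    done
    # With no --vcf option, the help text gives canary.vcf as the default output name.
    Canary --amplicon "$amplicon" --primers "$primers" "$read1" "$read2"
}
```

It could be called as run_canary_minimal amplicon.fa amplicon.primers.tsv sample_R1.fastq.gz sample_R2.fastq.gz, where all four file names are placeholders.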

Run

Use the bundled test-run shell script to confirm that everything works.

mkdir ~/test_dir
cd ~/test_dir

# Execute the test run
/opt/Canary/bin/runCanary.sh

Running this script automatically analyses the bundled amplicon sequencing data (/opt/Canary/Fastq), which covers 48 proto-oncogenes and was sequenced on a MiSeq from a TruSeq library.

The contents of the shell script are as follows.

> cat /opt/Canary/bin/runCanary.sh
#!/bin/bash
# Run Canary for testing
Canary --mutalyzer 'https://mutalyzer.nl' \
    --amplicon ${CANARY_HOME}/Amplicon/amplicon.fa \
    --primers ${CANARY_HOME}/Amplicon/amplicon.primers.tsv \
    --transcript ${CANARY_HOME}/etc/transcript.tsv \
    --columns ${CANARY_HOME}/etc/cols.txt \
    --annotation ${CANARY_HOME}/etc/myvariant.txt \
    --reads 100 \
    --vaf 3.0 \
    --normalise canary.norm.vcf \
    --tsv outvcf.tsv \
    --output canary.tsv \
    --vcf canary.vcf \
    --bam canary.bam \
    $* \
    ${CANARY_HOME}/Fastq/*R1_001.fastq.gz \
    ${CANARY_HOME}/Fastq/*R2_001.fastq.gz

> head ${CANARY_HOME}/Amplicon/amplicon.fa
>1:43814982-43815163
CCGTCCTGGGCCTGCTGCTGCTGAGGTGGCAGTTTCCTGCACACTACAGGTACCGCCCCC
GCCAGGCAGGAGACTGGCGGTGGACCAGGTGGAGCCGAAGGCCTGTAAACAGGCATTCTT
GGTTCGCTCTGTGACCCCAGATCTCCGTCCACCGCCCGTGCGCACCTACGGCTTCGCCAC
>1:115256500-115256680
ATTGGTCTCTCATGGCACTGTACTCTTCTTGTCCAGCTGTATCCAGTATGTCCAACAAAC
AGGTTTCACCATCTATAACCACTTGTTTTCTGTAAGAATCCTGGGGGTGTGGAGGGTAAG
GGGGCAGGGAGGGAGGGAAGTTCAATTTTTATTAAAAACCACAGGGAATGCAATGCTATT

> head ${CANARY_HOME}/Amplicon/amplicon.primers.tsv
1:43814982-43815163 24 26 MPL1_2.chr1.43815008.43815009_tile_1
1:115256500-115256680 26 27 NRAS1_7.chr1.115256528.115256531_tile_1
1:115258702-115258884 26 29 NRAS8_13.chr1.115258730.115258748_tile_1
2:29432636-29432822 27 27 ALK1.chr2.29432664.29432664_tile_1
2:29443667-29443836 25 26 ALK2.chr2.29443695.29443695_tile_1
2:132181238-132181750 30 25 Off_target_7_GNAQ_5.chr9.80409379.80409508_tile_1-GNAQ_7.chr9.80336240.80336429_tile_1
2:132181332-132181839 25 24 Off_target_9_GNAQ_7.chr9.80336240.80336429_tile_2-GNAQ_5.chr9.80409379.80409508_tile_2
2:132181448-132181600 25 28 Off_target_8_GNAQ_6.chr9.80343430.80343583_tile_1-GNAQ_7.chr9.80336240.80336429_tile_3
2:209113084-209113264 27 27 IDH1_1_2.chr2.209113112.209113113_tile_1
2:212288912-212289100 27 27 ERBB4_1_2.chr2.212288942.212288955_tile_1

To analyse your own data, prepare an amplicon FASTA file and a primers file like these. See the GitHub repository for details of the primer file format.
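
Before running Canary on a custom panel, it may be worth checking that the two files agree with each other. The sketch below assumes, based only on the excerpts above, that the first column of the primers file holds the same name as the corresponding FASTA header. The function name missing_amplicons is made up for illustration and is not part of Canary.

```shell
#!/bin/bash
# Hypothetical helper: list primer entries whose amplicon name (column 1)
# has no matching header in the amplicon FASTA file.
missing_amplicons() {
    local fasta="$1" primers="$2"
    awk '
        FILENAME == ARGV[1] {                  # first file: the FASTA
            if (/^>/) seen[substr($1, 2)] = 1  # remember the header name without ">"
            next
        }
        NF > 0 && !($1 in seen) { print $1 }   # second file: report unmatched primer names
    ' "$fasta" "$primers"
}
```

Calling missing_amplicons amplicon.fa amplicon.primers.tsv prints nothing when every primer entry has a FASTA record, and otherwise prints one unmatched amplicon name per line.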

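When the run finishes, a BAM file is written along with files that summarise the variants with annotations (canary.vcf and outvcf.tsv). canary.vcf is an annotated VCF file of the variants found. outvcf.tsv is a tab-separated file laid out so that it is easy to load into a spreadsheet or database.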

# head -n 50 outvcf.tsv | tail -n 10
##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting bases on forward strand (reads1plus)">
##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting bases on reverse strand (reads1minus)">
##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting bases on forward strand (reads2plus)">
##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting bases on reverse strand (reads2minus)">
##unpack="expanded by org.petermac.util.Vcf.unpack()"
#chr position ID REF altbases QUAL FILTER GT GQ HGVSg HGVSc HGVSp gene genename consequence cadd_raw mutdb.mutpred_score mutdb.cosmic_id mutdb.uniprot_id mutdb.strand mutdb.rsid clinvar.allele_id clinvar.gene.id clinvar.gene.symbol clinvar.cytogenic clinvar.variant_id clinvar.rcv
2 212578379 . TA T . PASS 0/1 chr2:g.212578380del
2 212578379 . TAA T . PASS 0/1 chr2:g.212578380_212578381del
2 212578379 . TAAA T . PASS 0/1 chr2:g.212578380_212578382del
2 212578379 . T TA . PASS 0/1 chr2:g.212578379_212578380insA
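
Because outvcf.tsv keeps the VCF-style "##" metadata lines ahead of the column header, it may need light filtering before it is loaded elsewhere. The sketch below assumes the file is tab-separated and that the header line starts with "#chr", as in the excerpt above. The function name pass_variants is hypothetical. Columns are looked up by header name, not by position, so the sketch does not depend on the exact column order.

```shell
#!/bin/bash
# Hypothetical helper: print chr, position, REF, altbases and HGVSg
# for every row of a Canary TSV whose FILTER column is PASS.
pass_variants() {
    awk -F'\t' -v OFS='\t' '
        /^##/   { next }                                         # skip metadata lines
        /^#chr/ { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map header names to column numbers
        $(col["FILTER"]) == "PASS" {
            print $1, $2, $(col["REF"]), $(col["altbases"]), $(col["HGVSg"])
        }
    ' "$1"
}
```

For example, pass_variants outvcf.tsv > pass_only.tsv would write a reduced table, where pass_only.tsv is a placeholder output name.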

Reference

Canary: an atomic pipeline for clinical amplicon assays

Doig KD, Ellul J, Fellowes A, Thompson ER, Ryland G, Blombery P, Papenfuss AT, Fox SB

BMC Bioinformatics. 2017 Dec 15;18(1):555.