MutScan: a pipeline for variant calling and visualization

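Next-generation sequencing (NGS) can detect thousands of mutations, but in some applications only a few of them are targets of interest. In applications such as personalized cancer testing with NGS, clinicians and genetic counselors usually focus on detecting druggable mutations [from the paper, ref.1] (partly abridged). These mutations can be detected by deep sequencing of a patient's cell-free tumor DNA (ctDNA) [ref.3]. However, the mutant allele frequency (MAF) of variants called from ctDNA sequencing is very low: typically 5% or less, and it can be as low as 0.1% [ref.4 pubmed]. The need to detect mutations at such low MAFs drives the development of highly sensitive methods for analyzing ctDNA sequencing data [ref.4].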
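To get a feel for what a MAF of 0.1% means in read counts, here is a small back-of-the-envelope sketch. It assumes that each read covering the site carries the mutation independently with probability equal to the MAF (a simple binomial model, not something taken from the paper); the function names and the depths used are illustrative only.

```python
from math import comb


def expected_mutant_reads(depth, maf):
    """Expected number of reads carrying the mutation at a given depth."""
    return depth * maf


def prob_at_least(depth, maf, min_support):
    """P(X >= min_support) for X ~ Binomial(depth, maf)."""
    p_below = sum(
        comb(depth, k) * maf**k * (1 - maf) ** (depth - k)
        for k in range(min_support)
    )
    return 1.0 - p_below


if __name__ == "__main__":
    # Hypothetical depths; MAF 0.1% is the low end quoted in the text.
    for depth in (1000, 10000):
        print(
            depth,
            expected_mutant_reads(depth, 0.001),
            prob_at_least(depth, 0.001, min_support=2),
        )
```

Under this model, a depth of 1,000 yields on average a single mutant read, so requiring even two supporting reads would often miss the variant. This is one way to see why very deep sequencing and a sensitive read search both matter for ctDNA.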

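A conventional mutation detection pipeline for NGS data usually involves a different tool for each step. For example, the authors' usual tumor variant calling pipeline requires many tools, such as AfterQC [ref.5, introduction] for data preprocessing, BWA [ref.6] for mapping, Samtools [ref.7] for pileup generation, and VarScan2 [ref.8] for variant calling. Because the tools used in these steps apply different filters, information can be lost along the way, and mutations, especially those with a low MAF, may end up being missed. False negatives from this kind of data analysis are unacceptable in clinical applications, because they can cost a patient the chance of a better treatment.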

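On the other hand, false-positive detection of key mutations should also be avoided, because it may lead to an expensive but ineffective treatment. A wrong therapy chosen because of false positives can even cause serious side effects [ref.10 pubmed]. Conventional NGS pipelines can detect many substitutions and INDELs, but they inevitably produce false calls. In particular, false-positive mutations may be called in highly repetitive regions of the genome because the aligner maps reads to the reference genome inaccurately. To reduce this false call rate, every important mutation needs to be validated [ref.11]. Variant visualization is an important way to check the reliability of a mutation manually. Tools such as IGV [ref.12] and GenomeBrowse can visualize variants, but they require slow and inefficient operations on BAM files. They are especially inconvenient for visualizing low-MAF mutations in very deep sequencing data, because it is hard to locate the mutated reads among thousands of reads. A fast, lightweight, cloud-friendly variant visualization tool is therefore needed.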

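MutScan, the tool introduced here, is designed specifically to address these problems. It is built on an error-tolerant string search algorithm and uses rolling hashes [ref.13] and Bloom filters [ref.14] to optimize speed. MutScan can run in a reference-free mode to detect target mutations that are supplied in a CSV file or predefined in the program. Given a VCF file and the FASTA file of the corresponding reference genome, MutScan can also scan every variant in the VCF and visualize it by rendering an HTML page for each variant.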
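The following is a minimal sketch of how a rolling hash and a Bloom filter can speed up an error-tolerant search of reads for a target sequence. It is not MutScan's actual implementation: the k-mer size, the hash base, the filter size, the Hamming-distance verification step, and all function names are assumptions made for illustration. The idea is that k-mer hashes of the targets go into the Bloom filter, reads sharing no k-mer with any target are skipped cheaply, and only the remaining candidates go through the slower mismatch-tolerant comparison.

```python
BASE = 5
MOD = (1 << 61) - 1
CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def rolling_hashes(seq, k):
    """Yield (start, hash) for each k-mer, updating the hash in O(1) per step."""
    if len(seq) < k:
        return
    high = pow(BASE, k - 1, MOD)  # weight of the base that leaves the window
    h = 0
    for ch in seq[:k]:
        h = (h * BASE + CODE[ch]) % MOD
    yield 0, h
    for i in range(k, len(seq)):
        h = (h - CODE[seq[i - k]] * high) % MOD  # drop the oldest base
        h = (h * BASE + CODE[seq[i]]) % MOD      # append the newest base
        yield i - k + 1, h


class BloomFilter:
    """Small Bloom filter over integers (may give false positives, never false negatives)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, value):
        # Derive several bit positions from one integer by double hashing.
        h1 = value % self.n_bits
        h2 = 1 + (value // self.n_bits) % (self.n_bits - 1)
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, value):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))


def build_filter(targets, k):
    """Insert the hash of every k-mer of every target into a Bloom filter."""
    bloom = BloomFilter()
    for target in targets:
        for _, h in rolling_hashes(target, k):
            bloom.add(h)
    return bloom


def candidate_reads(reads, bloom, k):
    """Keep only reads with at least one k-mer whose hash may be in the filter."""
    return [r for r in reads if any(h in bloom for _, h in rolling_hashes(r, k))]


def hamming_match(read, target, max_mismatch=1):
    """Verification step: (offset, mismatches) where target fits the read."""
    hits = []
    for i in range(len(read) - len(target) + 1):
        mm = sum(a != b for a, b in zip(read[i:i + len(target)], target))
        if mm <= max_mismatch:
            hits.append((i, mm))
    return hits
```

A read passes the pre-filter if any of its k-mers may be present in the filter, so a read with a sequencing error inside the target can still be found as long as one k-mer remains intact. This is the reason the verification step tolerates mismatches.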

MutScan workflow. Reproduced from the paper.

Features

Ultra sensitive, guarantee that all reads supporting the mutations will be detected
Can be 50X+ faster than normal pipeline (i.e. BWA + Samtools + GATK/VarScan/Mutect).
Very easy to use and need nothing else. No alignment, no reference genome, no variant call, no...
Contains built-in most actionable mutation points for cancer-related mutations, like EGFR p.L858R, BRAF p.V600E...
Beautiful and informative HTML report with informative pileup visualization.
Multi-threading support.
Supports both single-end and pair-end data.
For pair-end data, MutScan will try to merge each pair, and do quality adjustment and error correction.
Able to scan the mutations in a VCF file, which can be used to visualize called variants.
Can be used to filter false-positive mutations. i.e. MutScan can handle highly repetitive sequence to avoid false INDEL calling.
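The paired-end merging mentioned in the feature list can be pictured with the sketch below. It is only an illustration under stated assumptions, not MutScan's code: the pair is assumed to overlap at the 3' end of read 1 without read-through, the thresholds `min_overlap` and `max_mismatch` are hypothetical, and qualities are compared as Phred+33 characters. Inside the overlap, a disagreement is resolved in favor of the base with the higher quality.

```python
COMP = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMP)[::-1]


def merge_pair(r1, q1, r2, q2, min_overlap=10, max_mismatch=1):
    """Merge an overlapping read pair; return (seq, qual) or None if no overlap is found."""
    s2, p2 = revcomp(r2), q2[::-1]  # bring read 2 onto the strand of read 1
    # Try the longest overlap first, then shorter ones.
    for olen in range(min(len(r1), len(s2)), min_overlap - 1, -1):
        a, b = r1[-olen:], s2[:olen]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches > max_mismatch:
            continue
        seq, qual = [], []
        for x, qx, y, qy in zip(a, q1[-olen:], b, p2[:olen]):
            if x == y:
                seq.append(x)
                qual.append(max(qx, qy))
            elif qx >= qy:  # disagreement: trust the higher-quality base
                seq.append(x)
                qual.append(qx)
            else:
                seq.append(y)
                qual.append(qy)
        merged_seq = r1[:-olen] + "".join(seq) + s2[olen:]
        merged_qual = q1[:-olen] + "".join(qual) + p2[olen:]
        return merged_seq, merged_qual
    return None
```

Merging turns the two reads into one longer sequence, and a base covered by both reads gets two independent observations. That is what makes correcting a low-quality mismatch possible.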

Intended scenarios

you are interested in certain mutations (like druggable cancer mutations), and want to check whether the given FastQ files contain them.
you do not have enough confidence in the mutations called by your pipeline, so you want to visualize and validate them to avoid false positive calling.
you worry that your pipeline uses too strict filtering and may cause some false negatives, so you want to check that in a fast way.
you want to visualize the called mutation and take a screenshot with its clear pileup information.
you called a lot of INDEL mutations, and you worry that they are mainly false positives (especially in highly repetitive regions)
you want to validate and visualize every record in the VCF called by your pipeline.

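For cancer-related mutations the built-in database can be used, so mutation analysis can be run from scratch on raw sequencing data. In other cases, you first need to analyze the data with another tool to obtain a VCF (or list the target mutations in a CSV). Feeding the VCF to MutScan then gives you simple filtering and visualization of the results.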

A sample report is provided on GitHub.

http://opengene.org/MutScan/report.html

Installation

Tested on macOS 10.12.

GitHub

GitHub - OpenGene/MutScan: Detect and visualize target mutations by scanning FastQ files directly

git clone https://github.com/OpenGene/mutscan.git
cd mutscan 
make
sudo make install

$ mutscan

usage: mutscan --read1=string [options] ...

options:

-1, --read1 read1 file name (string)

-2, --read2 read2 file name (string [=])

-m, --mutation mutation file name, can be a CSV format or a VCF format (string [=])

-r, --ref reference fasta file name (only needed when mutation file is a VCF) (string [=])

-h, --html filename of html report, default is mutscan.html in work directory (string [=mutscan.html])

-t, --thread worker thread number, default is 4 (int [=4])

-S, --support min read support for reporting a mutation, default is 2 (int [=2])

-k, --mark when mutation file is a vcf file, --mark means only process the records with FILTER column is M

-l, --legacy use legacy mode, usually much slower but may be able to find a little more reads in certain case

-s, --standalone output standalone HTML report with single file. Don't use this option when scanning too many target mutations (i.e. >1000 mutations)

--simplified simplified mode uses less RAM but reports less information. This option can be auto/on/off, by default it's auto, which means automatically enabled when processing large FASTQ with large VCF. (string [=auto])

-v, --verbose enable verbose mode, more information will be output in STDERR

-?, --help print this message

Running

Download the test data.

http://opengene.org/dataset.html

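To search for cancer-related mutations, the reads are matched against the built-in database, so you only need to specify the paired-end data to run it ("MutScan contains a built-in list with most actionable gene mutations for cancer diagnosis [18]." From the paper).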

mutscan -1 R1.fq.gz -2 R2.fq.gz -t 8

-1 read1 file name (string)
-2 read2 file name (string [=])
-t worker thread number, default is 4 (int [=4])

Single-end sequencing data is specified with "-1" alone. To set the output file name, add the -h flag, e.g. "-h output.html".

The run result is written as HTML.

Open the HTML. The called sites are listed in a summary.

Open one entry. The C in the middle is the mutation site; six reads support the mutation.

As noted on the page, the text color indicates base quality; red marks low-quality bases.

Clicking a read shows the raw sequence read.

By default, a mutation is not reported unless at least two reads support it. To change the filtering sensitivity, add the -S flag (e.g., "-S 1").
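The effect of the minimum-support threshold can be illustrated with a few lines of Python. This is only a conceptual sketch of the filtering rule described above, not MutScan code; the read counts are made-up values, and the function name is hypothetical.

```python
def filter_by_support(support_counts, min_support=2):
    """Keep mutations supported by at least min_support reads."""
    return {
        name: n_reads
        for name, n_reads in support_counts.items()
        if n_reads >= min_support
    }


if __name__ == "__main__":
    # Hypothetical supporting-read counts per target mutation.
    counts = {"EGFR p.L858R": 6, "BRAF p.V600E": 1, "NRAS p.G12C": 2}
    print(filter_by_support(counts))                 # default threshold of 2
    print(filter_by_support(counts, min_support=1))  # corresponds to "-S 1"
```

Lowering the threshold to 1 reports mutations seen in a single read, which raises sensitivity at the cost of letting more sequencing errors through.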

Redirecting standard output gives the results as plain text.

mutscan -1 R1.fq.gz -2 R2.fq.gz -t 8 > result.txt

$ head result.txt

---------------

NRAS-neg-1-115258748-2-c.34G>T-p.G12C-COSM562 GGATTGTCAGTGCGCTTTTCCCAACACCAC A TGCTCCAACCACCACCAGTTTGTACTCAGT chr1

1, pos: 86, distance: 0, reverse

@NB551106:59:HTFV3BGX2:1:11102:9413:4074 1:N:0:AGTCAA

GGGCCTCACCTCTATGGTGGGATCATATTCATCTACAAAGTGGTTCTGGATTAGCT GGATTGTCAGTGCGCTTTTCCCAACACCAC A TGCTCCAACCACCACCAGTTTGTACTCAGT CATTTCACACCAGCAAGAACCTGTTGGAAACC

AAEEEAEEEEEEEEEAAEAAAEEEEEE/EEEAEEEEAEEEEEEEAEEEAEEEEEEE AEEAEEEEEEEEEEEEEEEAEEEEEEEEEE E AEEEEEEEEEEEEEEEEE/EEEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA

2, pos: 86, distance: 0, reverse

@NB551106:59:HTFV3BGX2:1:11102:9413:4074 1:N:0:AGTCAA
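If you want to post-process this text output, a small parser along the following lines may be a starting point. The layout is inferred only from the excerpt above (a five-field target line, then numbered match lines with "pos", "distance" and a strand word), so treat the field names and the regular expression as assumptions rather than a documented format.

```python
import re
from typing import NamedTuple


class TargetHeader(NamedTuple):
    name: str
    left: str
    center: str
    right: str
    chrom: str


# Pattern for lines such as "1, pos: 86, distance: 0, reverse" (inferred format).
MATCH_RE = re.compile(r"^(\d+), pos: (\d+), distance: (\d+), (\w+)$")


def parse_target_header(line):
    """Split a target line into name, left flank, mutated base(s), right flank, chromosome."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 whitespace-separated fields, got {len(fields)}")
    return TargetHeader(*fields)


def parse_match_line(line):
    """Parse a numbered match line; return None if the line has another shape."""
    m = MATCH_RE.match(line.strip())
    if m is None:
        return None
    index, pos, distance, strand = m.groups()
    return {
        "index": int(index),
        "pos": int(pos),
        "distance": int(distance),
        "strand": strand,
    }
```

Counting the parsed match lines under each target line gives the number of supporting reads per mutation.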

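To visualize mutations outside the built-in cancer mutation database, run MutScan with a VCF. When a VCF is given, the reference file must be given as well (with a .csv, no reference is needed).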

mutscan -1 R1.fq.gz -2 R2.fq.gz -t 8 -r ref.fa -m input.vcf

-m mutation file name, can be a CSV format or a VCF format
-r reference fasta file name (only needed when mutation file is a VCF) (string [=])

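When a VCF is supplied, MutScan does not depend on the built-in database, so any mutation can be analyzed and visualized (this is not stated in the documentation, but it also works for non-human data). SVs can probably be visualized too, but for SVs longer than about 50 to 100 bp the view becomes very wide and hard to read, so a dedicated SV visualization tool (samplot, SVPV) is likely a better fit.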

Citation

MutScan: fast detection and visualization of target mutations by scanning FASTQ data

Shifu Chen, Tanxiao Huang, Tiexiang Wen, Hong Li, Mingyan Xu, and Jia Gu

BMC Bioinformatics. 2018; 19: 16.