Visualizing variant-call VCF files with VIVA - macでインフォマティクス

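Next-generation sequencing generates enormous amounts of genomic data, and the amount of genomic information varies from study to study. The variant detection process produces a variety of file formats. One format commonly used in sequence analysis is the Variant Call Format (VCF). It is a text file format produced during variant calling that records the positions of variants in the genome. Its structure holds variant information such as sample genotypes and read depth data at each genomic position. Read depth is a measure of sequencing coverage at each variant position for each patient. A VCF file contains meta-information lines, a header line that includes the sample IDs, and data lines for the genomic positions of specific variants. As next-generation sequencing becomes more accessible to researchers and clinicians, there is a need to retrieve and visualize genomic data from VCF files easily. This would benefit translational and personalized medicine.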
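To make the three parts of a VCF file concrete (meta-information lines, the header with sample IDs, and data lines carrying genotype and read depth), here is a minimal Python sketch. It is not how VIVA reads files; the tiny in-memory VCF, the sample names, and the `parse_vcf` helper are all made up for illustration, and it assumes each data line has a FORMAT column of `GT:DP`.

```python
# Illustrative only: a tiny, hand-written VCF with two hypothetical samples.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2
chr4\t20000100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:35\t1/1:120
chr4\t20000200\t.\tC\tT\t12\tq10\t.\tGT:DP\t0/0:28\t./.:.
"""


def parse_vcf(text):
    """Split VCF text into meta lines, sample IDs, and per-variant records."""
    meta, samples, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            # Meta-information lines describe the file and its fields.
            meta.append(line)
        elif line.startswith("#CHROM"):
            # The header line: sample IDs start at the tenth column.
            samples = line.split("\t")[9:]
        elif line:
            fields = line.split("\t")
            keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
            calls = {
                sample: dict(zip(keys, value.split(":")))
                for sample, value in zip(samples, fields[9:])
            }
            records.append(
                {
                    "chrom": fields[0],
                    "pos": int(fields[1]),
                    "filter": fields[6],
                    "calls": calls,
                }
            )
    return meta, samples, records


meta, samples, records = parse_vcf(SAMPLE_VCF)
print(samples)
print(records[0]["calls"]["sample1"])
```

Each record keeps the genotype (`GT`) and read depth (`DP`) per sample as strings; a missing call appears as `./.` with a depth of `.`.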

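Interpreting data from VCF files poses several challenges. The files are often very large, so the ability to process them is limited by computational resources. To make memory-efficient data retrieval possible, existing VCF parsing and visualization tools require users to preprocess their VCF files: the file must be compressed and sorted by genomic position before it can be subset with an external program such as VCFtools or indexed with Tabix. In addition, the VCF data structure is dense and hard to interpret in its raw form, so data queries are needed to draw insights from it. An easy-to-use VCF parsing and visualization tool is needed to promote efficient interpretation and data sharing.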

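The paper introduces “Visualization of Variants” (VIVA), a command-line utility and Jupyter Notebook-based tool for evaluating and sharing genomic data from VCF files for variant analysis and quality control of sequencing experiments. Compared with existing similar tools such as vcfR, IGV, Genome Browser, Genome Savant, svviz, and jvarkit – JfxNgs, VIVA offers flexibility, efficiency, and ease of use. Its distinguishing features are: (1) no preprocessing of VCF files (compression, sorting, or indexing) is required; (2) data can be sorted and visualized by sample metadata; (3) no coding is required; (4) a variety of publication-quality output formats; (5) interactive HTML5 output for real-time data exploration and sharing; and (6) heatmap data can be exported as a text-file matrix for analysis with other tools.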

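To achieve this, VIVA is written in Julia, a high-level, high-performance dynamic programming language for numerical computing. VIVA is the first tool of its kind written in Julia, and it can be integrated with other tools and workflows hosted by BioJulia, the Julia community for biologists and bioinformaticians.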

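Using VIVA involves three main steps, shown in Figure 1 of the paper.

(1) Submit the input files and, if needed, choose filtering options.

(2) VIVA reads the VCF file and processes the data.

(3) VIVA creates the graphs and exports the output files.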
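The filtering in step (1) can be pictured with a short Python sketch. It is only an illustration of the idea behind the `-r` (genomic range) and `-p` (PASS filter) options listed in the help output, not VIVA's actual implementation. The `parse_range` and `select_rows` helpers and the example rows are hypothetical, and treating both range endpoints as inclusive is an assumption.

```python
def parse_range(text):
    """Parse a range string such as 'chr4:20000000-30000000'."""
    chrom, span = text.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def select_rows(rows, genomic_range=None, pass_only=False):
    """Keep (chrom, pos, filter) rows that satisfy the chosen filters.

    Assumption: both endpoints of the range are inclusive.
    """
    wanted = parse_range(genomic_range) if genomic_range else None
    selected = []
    for chrom, pos, filter_value in rows:
        if pass_only and filter_value != "PASS":
            continue
        if wanted is not None:
            want_chrom, start, end = wanted
            if chrom != want_chrom or not (start <= pos <= end):
                continue
        selected.append((chrom, pos, filter_value))
    return selected


# Hypothetical variant rows: (chromosome, position, FILTER field).
rows = [
    ("chr4", 20000100, "PASS"),
    ("chr4", 20000200, "q10"),
    ("chr4", 31000000, "PASS"),
    ("chr5", 25000000, "PASS"),
]

print(select_rows(rows, genomic_range="chr4:20000000-30000000", pass_only=True))
```

Applying both filters keeps only the rows that are inside the range and marked PASS; leaving both options off returns every row unchanged.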

(Several paragraphs omitted)

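VIVA supports multiple visualization options. These include heatmaps of genotype or read depth data, with samples in columns and variant positions in rows. The genotype heatmap is a categorical heatmap that shows the genotype value (homozygous reference, heterozygous variant, homozygous variant, or no call) for all selected samples and variants. The read depth heatmap plots continuous read depth values from 0 to 100, or no call. A "no call" indicates that data quality was low during the VCF generation steps up to that point. Read depth outliers above 100 are capped so that resolution among lower read depth values is not lost in the visualization. The cap of 100 was chosen because, for most purposes, a read depth of 30 or more is sufficient for a variant to be included in analysis. VIVA can also generate scatter plots of average read depth across samples or across variant positions. As shown in Figure 2 of the paper, users can use these plots to identify problematic samples or variants in regions that are hard to sequence. In addition, users can choose to save labeled data matrices of representative genotype values or continuous read depth values for external analysis.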
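The genotype categories, the read depth cap, and the average read depth described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than VIVA's code: the category names, the helper functions, and the small depth matrix are hypothetical, and excluding no-calls from the averages and averaging the capped values are assumptions.

```python
def genotype_category(gt):
    """Map a VCF GT string such as '0/1' or '1|1' to a heatmap category.

    The category names are hypothetical labels for this sketch.
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no_call"
    if all(allele == "0" for allele in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_var"
    return "het"


def cap_depth(depth, cap=100):
    """Clip a read depth at the cap; None stands for a no-call."""
    return None if depth is None else min(depth, cap)


def mean_depth(values):
    """Average the called depths; assumption: no-calls are skipped."""
    called = [value for value in values if value is not None]
    return sum(called) / len(called) if called else None


# Rows are variant positions, columns are samples (as in the heatmaps).
depth = [
    [35, 120],
    [28, None],
]

capped = [[cap_depth(value) for value in row] for row in depth]
variant_means = [mean_depth(row) for row in capped]
sample_means = [mean_depth(column) for column in zip(*capped)]

print(capped)
print(variant_means)
print(sample_means)
```

Capping at 100 keeps a single very deep call (120 here) from stretching the color scale, which is the reason the text gives for the limit.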

Installation

Tested with Julia 1.2.0 on Ubuntu 18.04.

Dependencies

#julia

Install Julia by following the instructions on its website.

VIVA itself: GitHub

# To install, just run the commands below (the package is fetched and built).
julia
]add VariantVisualization
exit()

>viva -h

usage: viva -f VCF_FILE [-o OUTPUT_DIRECTORY] [-s SAVE_FORMAT]
            [-r GENOMIC_RANGE] [-p] [-l POSITIONS_LIST]
            [-g GROUP_SAMPLES GROUP_SAMPLES]
            [--select_samples SELECT_SAMPLES] [-m HEATMAP]
            [-y Y_AXIS_LABELS] [-x] [-n] [-t HEATMAP_TITLE]
            [--avg_dp AVG_DP] [--save_remotely] [-h]

VIVA VCF Visualization Tool is a tool for creating publication quality
plots of data contained within VCF files. For a complete description
of features with examples read the docs here
https://github.com/compbiocore/VariantVisualization.jl

optional arguments:
  -f, --vcf_file VCF_FILE
                        vcf filename in format: file.vcf
  -o, --output_directory OUTPUT_DIRECTORY
                        function checks if directory exists and saves
                        there, if not creates and saves here (default:
                        "output")
  -s, --save_format SAVE_FORMAT
                        file format you wish to save graphics as (eg.
                        pdf, html, png). Defaults to html (default:
                        "html")
  -r, --genomic_range GENOMIC_RANGE
                        select rows within a given chromosome range.
                        Provide chromosome range after this flag in
                        format chr4:20000000-30000000.
  -p, --pass_filter     select rows with PASS in the FILTER field.
  -l, --positions_list POSITIONS_LIST
                        select variants matching list of chromosomal
                        positions. Provide filename of text file
                        formatted with two columns in csv format:
                        1,2000345.
  -g, --group_samples GROUP_SAMPLES GROUP_SAMPLES
                        group samples by common trait using user
                        generated matrix key of traits and sample
                        names following format guidelines in
                        documentation. Provide file name of .csv file
  --select_samples SELECT_SAMPLES
                        select samples to include in visualization by
                        providing tab delimited list of sample names
                        (eg. samplenames.txt). Works for heatmap
                        visualizations and numeric array generation
                        only (not average dp plots)
  -m, --heatmap HEATMAP
                        genotype field to visualize (eg. genotype,
                        read_depth, or 'genotype,read_depth' to
                        visualize each separately) (default:
                        "genotype,read_depth")
  -y, --y_axis_labels Y_AXIS_LABELS
                        specify whether to label y-axis with all
                        chromosome positions (options = positions /
                        chromosome) separators. Defaults to chromosome
                        separators. (default: "chromosomes")
  -x, --x_axis_labels   flag to specify whether to label x-axis with
                        sample ids from vcf file. Defaults to FALSE.
  -n, --num_array       flag to save numeric array of categorical
                        genotype values or read depth values before
                        heatmap plotting. Must be used with --heatmap
                        set.
  -t, --heatmap_title HEATMAP_TITLE
                        Specify filename for heatmap with underscores
                        for spaces.
  --avg_dp AVG_DP       visualize average read depths as line chart.
                        Options: average sample read depth, average
                        variant read depth, or both. eg. =sample,
                        =variant, =sample,variant
  --save_remotely       Save html support files online rather than
                        locally so files can be shared between
                        systems. Files saved in this way require
                        internet access to open.
  -h, --help            show this help message and exit

Thank you for using VIVA. Please submit any bugs to
https://github.com/compbiocore/VariantVisualization.jl/issues

Some components did not install correctly in my environment.

Usage

Specify the VCF file.

viva -f filename.vcf -s html -o output_dir
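When the same invocation has to be repeated with different filters, it can help to assemble the command programmatically. The Python sketch below only builds and prints the command line; it does not run VIVA. The flags (`-f`, `-o`, `-s`, `-r`, `-p`, `-l`) and the two-column `chromosome,position` list format are taken from the help output, while the helper names, the file names, and the example positions are hypothetical.

```python
import os
import shlex
import tempfile


def write_positions_list(path, positions):
    """Write (chromosome, position) pairs as two-column CSV, e.g. 1,2000345."""
    with open(path, "w") as handle:
        for chrom, pos in positions:
            handle.write(f"{chrom},{pos}\n")


def build_viva_command(vcf, output_dir="output", save_format="html",
                       genomic_range=None, pass_filter=False,
                       positions_list=None):
    """Return the viva argument list; flags follow the help text."""
    command = ["viva", "-f", vcf, "-o", output_dir, "-s", save_format]
    if genomic_range:
        command += ["-r", genomic_range]
    if pass_filter:
        command.append("-p")
    if positions_list:
        command += ["-l", positions_list]
    return command


# Build (but do not execute) a command with a range and the PASS filter.
command = build_viva_command(
    "filename.vcf",
    output_dir="output_dir",
    genomic_range="chr4:20000000-30000000",
    pass_filter=True,
)
print(" ".join(shlex.quote(argument) for argument in command))

# Write a hypothetical positions list to a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    list_path = os.path.join(workdir, "positions.csv")
    write_positions_list(list_path, [(1, 2000345), (4, 20000100)])
    with open(list_path) as handle:
        positions_text = handle.read()
print(positions_text, end="")
```

Keeping the arguments as a list avoids quoting problems if the command is later passed to `subprocess.run`.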

I will add more here once the installation succeeds.

Citation
VIVA (VIsualization of VAriants): A VCF File Visualization Tool
G. A. Tollefson, J. Schuster, F. Gelin, A. Agudelo, A. Ragavendran, I. Restrepo, P. Stey, J. Padbury, and A. Uzun

Sci Rep. 2019; 9: 12648. Published online 2019 Sep 2.