構造多型部位のマッピング状況を出力する samplot

2020 9/26 Preprint引用、condaによるインストールコマンド、help追記

2021 5/27 論文追記

構造変異（SV）検出において、視覚的な検証は偽陽性を排除するために不可欠なステップである。著者らは、ショートリード、ロングリード、フェーズドリードを含む、複数のサンプルやシーケンス技術にまたがるSVを判定するために必要なリードのデプスと配列のアラインメントを表示するイメージを迅速に作成するためのツールであるSamplotを紹介する。これらのシンプルなイメージは、大規模なSVコールセットをキュレーションするための迅速なレビューに役立つ。Samplotは、疾患研究における原因となる可能性のある変異体の優先順位付け、遺伝的変異のファミリーベースの解析、またはde novo SVレビューなど、多くの生物学的問題に簡単に適用できる。また、Samplotには、人間がレビューしなくても擬陽性の数を劇的に減少させる訓練された機械学習パッケージが含まれている。Samplotは、condaパッケージマネージャまたはhttps://github.com/ryanlayer/samplot から入手できる。

samplotはbamやcramを入力として、SVの起こった領域の図を出力してくれるツール。vcfからの一括描画にも対応しているため、variant call format（VCF）を出力したら、そのままsamplotに送るようなスクリプトを書くことで、推定SV全てを目視で簡単に確認できるようになる。

Samplot: from @jon_belyeu, Chowdhury, @ryanlayer and co, for visualizing read alignment at suspected structural variants, to aid manual curation of SVs. Human input is a good way of resolving false positives from automated SV callers. https://t.co/HIb6AqCq5m pic.twitter.com/bwbNVW43f8
— Genome Biology (@GenomeBiology) 2021年5月25日

インストール

condaでpython3.7の仮想環境を作ってテストした（OSはubuntu18.04）。

依存

numpy
matplotlib
pylab
pysam
statistics
bcftools

本体 Github

https://github.com/ryanlayer/samplot

#2020 9/26 追記
#bioconda (link)
conda crteate -n samplot python=3.7
conda install -c bioconda samplot -y

> samplot -h

$ samplot -h

usage: samplot [-h] [-v] {plot,vcf} ...

optional arguments:

-h, --help show this help message and exit

-v, --version Installed version

[sub-commands]:

{plot,vcf}

plot Plot an image of a genome region from CRAM/SAM alignments,

optimized for structural variant call review

vcf Generates commands to plot images with `samplot plot`, using

VCF file to define regions

> samplot plot -h

$ samplot plot -h

usage: samplot plot [-h] [-n TITLES [TITLES ...]] [-r REFERENCE] [-z Z] -b

BAMS [BAMS ...] [-o OUTPUT_FILE] [--output_dir OUTPUT_DIR]

-s START -e END -c CHROM [-w WINDOW] [-d MAX_DEPTH]

[-t SV_TYPE] [-T TRANSCRIPT_FILE]

[--transcript_filename TRANSCRIPT_FILENAME]

[--max_coverage_points MAX_COVERAGE_POINTS]

[-A ANNOTATION_FILES [ANNOTATION_FILES ...]]

[--annotation_filenames ANNOTATION_FILENAMES [ANNOTATION_FILENAMES ...]]

[--coverage_tracktype {stack,superimpose,none}] [-a]

[-H PLOT_HEIGHT] [-W PLOT_WIDTH] [-q INCLUDE_MQUAL]

[--separate_mqual SEPARATE_MQUAL] [-j]

[--start_ci START_CI] [--end_ci END_CI]

[--long_read LONG_READ] [--ignore_hp]

[--min_event_size MIN_EVENT_SIZE]

[--xaxis_label_fontsize XAXIS_LABEL_FONTSIZE]

[--yaxis_label_fontsize YAXIS_LABEL_FONTSIZE]

[--legend_fontsize LEGEND_FONTSIZE]

[--annotation_fontsize ANNOTATION_FONTSIZE]

[--common_insert_size] [--hide_annotation_labels]

[--coverage_only] [--same_yaxis_scales]

[--marker_size MARKER_SIZE] [--dpi DPI] [--zoom ZOOM]

[--debug DEBUG]

optional arguments:

-h, --help show this help message and exit

-n TITLES [TITLES ...], --titles TITLES [TITLES ...]

Space-delimited list of plot titles. Use quote marks

to include spaces (i.e. "plot 1" "plot 2")

-r REFERENCE, --reference REFERENCE

Reference file for CRAM, required if CRAM files used

-z Z, --z Z Number of stdevs from the mean (default 4)

-b BAMS [BAMS ...], --bams BAMS [BAMS ...]

Space-delimited list of BAM/CRAM file names

-o OUTPUT_FILE, --output_file OUTPUT_FILE

Output file name/type. Defaults to

{type}_{chrom}_{start}_{end}.png

--output_dir OUTPUT_DIR

Output directory name. Defaults to working dir.

Ignored if --output_file is set

-s START, --start START

Start position of region/variant (add multiple for

translocation/BND events)

-e END, --end END End position of region/variant (add multiple for

translocation/BND events)

-c CHROM, --chrom CHROM

Chromosome (add multiple for translocation/BND events)

-w WINDOW, --window WINDOW

Window size (count of bases to include in view),

default(0.5 * len)

-d MAX_DEPTH, --max_depth MAX_DEPTH

Max number of normal pairs to plot

-t SV_TYPE, --sv_type SV_TYPE

SV type. If omitted, plot is created without variant

bar

-T TRANSCRIPT_FILE, --transcript_file TRANSCRIPT_FILE

GFF3 of transcripts

--transcript_filename TRANSCRIPT_FILENAME

Name for transcript track

--max_coverage_points MAX_COVERAGE_POINTS

number of points to plot in coverage axis (downsampled

from region size for speed)

-A ANNOTATION_FILES [ANNOTATION_FILES ...], --annotation_files ANNOTATION_FILES [ANNOTATION_FILES ...]

Space-delimited list of bed.gz tabixed files of

annotations (such as repeats, mappability, etc.)

--annotation_filenames ANNOTATION_FILENAMES [ANNOTATION_FILENAMES ...]

Space-delimited list of names for the tracks in

--annotation_files

--coverage_tracktype {stack,superimpose,none}

type of track to use for low MAPQ coverage plot.

-a, --print_args Print commandline arguments

-H PLOT_HEIGHT, --plot_height PLOT_HEIGHT

Plot height

-W PLOT_WIDTH, --plot_width PLOT_WIDTH

Plot width

-q INCLUDE_MQUAL, --include_mqual INCLUDE_MQUAL

Min mapping quality of reads to be included in plot

(default 1)

--separate_mqual SEPARATE_MQUAL

coverage from reads with MAPQ <= separate_mqual

plotted in lighter grey. To disable, pass in negative

value

-j, --json_only Create only the json file, not the image plot

--start_ci START_CI confidence intervals of SV first breakpoint (distance

from the breakpoint). Must be a comma-separated pair

of ints (i.e. 20,40)

--end_ci END_CI confidence intervals of SV end breakpoint (distance

from the breakpoint). Must be a comma-separated pair

of ints (i.e. 20,40)

--long_read LONG_READ

Min length of a read to be treated as a long-read

(default 1000)

--ignore_hp Choose to ignore HP tag in alignment files

--min_event_size MIN_EVENT_SIZE

Min size of an event in long-read CIGAR to include

(default 20)

--xaxis_label_fontsize XAXIS_LABEL_FONTSIZE

Font size for X-axis labels (default 6)

--yaxis_label_fontsize YAXIS_LABEL_FONTSIZE

Font size for Y-axis labels (default 6)

--legend_fontsize LEGEND_FONTSIZE

Font size for legend labels (default 6)

--annotation_fontsize ANNOTATION_FONTSIZE

Font size for annotation labels (default 6)

--common_insert_size Set common insert size for all plots

--hide_annotation_labels

Hide the label (fourth column text) from annotation

files, useful for regions with many annotations

--coverage_only Hide all reads and show only coverage

--same_yaxis_scales Set the scales of the Y axes to the max of all

--marker_size MARKER_SIZE

Size of marks on pairs and splits (default 3)

--dpi DPI Dots per inches (pixel count, default 300)

--zoom ZOOM Only show +- zoom amount around breakpoints, much

faster for large regions. Ignored if region smaller

than --zoom (default 500000)

--debug DEBUG Print debug statements

> samplot vcf -h

$ samplot vcf -h

usage: samplot vcf [-h] [--vcf VCF] [-d OUT_DIR] [--ped PED] [--dn_only]

[--min_call_rate MIN_CALL_RATE] [--filter FILTER]

[-O {png,pdf,eps,jpg}] [--max_hets MAX_HETS]

[--min_entries MIN_ENTRIES] [--max_entries MAX_ENTRIES]

[--max_mb MAX_MB] [--min_bp MIN_BP]

[--important_regions IMPORTANT_REGIONS] -b BAMS [BAMS ...]

[--sample_ids SAMPLE_IDS [SAMPLE_IDS ...]]

[--command_file COMMAND_FILE] [--format FORMAT] [--gff GFF]

[--downsample DOWNSAMPLE] [--manual_run] [--debug]

optional arguments:

-h, --help show this help message and exit

--vcf VCF, -v VCF VCF file containing structural variants (default:

None)

-d OUT_DIR, --out-dir OUT_DIR

path to write output PNGs (default: samplot-out)

--ped PED path ped (or .fam) file (default: None)

--dn_only plots only putative de novo variants (PED file

required) (default: False)

--min_call_rate MIN_CALL_RATE

only plot variants with at least this call-rate

(default: 0.95)

--filter FILTER simple filter that samples must meet. Join multiple

filters with '&' and specify --filter multiple times

for 'or' e.g. DHFFC < 0.7 & SVTYPE = 'DEL' (default:

[])

-O {png,pdf,eps,jpg}, --output_type {png,pdf,eps,jpg}

type of output figure (default: png)

--max_hets MAX_HETS only plot variants with at most this many

heterozygotes (default: None)

--min_entries MIN_ENTRIES

try to include homref samples as controls to get this

many samples in plot (default: 6)

--max_entries MAX_ENTRIES

only plot at most this many heterozygotes (default:

10)

--max_mb MAX_MB skip variants longer than this many megabases

(default: None)

--min_bp MIN_BP skip variants shorter than this many bases (default:

20)

--important_regions IMPORTANT_REGIONS

only report variants that overlap regions in this bed

file (default: None)

-b BAMS [BAMS ...], --bams BAMS [BAMS ...]

Space-delimited list of BAM/CRAM file names (default:

None)

--sample_ids SAMPLE_IDS [SAMPLE_IDS ...]

Space-delimited list of sample IDs, must have same

order as BAM/CRAM file names. BAM RG tag required if

this is omitted. (default: None)

--command_file COMMAND_FILE

store commands in this file. (default:

samplot_vcf_cmds.tmp)

--format FORMAT comma separated list of FORMAT fields to include in

sample plot title (default: AS,AP,DHFFC)

--gff GFF genomic regions (.gff with .tbi in same directory)

used when building HTML table and table filters

(default: None)

--downsample DOWNSAMPLE

Number of normal reads/pairs to plot (default: 1)

--manual_run don't auto-run the samplot plot commands (command_file

will be deleted) (default: False)

--debug prints out the reason each skipped variant entry is

skipped (default: False)

ラン

まずサンプルデータのSVを描画してみる。data/に３つのbamと3つのCRAMが入っている（NA12878、NA12889、NA12890の一部の領域（家系はこちらを参照））。このbamのchr4:115928726-115931880のカバレッジを描画する。

samplot/src/samplot.py \ -n NA12878,NA12889,NA12890 \
 -b  samplot/test/data/NA12878_restricted.bam,samplot/test/data/NA12889_restricted.bam,samplot/test/data/NA12890_restricted.bam \
 -o 4_115928726_115931880.png \
 -c chr4 \
 -s 115928726 \
 -e 115931880 \
 -t DEL

-b Bam file names (CSV)
-o Output file name
-c Chromosome range
-s Start range
-e End range
-t SV type

3つのbamを-bフラグの後に","でつないで指定している。

出力の4_115928726_115931880.pngは下のような画像になる。

f:id:kazumaxneo:20180428170437p:plain

カバレッジの分布がプロットされその上に、SVの信号を拾ったペアエンドのリードが線で表示されている。縦軸はインサートサイズも表しており、真ん中のbamはインサートサイズが非常に小さいペアエンド（=> deletionを示唆）がたくさんあるのがわかる。フォントのレンダリングは問題ないが、肝心のグラフのプロットの輪郭が眠いように見える。公式でもぼやけているように見えるので、このような仕様なのかもしれない。"-d 100"をつけると、100だけリードをランダムサンプリングして描画する。全リードを使うより早く実行できる。

vcfファイルの全コールを描画するならsamplot_vcf.shを使う。使うには、bcgtoolsとsamplot.pyのパスを指定する必要がある。

samplot/src/samplot_vcf.sh -B /user/local/bin/bcftools -S samplot/src/samplot.py -v input.vcf input.bam

INFO/SVTYPEが定義されているバリアントのみ描画対象となる。順番にJSONファイルと絵（.png）が出力されてくる。

公式に、遺伝子領域とリピートマスクされた領域をバーで表示する流れが記載されている。ここでもまとめておく。

まず遺伝子のアノテーションをダウンロードしてbedtoolsでポジションソートする。samtoolsのbgzipとbedtools sort（リンク）を使っている。

wget ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.gff3.gz


bedtools sort -i Homo_sapiens.GRCh37.82.gff3.gz | bgzip -c > Homo_sapiens.GRCh37.82.sort.gff3.gz
tabix Homo_sapiens.GRCh37.82.sort.gff3.gz

ゲノムのアノテーションとrepeatmaskerのリピート領域テキストファイルをダウンロードして、それぞれbed.gzに変換する。

wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig
bigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bed
bgzip wgEncodeDukeMapabilityUniqueness35bp.bed
tabix wgEncodeDukeMapabilityUniqueness35bp.bed.gz

#repeatmasker track
curl http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz \
 | bgzip -d -c \
 | cut -f 6,7,8,13 \
 | bedtools sort -i stdin \
 | bgzip -c > rmsk.bed.gz

-Tで遺伝子のnogff3を指定し、-Aでリピートtrackとbedを指定する。

samplot/src/samplot.py \
 -n NA12878,NA12889,NA12890 \
 -b samplot/test/data/NA12878_restricted.bam,samplot/test/data/NA12889_restricted.bam,samplot/test/data/NA12890_restricted.bam \
 -o 4_115928726_115931880.d100.genes_reps_map.png \
 -c chr4 \ 
 -s 115928726 \
 -e 115931880 \ 
 -t DEL \
 -d 100 \
 -T Homo_sapiens.GRCh37.82.sort.gff3.gz \ 
 -A rmsk.bed.gz,wgEncodeDukeMapabilityUniqueness35bp.bed.gz

出力

f:id:kazumaxneo:20180428160020j:plain

一番下に遺伝子、その上にリピート領域としてマスクされた部位が示されている。

こちらのツールでも用いられています。

https://github.com/jbelyeu/SV-plaudit

BAMだけ出なく圧縮率の高いCRAMにも対応しているのは個人的にも助かります。機械学習パッケージについては査読前論文とレポジトリのREADMEを読んでください。

引用

GitHub - ryanlayer/samplot: Plot structural variant signals from many BAMs and CRAMs

追記

Samplot: A Platform for Structural Variant Visual Validation and Automated Filtering

Jonathan R Belyeu, Murad Chowdhury, Joseph Brown, Brent S Pedersen, Michael J Cormier, Aaron Quinlan, Ryan M Layer

bioRxiv, Posted September 25, 2020

2021 5/27

Samplot: a platform for structural variant visual validation and automated filtering

Jonathan R. Belyeu, Murad Chowdhury, Joseph Brown, Brent S. Pedersen, Michael J. Cormier, Aaron R. Quinlan & Ryan M. Layer
Genome Biology volume 22, Article number: 161 (2021)

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

構造多型部位のマッピング状況を出力する samplot