スプライシングバリエーションを視覚化する sashimi.py

　クロマチンの発現、タンパク質-DNA/RNA相互作用、アクセス性、構造などが条件や細胞種によってどのように異なるかを同時に可視化することにより、オルタナティブスプライシングの制御機構や機能的影響について理解を深めることができる。しかし、既存のSashimiプロット作成ツールは、柔軟性に欠け、複雑で、複数のバイオインフォマティクスフォーマットや様々なゲノミクスアッセイからのデータソースを統合するには使い勝手が悪いままである。そのため、より拡張性のある可視化ツールが必要とされている。ここでは、プログラマブルでインタラクティブなWebベースのアプローチにより、出版品質の可視化を生成するPythonパッケージであるsashimi.pyを紹介する。Sashimi.pyは、シングルセルRNA-seq、タンパク質-DNA/RNA相互作用、ロングリードシーケンスデータ、Hi-Cデータなどの多種多様なデータソースのゲノムデータを前処理なしで視覚的に解釈するプラットフォームで、主要ジャーナルの要求を満たす出力ファイルのフォーマットにも幅広い柔軟性を備えている。Sashimi.pyパッケージは、Bioconda (https://anaconda.org/bioconda/sashimi-py), Docker, PyPI (https://pypi.org/project/sashimi.py/), GitHub (https://github.com/ygidtu/sashimi.py) で自由に利用できるオープンソースソフトウェアであり、ローカル展開用の組み込みWebサーバーも提供されている。

Features of sashimi.py（Githubより）

Support various file formats as input
Support strand-aware coverage plot
Visualize coverage by heatmap, including HiC diagram
Visualize protein domain based the given gene id
Demultiplex the single-cell RNA/ATAC-seq which used cell barcode into cell population
Support visualizing individual full-length reads in read-by-read style
Support visualize circRNA sequencing data

インストール

mambaでpython3.10の環境を作ってテストした(ubuntu18使用)。

Github

#conda(bioconda)
mmaba create -n sashimi
conda activate sashimi
mamba install -c conda-froge -c bioconda sashimi-py -y

#pip(pypi)
pip install sashimi.py

#docker(dockerhub)
docker pull ygidtu/sashimi
docker run --rm ygidtu/sashimi --help

#installing local web server
git clone https://github.com/ygidtu/sashimi.py sashimi
cd sashimi/web
# build the frontend static files
npm install -g vue-cli vite && npm install
vite build

# prepare the backend server
pip install fastapi pydantic jinja2 uvicorn
python server.py --help

> sashimipy -h

Usage: sashimipy [OPTIONS]

Try 'sashimipy -h' for help.

Error: Missing option '-e' / '--event'.

(sasami) kazu@kazu:~$ sashimipy -h

Usage: sashimipy [OPTIONS]

Welcome to use sashimi

Options:

--version Show the version and exit.

--debug enable debug level log

-e, --event TEXT Event range eg: chr1:100-200:+ [required]

Common input files configuration:

--color-factor INTEGER RANGE Index of column with color levels (1-based);

NOTE: LUAD|red -> LUAD while be labeled in

plots and red while be the fill color

[default: 1; x>=1]

--barcode PATH Path to barcode list file, At list two

columns were required, - 1st The name of bam

file;- 2nd the barcode;- 3rd The group

label, optional;- 4th The color of each

cell type, default using the color of

corresponding bam file.

--barcode-tag TEXT The default cell barcode tag label

[default: CB]

--umi-tag TEXT The default UMI barcode tag label [default:

UB]

-p, --process INTEGER RANGE How many cpu to use [1<=x<=128]

--group-by-cell Group by cell types in density/line plot

--remove-duplicate-umi Drop duplicated UMIs by barcode

Output settings:

-o, --output PATH Path to output graph file

-d, --dpi INTEGER RANGE The resolution of output file [default:

300; x>=1]

--raster The would convert heatmap and site plot to

raster image (speed up rendering and produce

smaller files), only affects pdf, svg and PS

--height FLOAT The height of output file, default adjust

image height by content [default: 1]

--width INTEGER RANGE The width of output file, default adjust

image width by content [default: 10; x>=0]

--backend TEXT Recommended backend [default: Agg]

Reference settings:

-r, --reference PATH Path to gtf file, both transcript and exon

tags are necessary

--interval PATH Path to list of interval files in bed

format, 1st column is path to file, 2nd

column is the label [optional]

--show-id Whether show gene id or gene name

--show-exon-id Whether show gene id or gene name

--no-gene Do not show gene id next to transcript id

--domain Add domain information into reference track

--proxy TEXT The http or https proxy for EBI/Uniprot

requests,if `--domain` is True, eg:

http://127.0.0.1:1080

--timeout INTEGER RANGE The requests timeout when `--domain` is

True. [default: 10; x>=1]

--local-domain TEXT Load local domain folder and load into

reference track, download from https://hgdow

nload.soe.ucsc.edu/gbdb/hg38/uniprot/

--remove-empty Whether to plot empty transcript

--transcripts-to-show TEXT Which transcript to show, transcript name or

id in gtf file, eg: transcript1,transcript2

--choose-primary Whether choose primary transcript to plot.

--ref-color TEXT The color of exons [default: black]

--intron-scale FLOAT The scale of intron [default: 0.5]

--exon-scale FLOAT The scale of exon [default: 1]

Density plot settings:

--density PATH The path to list of input files, a tab

separated text file, - 1st column is path

to input file, - 2nd column is the file

category, - 3rd column is input file alias

(optional), - 4th column is color of input

files (optional), - 5th column is the

library of input file (optional, only

required by bam file).

--customized-junction TEXT Path to junction table column name needs to

be bam name or bam alias.

--only-customized-junction Only used customized junctions.

-t, --threshold INTEGER RANGE

Threshold to filter low abundance junctions

[default: 0; x>=0]

--density-by-strand Whether to draw density plot by strand

--show-site Whether to draw additional site plot

--site-strand [all|+|-] Which strand kept for site plot, default use

all [default: all]

--included-junctions TEXT The junction id for including, chr1:1-100

--show-junction-num Whether to show the number of junctions

--sc-density-height-ratio FLOAT

The relative height of single cell density

plots [default: 1]

Line plot settings:

--line PATH The path to list of input files, a tab

separated text file, - 1st column is path

to input file, - 2nd column is the file

category, - 3rd column is input file group

(optional), - 4th column is input file

alias (optional), - 5th column is color

platte of corresponding group (optional).

--hide-legend Whether to hide legend

--legend-position TEXT The legend position

--legend-ncol INTEGER RANGE The number of columns of legend [x>=0]

Heatmap plot settings:

--heatmap PATH The path to list of input files, a tab

separated text file, - 1st column is path

to input file, - 2nd column is the file

category, - 3rd column is input file group

(optional), - 4th column is color platte

of corresponding group.

--clustering Enable clustering of the heatmap

The clustering method for heatmap [default:

ward]

--distance-metric

The distance metric for heatmap [default:

euclidean]

--heatmap-scale Do scale on heatmap matrix.

--heatmap-vmin INTEGER Minimum value to anchor the colormap,

otherwise they are inferred from the data.

--heatmap-vmax INTEGER Maximum value to anchor the colormap,

otherwise they are inferred from the data.

--show-row-names Show row names of heatmap

--sc-heatmap-height-ratio FLOAT

The relative height of single cell heatmap

plots [default: 0.2]

IGV settings:

--igv PATH The path to list of input files, a tab

separated text file, - 1st column is path

to input file, - 2nd column is the file

category, - 3rd column is input file alias

(optional), - 4th column is color of input

files (optional) - 5th column is exon_id

for sorting the reads (optional).

--m6a TEXT Sashimi.py will load location information

from the given tags and then highlight

the RNA m6a modification cite at individual

reads. If there are multiple m6a

modification site, please add tag as follow,

234423,234450

--polya TEXT Sashimi.py will load length of poly(A) from

the given tags and then visualize the

poly(A) part at end of each individual

reads.

--rs TEXT Sashimi.py will load real strand information

of each reads from the given tags and the

strand information is necessary for

visualizing poly(A) part.

--del-ratio-ignore FLOAT RANGE

Ignore the deletion gap in nanopore or

pacbio reads. if a deletion region was

smaller than (alignment length) *

(del_ratio_ignore), then the deletion gap

will be filled. currently the

del_ratio_ignore was 1.0. [0.0<=x<=1.0]

HiC settings:

--hic PATH The path to list of input files, a tab

separated text file, - 1st column is path

to input file, - 2nd column is the file

category, - 3rd column is input file alias

(optional), - 4th column is color of input

files (optional) - 5th column is data

transform for HiC matrix, eg log1p, log2,

log10 (optional).

Additional annotation:

-f, --genome PATH Path to genome fasta

--sites TEXT Where to plot additional indicator lines,

comma separated int

--stroke TEXT The stroke regions:

start1-end1:start2-end2@color-label, draw a

stroke line at bottom, default color is red

--link TEXT The link: start1-end1:start2-end2@color,

draw a link between two site at bottom,

default color is blue

--focus TEXT The highlight regions: 100-200:300-400

Motif settings:

--motif PATH The path to customized bedGraph file, first

three columns is chrom, start and end site,

the following 4 columns is the weight of

ATCG.

--motif-region TEXT The region of motif to plot in start-end

format

--motif-width FLOAT The width of ATCG characters [default: 0.8]

Layout settings:

--n-y-ticks INTEGER RANGE The number of ticks of y-axis [x>=0]

--distance-ratio FLOAT distance between transcript label and

transcript line [default: 0.1]

--reference-scale FLOAT The size of reference plot in final plot

[default: 0.25]

--stroke-scale FLOAT The size of stroke plot in final image

[default: 0.25]

Overall settings:

--font-size INTEGER RANGE The font size of x, y-axis and so on [x>=1]

--reverse-minus Whether to reverse strand of bam/reference

file

--hide-y-label Whether hide y-axis label

--same-y Whether different sashimi/line plots shared

same y-axis boundaries

--log [0|2|10|zscore] y axis log transformed, 0 -> not log

transform; 2 -> log2; 10 -> log10

--title TEXT Title

--font TEXT Fonts

-h, --help Show this message and exit.

テストラン

sashimipyはBAM、Bed、bigBed、bigWig、Depth file generated by samtools depth
、naive Hi-C formatをサポートしている。Githubでは利用可能な大半のオプションをつけてランした例が掲載されている。

git clone https://github.com/ygidtu/sashimi.py sashimi
cd sashimi/
python main.py \
  -e chr1:1270656-1284730:+ \
  -r example/example.sorted.gtf.gz \
  --interval example/interval_list.tsv \
  --density example/density_list.tsv \
  --show-site \
  --igv example/igv.tsv \
  --heatmap example/heatmap_list.tsv \
  --focus 1272656-1272656:1275656-1277656 \
  --stroke 1275656-1277656:1277856-1278656@blue \
  --sites 1271656,1271656,1272656 \
  --line example/line_list.tsv \
  -o example/example.png \
  --dpi 300 \
  --width 10 \
  --height 1 \
  --barcode example/barcode_list.tsv \
  --domain --remove-duplicate-umi