Superheat: an R package for visualizing complex data with heatmaps

2021-08-11: added notes on loading data

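Advances in technology have made it possible to collect enormous amounts of data in science and beyond. Because traditional data visualization tools do not work well in high-dimensional settings, extracting useful information from such massive datasets remains an ongoing challenge. Among existing visualization techniques, the heatmap is particularly well suited to large datasets. Heatmaps are very popular in fields such as bioinformatics, but they are still underused in modern data analysis. This paper introduces superheat, a new R package that provides an extremely flexible and customizable platform for visualizing complex datasets. Superheat produces attractive, extendable heatmaps to which the user can add a response variable as a scatterplot, model results as box plots, and correlation information as bar plots. The paper has two goals: (1) to demonstrate the potential of the heatmap as a core visualization method for many types of data, and (2) to highlight the customizability and ease of implementation of the superheat R package for creating beautiful and extendable heatmaps. The capabilities and basic applicability of the superheat package are explored through three reproducible case studies based on publicly available data sources.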

Vignette

https://rlbarter.github.io/superheat/

Installation

GitHub

install.packages("devtools")
library("devtools")
devtools::install_github("rlbarter/superheat")

Usage

library(superheat)

Load the data. To read a table that has been copied to the clipboard:

# macOS
x <- read.table(pipe("pbpaste"), header = TRUE)

# Windows
x <- read.table("clipboard", header = TRUE)

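Alternatively, read the data from a file.

x <- read.table("input.tsv", header = TRUE, sep = "\t")

# If the file is not in the current working directory, give the full path. For a comma-separated file, use sep = ","
x <- read.table("/home/data/input.csv", header = TRUE, sep = ",")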
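Heatmap cells need numeric values, so a label column in the input file is best turned into row names rather than left as data. The following is a minimal sketch under that assumption; the file contents and the names gene, s1, s2, s3 are hypothetical and only serve to make the example self-contained.

```r
# Write a small tab-separated example to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("gene\ts1\ts2\ts3",
             "geneA\t1.2\t3.4\t5.6",
             "geneB\t2.1\t0.4\t7.8"), tmp)

# row.names = 1 turns the first column into row names,
# leaving only the numeric columns as data
x <- read.table(tmp, header = TRUE, sep = "\t", row.names = 1)

# Check that every remaining column is numeric before plotting
stopifnot(all(sapply(x, is.numeric)))
str(x)
```

The resulting data frame can then be passed to superheat() in place of mtcars in the examples that follow.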

The examples below use mtcars. This heatmap uses the default settings apart from the label sizes.

superheat(mtcars,
 # change the size of the labels
 left.label.size = 0.4, bottom.label.size = 0.1)

To control the order of the rows or columns, pass an ordering vector to the order.rows and order.cols arguments.

superheat(mtcars,
 # order the rows by miles per gallon
 order.rows = order(mtcars$mpg),
 # scale the matrix columns
 scale = TRUE)
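It may help to see what the two arguments in this call work with. The sketch below uses only base R: order() produces the index vector that order.rows receives, and scale() is shown as an assumed stand-in for what scale = TRUE does to the columns (centering and scaling each one); the package's internal implementation is not reproduced here.

```r
# order() returns the row indices that sort mpg from lowest to highest
row_order <- order(mtcars$mpg)
head(rownames(mtcars)[row_order], 3)

# Sort from highest to lowest instead
row_order_desc <- order(mtcars$mpg, decreasing = TRUE)

# Column scaling with base R: each column gets mean 0 and standard deviation 1
# (assumed to correspond to scale = TRUE)
scaled <- scale(mtcars)
round(colMeans(scaled), 10)
```

Passing row_order_desc to order.rows would reverse the row order of the heatmap relative to the call above.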

Switch the color palette to one of the other built-in schemes. For red, specify heat.col.scheme = "red".

superheat(mtcars,
 scale = TRUE,
 # change the color
 heat.col.scheme = "red")

The other built-in schemes are selected the same way, for example heat.col.scheme = "green" or heat.col.scheme = "blue".

Specify the colors directly.

superheat(mtcars,
 scale = TRUE,
 heat.pal = c("#b35806", "white", "#542788"))
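To preview how a gradient through these three colors looks before plotting, base R can interpolate them. This is only an illustration using grDevices::colorRampPalette, not superheat's own interpolation code, and the choice of nine steps is arbitrary.

```r
# Interpolate the three anchor colors into a 9-step gradient
anchors <- c("#b35806", "white", "#542788")
pal <- grDevices::colorRampPalette(anchors)(9)
pal

# The first, middle and last entries are the anchor colors themselves;
# the entries in between are blends of neighbouring anchors
pal[c(1, 5, 9)]
```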

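The colors transition in proportions determined by the number of colors in the user-specified palette. To force the transitions at specific positions, use the heat.pal.values argument (see the manual for details).

The minimum and maximum of the color scale can be set with the heat.lim argument.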

superheat(mtcars,
 scale = TRUE,
 heat.lim = c(-1, 2))

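Values outside the specified range are treated as missing and shown in gray.

Missing values in general, not only out-of-range values, are shown in gray. To change the missing-value color to white, add heat.na.col = "white".

To show values outside heat.lim in the saturated end colors instead of as NA, specify extreme.values.na = FALSE.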

superheat(mtcars,
 scale = T,
 heat.lim = c(-1, 2),
 extreme.values.na = FALSE)
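The difference between the two treatments of out-of-range values can be reproduced on the data itself. The sketch below is a base R illustration of the described behaviour, not the package's code; it assumes the heatmap values are the column-scaled mtcars values.

```r
z <- scale(mtcars)          # assumed to match the values plotted with scale = TRUE
lim <- c(-1, 2)

# Default behaviour: out-of-range values become missing (drawn in gray)
z_na <- z
z_na[z < lim[1] | z > lim[2]] <- NA

# extreme.values.na = FALSE: out-of-range values are clamped to the limits,
# so they take the color at the end of the scale
z_clamped <- pmin(pmax(z, lim[1]), lim[2])

sum(is.na(z_na))            # number of cells that would be gray
range(z_clamped)            # stays within lim
```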

For hierarchical clustering, specify pretty.order.rows = TRUE and pretty.order.cols = TRUE. To display the dendrogram, add row.dendrogram = TRUE.

superheat(mtcars,
 # reorder rows and columns by hierarchical clustering
 pretty.order.rows = TRUE, pretty.order.cols = TRUE,
 # scale the matrix columns
 scale = T,
 row.dendrogram = TRUE)
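A comparable ordering can be computed by hand when more control over the clustering is wanted. The sketch below assumes Euclidean distance and complete linkage (the defaults of dist() and hclust()); the settings superheat uses internally may differ, so the resulting order is not guaranteed to match the plot above.

```r
# Hierarchical clustering of rows and columns of the scaled data
z <- scale(mtcars)
row_hc <- hclust(dist(z))       # cluster the cars
col_hc <- hclust(dist(t(z)))    # cluster the variables

# Leaf orders, in the form that order.rows / order.cols accept
row_order <- row_hc$order
col_order <- col_hc$order

# Missing values can lead to NA distances, which hclust() rejects,
# so incomplete rows can be dropped beforehand if necessary
z_complete <- z[complete.cases(z), , drop = FALSE]
```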

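(If the matrix contains missing values, this may raise an error.)

Rows and columns can also be grouped into a predetermined number of clusters. For example, n.clusters.rows = 3 splits the rows into three groups with k-means. If left.label = 'variable' is omitted, the row labels become the cluster numbers.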

set.seed(2016113)
superheat(mtcars,
 scale = T,
 n.clusters.rows = 3,
 left.label = 'variable')
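To see which cars end up in which group, k-means can be run directly. This is a sketch with stats::kmeans on the scaled data; it assumes that is comparable to what n.clusters.rows does, and the cluster numbering need not match the plot even with the same seed.

```r
set.seed(2016113)               # k-means starts from random centers
z <- scale(mtcars)
km <- kmeans(z, centers = 3)

# Cluster label (1, 2 or 3) for every car
table(km$cluster)

# Cars that fell into cluster 1
rownames(mtcars)[km$cluster == 1]
```

A vector such as km$cluster has one entry per row, which is the shape the membership.rows argument takes.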

Adding membership.rows = mtcars$gear groups the rows by number of gears instead of by k-means.

Add a title. With title.alignment = "left", the title is left-aligned.

superheat(mtcars,
 scale = T,
 title = "Superheat for mtcars",  title.size = 8, title.alignment = "left")

Add row and column titles as well.

superheat(mtcars,
 scale = T,
 # row title
 row.title = "Cars", row.title.size = 6,
 # col title
 column.title = "Variables", column.title.size = 6)

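To add a scatterplot adjacent to the heatmap, use yt (next to the columns) and yr (next to the rows). yr and yt must have a length equal either to the number of rows/columns (as in the example below) or to the number of row/column clusters (scatterplots, bar plots, and box plots only).

Here dplyr::select from the dplyr package removes the mpg column from the data shown in the heatmap (dplyr::select(mtcars, -mpg)), and the removed mpg column is drawn as a scatterplot next to the rows.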

superheat(dplyr::select(mtcars, -mpg), 
 scale = T,
 # add mpg as a scatterplot next to the rows
 yr = mtcars$mpg, yr.axis.name = "miles per gallon", yr.point.size = 2)
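The length requirement on yr and yt is easy to check before plotting. The sketch below uses base R only; the column subsetting is a base R equivalent of the dplyr::select call, and using the correlation of each variable with mpg as yt is just an illustrative choice.

```r
# Base R equivalent of dplyr::select(mtcars, -mpg)
X <- mtcars[, setdiff(names(mtcars), "mpg")]
yr <- mtcars$mpg

# yr needs one value per heatmap row
stopifnot(length(yr) == nrow(X))

# For a plot above the columns, yt needs one value per column,
# for example the correlation of each variable with mpg
yt <- cor(X, yr)[, 1]
stopifnot(length(yt) == ncol(X))
round(yt, 2)
```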

To add a line plot next to the heatmap, use yr.plot.type = "line".

superheat(dplyr::select(mtcars, -mpg), 
 scale = T,
 # add mpg as a line plot next to the rows
 yr = mtcars$mpg, yr.axis.name = "miles per gallon", yr.plot.type = "line",
 yr.line.col = "springgreen4",  yr.line.size = 1,
 # order the rows by mpg
 order.rows = order(mtcars$mpg))

Change it to a bar plot.

superheat(dplyr::select(mtcars, -mpg), 
 scale = T,
 # add mpg as a bar plot next to the rows
 yr = mtcars$mpg, yr.axis.name = "miles per gallon", yr.plot.type = "bar",
 yr.line.col = "springgreen4",  yr.line.size = 1,
 # order the rows by mpg
 order.rows = order(mtcars$mpg))

For examples of clustering by the data shown in the adjacent scatterplot or line plot, see the vignette.

Addendum: box plots require the rows to be clustered.

superheat(dplyr::select(mtcars, -mpg), 
 scale = T,
 
 # cluster the rows
 membership.rows = paste(mtcars$gear, "gears"),
 left.label = "variable",
 
 # add mpg as a boxplot next to the rows
 yr = mtcars$mpg,
 yr.axis.name = "miles per gallon",
 yr.plot.type = "boxplot",

 # change box color
 yr.cluster.col = c("plum4", "paleturquoise4", "salmon3"),
 # order the rows by mpg
 order.rows = order(mtcars$mpg))

To add text to the cells, use the X.text argument.

superheat(X = mtcars, # heatmap matrix
 scale = T,
 # add text matrix
 X.text = round(as.matrix(mtcars), 1), X.text.size = 4, X.text.col = "white")
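Labelling every cell can get crowded, so one option is to label only the more extreme cells. The sketch below builds such a text matrix in base R; the 1.5 standard deviation cutoff is arbitrary, and it assumes that empty strings in X.text are simply drawn as blank labels.

```r
# Start from the rounded values, as in the call above
txt <- round(as.matrix(mtcars), 1)

# Keep labels only for cells more than 1.5 SD from their column mean
z <- scale(mtcars)
txt_sparse <- matrix("", nrow = nrow(txt), ncol = ncol(txt),
                     dimnames = dimnames(txt))
txt_sparse[abs(z) > 1.5] <- txt[abs(z) > 1.5]

# The text matrix must match the heatmap matrix cell for cell
stopifnot(identical(dim(txt_sparse), dim(as.matrix(mtcars))))
```

txt_sparse could then be passed as X.text in place of round(as.matrix(mtcars), 1).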

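The vignette also covers a smoothing feature (the smooth.heat argument) that summarizes the values within each cluster by their median, which keeps very large matrices from becoming unreadable when plotted. It also explains how to extract the clusters generated by the superheat function. It is worth a visit.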

Citation
Superheat: An R package for creating beautiful and extendable heatmaps for visualizing complex data

Rebecca L Barter, Bin Yu

J Comput Graph Stat. 2018;27(4):910-922

References