RNA seqのシミュレータ polyester - macでインフォマティクス

　RNA-seq実験は遺伝子発現を研究する手段としてますます普及が進んでいる。RNA-seqデータ（Oshlack et al、2010）の発現解析のための様々な統計的手法がある。 RNA-seqの統計的方法論の開発者は、ツールが正しく機能しているかどうかをテストする必要がある。真の遺伝子発現レベルおよび集団間の発現差異は通常不明であり、スパイクイン実験は時間と経費の両方でコストがかかるため、実際のデータセットで精度試験を行うことはできない。

　代わりに、研究者は計算シミュレーションを使用して、既知の信号とノイズ構造を持つデータセットを作成することがよくある。典型的には、発現変動分析ツールを評価するために使用されるシミュレートの発現は、一般的な発現変動分析ツール（AndersおよびHuber、2010; Robinson et al、2010）において使用される統計モデルから遺伝子カウントとして生成される。しかし、これらのシミュレートされたシナリオは、シーケンスリードまたはリードカウントのようなRNA-seqデータ分析の上流工程中に生じる発現の変動性を説明していない。polyesterは、RNA-seqのリードをシミュレートするための新しいRパッケージである。polyesterの主な利点は、ユーザーが、遺伝子またはアイソフォームの特定の発現変動シグナルを用いてシークエンシングリードをシミュレートできることである。これにより、ユーザーはRNA-seqパイプラインの複数の観点で変動の原因を調べることができる。

　シーケンシングリードを生成する既存のRNA-seqシミュレータは、実際のレプリケート実験および指定された発現変動シグナルを用いた実験のために設計されていない。たとえば、RSEM（Li and Dewey、2011）に同梱されているrsem-simulate-readsユーティリティでは、実際のシークエンシングリードをアライメントさせてモデルを開発するため、シミュレーションに時間がかかる。FluxSimulator（Griebel et al、2012）もRNA-Seqの有効性を評価するためのBenchmarkerであるBEERS（Grant et al、2011）も、発現変動を誘導する組み込みメカニズムを持っていない。これらのシミュレーターは、レプリケート間の生物的変動や特定の転写物の正確な発現レベルを特定するための方法も提供しない。 TuxSimは、微分式（Trapnell et al、2013）を用いてRNA-seqデータセットをシミュレートするために使用されているが、公に利用可能ではない。

　著者らは、レプリケートおよび明確に定義された発現変動でのRNA-seqのリードをシミュレートするためにpolyesterを作製した。ユーザーは、少数の遺伝子または単一の染色体から小さな実験を簡単にシミュレートできる。これによって、計算集約的なステップを実行する必要がある場合に、シミュレーションの計算時間を短縮することができる。

インストール

source("http://bioconductor.org/biocLite.R") 
biocLite("polyester")

ラン

シミュレートにはトランスクリプトのFASTAファイルが必要である。chr20のtranscriptsが用意されている（Users/user_name/Library/R/3.4/library/polyester/extdata/）。これを使ってみる。

ライブラリの読み込み。

library(polyester) 
library(Biostrings)

transcriptsのFASTAファイルの読み込み。

fasta_file = system.file('extdata', 'chr22.fa', package='polyester') #パッケーージに入っているファイルを探すsystem.file関数
fasta = readDNAStringSet(fasta_file) #BiostringsのreadDNAStringSet

> head(fasta,3)

A DNAStringSet instance of length 3

width seq names

[1] 725 GCCACGTGAAGGATGTGTTTGCT...TTAAATTTGATTGAACAATTGTA gi|424037187|ref|...

[2] 746 GCCACGTGAAGGATGTGTTTGCT...TAAAAAAAAAAAAAAAAAAAAAA gi|424037186|ref|...

[3] 2037 GGTAGACGCGATCTGCTGGCTAC...GAACATGATGGGGGGCATGCGTC gi|209977002|ref|...

テストデータは900以上のtranscriptsがあるが、先頭20配列だけ使う。

small_fasta = fasta[1:20]

#ファイル名chr22_small.faで書き出す。このあと使う。
writeXStringSet(small_fasta, 'chr22_small.fa')

全transcripts x20のFPKMとする（20xカバレッジ = transcriptsサイズ / リード長 x 20）

readspertx = round(20 * width(small_fasta) / 100)

matrix関数で行列を20行作成させる。これが次のステップのsimulate_experiment関数に与える各transcriptsのfold_changesの値になる。

fold_changes = matrix(c(4,4,rep(1,18),1,1,4,4,rep(1,16)), nrow=20) #rep(1,18)は1を18回繰り返し
head(fold_changes,20)

> head(fold_changes,20)

[,1] [,2]

[1,] 4 1

[2,] 4 1

[3,] 1 4

[4,] 1 4

[5,] 1 1

[6,] 1 1

[7,] 1 1

[8,] 1 1

[9,] 1 1

[10,] 1 1

[11,] 1 1

[12,] 1 1

[13,] 1 1

[14,] 1 1

[15,] 1 1

[16,] 1 1

[17,] 1 1

[18,] 1 1

[19,] 1 1

[20,] 1 1

20行だと、１列目は4,4,rep(1,18)の20文字。２列目は1,1,4,4,rep(1,16)の20文字。

負の二項分布でシミュレートするsimulate_experiment関数でリードを発生させる（リンク）。サンプル１もサンプル２も3のレプリケートデータを発生させるc(3,3)。

simulate_experiment('chr22_small.fa', reads_per_transcript=readspertx, num_reps=c(3,3), fold_changes=fold_changes, outdir='simulated_reads', paired=FALSE)

reads_per_transcript baseline mean number of reads to simulate from each transcript. Can be an integer, in which case this many reads are simulated from each transcript, or an integer vector whose length matches the number of transcripts in fasta. (Default 300)
fold_changes Vector of multiplicative fold changes between groups, one entry per transcript in fasta. A fold change > 1 means the transcript is overexpressed in the first num_reps (or num_reps[1]) samples. Fold change < 1 means transcript is overexpressed in the last num_reps (or num_reps[2]) samples. The change is in the mean number of reads generated from the transcript, between groups.
num_reps How many biological replicates should be in each group? If num_reps is a single integer, num_reps replicates will be simulated in each group. Otherwise, num_reps can be a length-2 vector, where num_reps[1] and num_reps[2] replicates will be simulated in each of the two groups.
paired If TRUE, paired-end reads are simulated; else single-end reads are simulated.

出力フォルダ。

f:id:kazumaxneo:20180408205902j:plain

ペアエンドならsample_01_1.fasta、sample_01_2.fastaなどの名前になる。

リード数を直接指定する例は、Bioconductorのオンラインマニュアルを確認してください。オンラインマニュアルには、実際の発現データの行列（FPKMまたはRPKM）およびアノテーションを取り込んで、与えられたデータと同様の発現レベルおよび発現変動をシミュレートする関数simulate_experiment_empiricalについても説明されています。

引用

Polyester: simulating RNA-seq datasets with differential transcript expression

Alyssa C. Frazee, Andrew E. Jaffe, Ben Langmead, and Jeffrey T. Leek.

Bioinformatics. 2015 Sep 1; 31(17): 2778–2784.