（ヒト、マウス、ラット）RNA seqの前処理からリードカウントまで行うスケーラブルなパイプライン SPEAQeasy

2021 1/25 わかりくい説明を修正、タイトル修正

　RNAシーケンシング（RNA-seq）は、一般的で広く普及している生物学的アッセイであり、それによって生成されるデータの量は増加している。実際には、生のRNA-seqリードから発現変動遺伝子などの直接的に価値のある情報を得る前に、研究者が実行しなければならない個々のステップが多数存在する。既存のソフトウェアツールは一般的に特化したもので、より大きなワークフローの中の1つのステップ（例えば、リファレンスゲノムへのリードのアラインメントなど）を実行するだけである。より包括的で再現性の高いワークフローへの要求は、多くの一般に公開されているRNA-seqパイプラインの生産につながっている。しかし、そのほとんどは、セットアップや複数のユーザー間での共有に計算の専門知識を必要としたり、積極的にメンテナンスされていなかったり、著者ら自身の分析においても重要であると思われる機能が欠けていたりすることがわかってきた。これらの懸念に対応するために、Scalable Pipeline for Expression Analysis and Quantification (SPEAQeasy)を開発した。これはインストールや共有が簡単で、R/Bioconductorのダウンストリーム解析ソリューションへの橋渡しをしてくれる。SPEAQeasyはユーザーフレンドリーで、生物学者や臨床医がRNA-seqデータ処理を行う際の計算領域への参入障壁を低くする。SPEAQeasyは計算フレームワーク（SGE, SLURM, ローカル, docker integration）間で移植可能であり、異なる設定ファイルが提供されている。

Setup

テストラン

Workflow overview.　Githubより

インストール

依存

SPEAQeasy requires that the following be installed:

Java 8 or later
Python 3 (tested with 3.7.3), with pip

Github

環境に合わせて複数の導入方法が提供されている。ここではローカルマシンで推奨されているdocker imageを利用したインストール手順を選んだ。

git clone https://github.com/LieberInstitute/SPEAQeasy.git
cd SPEAQeasy/
bash install_software.sh "docker"

ローカルマシンのメインスクリプトはrun_pipeline_local.shになる。ローカルマシンでdockerイメージを使ってランする時は、オプション"-profile"の"-profile localとなっているところ"を"-profile docker_local"に変更する。

f:id:kazumaxneo:20210124170018p:plain

（-profile docker_localに変更した）

ローカルマシンでのdockerを使った実行方法

１、テストラン

ローカルマシンのメインスクリプトはbashのrun_pipeline_local.shになる。これを指定する。docker実行権がないならsudoで実行。

bash run_pipeline_local.sh

実際にbash内で実行されているのは以下のnextflow main.nfコマンドである。

ORIG_DIR=/home/kazu/Document/SPEAQeasy

export _JAVA_OPTIONS="-Xms5g -Xmx7g"

$ORIG_DIR/Software/nextflow main.nf \
--small_test \
--sample "single" \
--reference "hg38" \
--strand "unstranded" \
--annotation "$ORIG_DIR/Annotation" \
-with-report execution_reports/pipeline_report.html \
-profile docker_local \
2>&1 | tee -a SPEAQeasy_output.log

--small_test Uses sample files located in the test folder as input. Overrides the “–input” option.
--sample “single” or “paired”: the orientation of your reads
--strand “unstranded”, “forward”, or “reverse”: the strandness of your reads. Since strandness is inferred by sample in the pipeline, this option informs the pipeline to generate appropriate warnings if unexpected strandness is inferred.
--reference “hg38”, “hg19”, “mm10”, or “rn6”: the reference genome to which reads are aligned
--annotation The path to the directory containing pipeline annotations. Defaults to “./Annotation” (relative to the repository). If annotations are not found here, the pipeline includes a step to build them.
-with-report [filename] Include this to produce an html report with execution details (such as memory usage, completion details, and much more)
--input The path to the directory with the “samples.manifest” file. Defaults to “./input” (relative to the repository)

Full Command Referenceについてはこちらを参照。

テストランにはxeon E5 platinumのシングルノードで1時間ほどかかった。

出力

f:id:kazumaxneo:20210124180443p:plain

リードカウントの他、バリアントコールのVCFも出力される。出力内容については論文のTable S4が詳しい。

ラン後、すぐに発現変動遺伝子群の解析を始めることができる。iDEPなどと組み合わせれば、ほとんどコマンドをたたく必要もなくなる。

実際のラン

" --small_test"はテスト用のオプションで、実際に解析に使用する場合はこのフラグを消して"--input"を立て、fastqのディレクトリを指定する。

テストディレクトリを確認してみる。

test/human/paired/forward/

f:id:kazumaxneo:20210124171850p:plain

認識されるのはディレクトリ中のsamples.manifestファイルになる（このファイルに記載されたfastqのパスと構成が読み取られる）samples.manifestファイルに記載する内容については論文のTable.1に説明があり、タブ区切りファイルで、１列目：forward fastq のパス、２列目：forward fastq のmd5、３列目：reverse fastq のパス、４列目：reverse fastq のmd5、５列目：sample IDを記載する。

リファレンスゲノムは--referenceと--annotation で指定する。デフォルトのコードは --annotation "$ORIG_DIR/Annotation"となっており、SPEAQeasy/Annotation/reference/hg38/を指定している。

SPEAQeasy/Annotation以下は以下のようなディレクトリ構成となっている。

f:id:kazumaxneo:20210125191216p:plain

hg38/の中にassemblyディレクトリとtranscriptsディレクトリがあり、そのそれぞれのディレクトリにgenomeとtranscriptsのfastaファイルを収めるfaディレクトリおよびindexを収めるfa or transcriptsディレクトリがある。assemblyディレクトリの方はindex/で、その中にhisat2のindexを置く。transcriptsディレクトリはkallisto/で、その中にkallistoのindexを置く。

--reference "xxx"のオプションで対応しているのはhg38, hg19, mm10, rn6となる。指定したリファレンスのファイルは自動でダウンロードされるので、ユーザーが用意する必要はない。

発現変動遺伝子の解析の実際の流れについてもチュートリアルがあります。

Bootcamp (R markdownも用意されている)

引用

SPEAQeasy: a Scalable Pipeline for Expression Analysis and Quantification for R/Bioconductor-powered RNA-seq analyses

Nicholas J. Eagles, Emily E. Burke, Jacob Leonard, Brianna K. Barry, Joshua M. Stolz, Louise Huuki, BaDoi N. Phan, Violeta Larios Serrato, Everardo Gutiérrez-Millán, Israel Aguilar-Ordoñez, Andrew E. Jaffe, Leonardo Collado-Torres

bioRxiv, Posted December 11, 2020