発現変動遺伝子を同定するTrinityのrun_DE_analysis.plスクリプト

Trinityに付属するスクリプトrun_DE_analysis.plを使うと、BioconductorのRパッケージを使って発現変動遺伝子群を同定して分析することができる。Trinityのabundance_estimates_to_matrix.plなどを使って得た発現行列ファイルを使う。

手順はTrinityのマニュアルに従っている。

インストール

ubuntu18.04でtrinityの仮想環境を作ってテストした。Rのバージョンは4.1.1。

依存

R
edgeR, limma, DESeq2, ctc, Biobase, gplots, ape, argparse

Github

mamba create -n trinity python=3.8
conda activate trinity
mamba install -c bioconda -y trinity

DEG解析に使うパッケージは前もって導入しておく
install.packages("BiocManager")
library(BiocManager)
BiocManager::install(c("edgeR", "limma", "DESeq2", "ctc", "Biobase", "gplots", "ape", "argparse"))

> run_DE_analysis.pl

$ run_DE_analysis.pl

#################################################################################################

# Required:

# --matrix|m <string> matrix of raw read counts (not normalized!)

# --method <string> edgeR|DESeq2|voom

# note: you should have biological replicates.

# edgeR will support having no bio replicates with

# a fixed dispersion setting.

# Optional:

# --samples_file|s <string> tab-delimited text file indicating biological replicate relationships.

# ex.

# cond_A cond_A_rep1

# cond_A cond_A_rep2

# cond_B cond_B_rep1

# cond_B cond_B_rep2

# General options:

# --min_reps_min_cpm <string> default: 2,1 (format: 'min_reps,min_cpm')

# At least min count of replicates must have cpm values > min cpm value.

# (ie. filtMatrix = matrix[rowSums(cpm(matrix)> min_cpm) >= min_reps, ] adapted from edgeR manual)

# Note, ** if no --samples_file, default for min_reps is set = 1 **

# --output|o name of directory to place outputs (default: $method.$pid.dir)

# --reference_sample <string> name of a sample to which all other samples should be compared.

# (default is doing all pairwise-comparisons among samples)

# --contrasts <string> file (tab-delimited) containing the pairs of sample comparisons to perform.

# ex.

# cond_A cond_B

# cond_Y cond_Z

###############################################################################################

# ## EdgeR-related parameters

# ## (no biological replicates)

# --dispersion <float> edgeR dispersion value (Read edgeR manual to guide your value choice)

# http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

###############################################################################################

# Documentation and manuals for various DE methods. Please read for more advanced and more

# fine-tuned DE analysis than provided by this helper script.

# edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

# DESeq2: http://bioconductor.org/packages/release/bioc/html/DESeq2.html

# voom/limma: http://bioconductor.org/packages/release/bioc/html/limma.html

###############################################################################################

> Glimma.Trinity.Rscript-h

Warning message:

package ‘argparse’ was built under R version 4.1.2

usage: Glimma.Trinity.Rscript

[-h] --samples_file SAMPLES_FILE --DE_results DE_RESULTS

--counts_matrix COUNTS_MATRIX

optional arguments:

-h, --help show this help message and exit

--samples_file SAMPLES_FILE

samples file

--DE_results DE_RESULTS

DE_results file

--counts_matrix COUNTS_MATRIX

matrix of raw counts

> analyze_diff_expr.pl

####################################################################################

# Required:

# --matrix|m <string> TMM.EXPR.matrix

# Optional:

# -P <float> p-value cutoff for FDR (default: 0.001)

# -C <float> min abs(log2(a/b)) fold change (default: 2 (meaning 2^(2) or 4-fold).

# --output <float> prefix for output file (default: "diffExpr.P${Pvalue}_C${C})

# Misc:

# --samples|s <string> sample-to-replicate mappings (provided to run_DE_analysis.pl)

# --max_DE_genes_per_comparison <int> extract only up to the top number of DE features within each pairwise comparison.

# This is useful when you have massive numbers of DE features but still want to make

# useful heatmaps and other plots with more manageable numbers of data points.

# --order_columns_by_samples_file instead of clustering samples or replicates hierarchically based on gene expression patterns,

# order columns according to order in the --samples file.

# --max_genes_clust <int> default: 10000 (if more than that, heatmaps are not generated, since too time consuming)

# --examine_GO_enrichment run GO enrichment analysis

# --GO_annots <string> GO annotations file

# --gene_lengths <string> lengths of genes file

# --include_GOplot optional: will generate inputs to GOplot and attempt to make a preliminary pdf plot/report for it.

##############################################################

> define_clusters_by_cutting_tree.pl

###################################################################################

# -K <int> define K clusters via k-means algorithm

# or, cut the hierarchical tree:

# --Ktree <int> cut tree into K clusters

# --Ptree <float> cut tree based on this percent of max(height) of tree

# -R <string> the filename for the store RData (file.all.RData)

# misc:

# --lexical_column_ordering reorder column names according to lexical ordering

# --no_column_reordering

###################################################################################

> trinityrnaseq/Analysis/DifferentialExpression/rename_matrix_feature_identifiers.pl

###############################################

# Usage: trinityrnaseq/Analysis/DifferentialExpression/rename_matrix_feature_identifiers.pl matrix.txt new_feature_id_mapping.txt

# The 'new_feature_id_mapping.txt' file has the format:

# current_identifier <tab> new_identifier

# ....

# Only those entries with new names listed will be updated, the rest stay unchanged.

#################################################

実行方法

１、全サンプルの全リードを結合し、単一のトリニティアセンブリを生成する（各サンプルごとにアセンブルすると比較が困難になる）。

２、昨日紹介した方法でGeneまたはTranscriptsの発現行列のファイルを得る（salmonが一番高速）。（このあと、３以降の前に6をランするとアノテーションを付けられる）

３、run_DE_analysis.plのラン

run_DE_analysis.plのランには、サンプルとの関係を示したリストファイルが必要。そのファイルを"--samples_file <list>"で指定する。run_DE_analysis.plはリストファイルからサンプルのreplicatesを判断してRNAseqのデータを比較する。

リストファイルは、サンプルグループ名<tab>反復名のタブ区切りファイル。

cond_A cond_A_rep1
cond_A cond_A_rep2
cond_B cond_B_rep1
cond_B cond_B_rep2

このリストファイルを指定する。ここではDEseq2を使う。このコマンドで各遺伝子のlog fold change、P値、FDRが計算される。

run_DE_analysis.pl --matrix counts.matrix --samples_file list --method DESeq2 --dispersion 0.1

--method <string> edgeR|DESeq2|voom
note: you should have biological replicates.edgeR will support having no bio replicates with a fixed dispersion setting.
--dispersion <float> edgeR dispersion value (Read edgeR manual to guide your value choice)
--samples_file <string> tab-delimited text file indicating biological replicate relationships.

デフォルトでは、各サンプルグループ間の比較が実行される。ペアごとの比較を制限したい場合は、--contrastsパラメータに実行する比較のリストを指定する（マニュアル参照）。

出力例

f:id:kazumaxneo:20211226192201p:plain

group_A_vs_group_B.voom.DE_results

f:id:kazumaxneo:20211226212956p:plain

geneID, logFC, logCPM, PValue, FDRの5列。このファイルには全遺伝子表示されている。

group_A_vs_group_B.edgeR.DE_results.MA_n_Volcano.pdf

f:id:kazumaxneo:20211226213652p:plain

f:id:kazumaxneo:20211226213704p:plain

FDR <0.05が赤になっている。ここでは正しい複製をつかっていないのでFDR <0.05を満たす遺伝子はほぼない。

４、analyze_diff_expr.plのラン

有意なFDRとfold-changesを持つ転写産物を抽出し、サンプル間の発現差のパターンに従って転写物をクラスタリングする。analyze_diff_expr.plをランには、まずDEseq2|edgeR|limmaの出力ディレクトリ内に入り、以下のように実行する。DEseq2|edgeR|limmaの出力ディレクトリ内にある~.count_matrixファイルを指定する。

cd DEseq2/
analyze_diff_expr.pl --matrix group_A_vs_group_B.DESeq2.count_matrix -P 1e-3 -C 2

P値が最大1e-3、少なくとも2^2倍の発現変動がある遺伝子が抽出される。

出力例

f:id:kazumaxneo:20211226214718p:plain

A_vs_group_B.DESeq2.DE_results.P1e-3_C2.group_A-UP.subsetはグループAで誘導された遺伝子、A_vs_group_B.DESeq2.DE_results.P1e-3_C2.group_B-UP.subsetはグループBで誘導された遺伝子。他にもサンプル間のペアワイズ相関行列ヒートマップが出力される。

Trinotateアノテーションがあれば、４のランと同時にGO enrichmet解析も実行できる（別記事の一番下）。

５、define_clusters_by_cutting_tree.plのラン

より本格的に遺伝子クラスターを定義するために、ヒートマップで示されたDEGsを、類似の発現パターンを持つ遺伝子クラスターに分割する。ランに必要なデータはすべてRセッションで、ローカルに 'all.RData' というファイルとして保存されているので、これをdefine_clusters_by_cutting_tree.plのコマンドで指定する。階層的にクラスタ化された遺伝子ツリーを ”--Ptree %”の高さで切断する。もしくはK-means clustering を用いて。(-Kパラメータを使用)、K 個の遺伝子クラスターセットを定義する。

define_clusters_by_cutting_tree.pl -R diffExpr.P1e-3_C2.matrix.RData --Ptree 60

-R <string> the filename for the store RData (file.all.RData)
--Ptree <float> cut tree based on this percent of max(height) of tree
--Ktree <int> cut tree into K clusters

ランが終わると、DEseq2|edgeR|limmaの出力ディレクトリ内にさらにディレクトリができ、その中にファイルが保存される。

出力例

f:id:kazumaxneo:20211226215248p:plain

各クラスタの発現行列（log2変換、中央値中心）と下の図が保存される。

my_cluster_plots.pdf

f:id:kazumaxneo:20211226215301p:plain

各遺伝子は、そのクラスタの平均発現プロファイル（青）に加えて（灰）プロットされる。ここでのテストには正しくない複製を使っており、灰色の線（DEGs）の数は極端に少ない。

６、Trinotateの機能的アノテーションの追加

Trinotate（Trinityの開発者Brian Haasが開発・管理）をランしてトランスクリプトームの機能的アノテーションを付けた場合、機能アノテーション情報をDEGsのマトリックスに追加してGO enrichment解析を実行したりできる。

まずTrinotate.xls のアノテーションファイルをアノテーションマッピングファイルに変換する。

git clone https://github.com/Trinotate/Trinotate.git
perl Trinotate/util/Trinotate_get_feature_name_encoding_attributes.pl Trinotate.xls > Trinotate_report.xls.name_mappings

Trinotate.xlsが複数あるなら複数回ランして出力をcatで結合する。

Trinotate_report.xls.name_mappings (マニュアル)

f:id:kazumaxneo:20211226223623p:plain

次に、これらの新しい機能アノテーションを発現行列に組み込む。DEGsのリストでも元の発現行列でも組み込める。

perl trinityrnaseq/Analysis/DifferentialExpression/rename_matrix_feature_identifiers.pl \
Trinity_trans.TMM.EXPR.matrix Trinotate_report.xls.name_mappings  > Trinity_trans.TMM.EXPR.annotated.matrix

転写産物（または遺伝子）識別子と一緒に簡単なアノテーション文字列が付けられる。これは、下流のすべての発現解析に反映される。例えば、ヒートマップやMA plotなどで視覚化した時にも表示されるので、前もってランしておいてもよいかもしれない。

Trinityのマニュアルでは、転写産物または遺伝子識別子に機能的な注釈をつけたあと、TM4 MeVアプリケーションに読み込んで、解析する例が書かれています。TM4 MeVアプリケーションは、マイクロアレイやRNA-Seqデータから得られる発現データをナビゲートするためのデスクトップアプリケーションです。使い方についても解説されています（リンク）。

引用

De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis

Brian J Haas, Alexie Papanicolaou, Moran Yassour, Manfred Grabherr, Philip D Blood, Joshua Bowden, Matthew Brian Couger, David Eccles, Bo Li, Matthias Lieber, Matthew D MacManes, Michael Ott, Joshua Orvis, Nathalie Pochet, Francesco Strozzi, Nathan Weeks, Rick Westerman, Thomas William, Colin N Dewey, Robert Henschel, Richard D LeDuc, Nir Friedman , Aviv Regev

Nat Protoc. 2013 Aug;8(8):1494-512