Trinityのabundance_estimates_to_matrix.plスクリプトで発現行列を出力、filter_low_expr_transcripts.plスクリプトで低発現転写産物をフィルタリングする

Trinityに付属するスクリプトabundance_estimates_to_matrix.plは、align_and_estimate_abundance.plの出力を入力として、複数サンプルを（正規化しつつ）統合した発現行列ファイルを生成するスクリプト。Trinityのマニュアルに習い、使い方を確認しておく。

インストール

ubuntu18.04LTSでcondaの仮想環境を作ってテストした。

Github

mamba create -n trinity python=3.8
conda activate trinity
mamba install -c bioconda -y trinity

#salmon
mamba install -c bioconda -y salmon

> abundance_estimates_to_matrix.pl -h

$ abundance_estimates_to_matrix.pl -h

####################################################################################

# Usage: /home/kazu/mambaforge/envs/trinity/bin/abundance_estimates_to_matrix.pl

--est_method <method> sample1.results sample2.results ...

# or /home/kazu/mambaforge/envs/trinity/bin/abundance_estimates_to_matrix.pl

--est_method <method> --quant_files file.listing_target_files.txt

# Note, if only a single input file is given, it's expected to

contain the paths to all the target abundance estimation files.

# Required:

# --est_method <string> RSEM|eXpress|kallisto|salmon

(needs to know what format to expect)

# --gene_trans_map <string> the gene-to-transcript mapping

file. (if you don't want gene estimates, indicate 'none'.

# Options:

# --cross_sample_norm <string> TMM|UpperQuartile|none (default: TMM)

# --name_sample_by_basedir name sample column by dirname

instead of filename

# --basedir_index <int> default(-2)

# --out_prefix <string> default: value for --est_method

# --quant_files <string> file containing a list of all

the target files.

######################################################################################

> filter_low_expr_transcripts.pl

##########################################################################################

# --matrix|m <string> expression matrix (TPM or FPKM, *not* raw counts)

# --transcripts|t <string> transcripts fasta file (eg. Trinity.fasta)

# # expression level filter:

# --min_expr_any <float> minimum expression level required across any sample (default: 0)

# # Isoform-level filtering

# --min_pct_dom_iso <int> minimum percent of dominant isoform expression (default: 0)

# or

# --highest_iso_only only retain the most highly expressed isoform per gene (default: off)

# (mutually exclusive with --min_pct_dom_iso param)

# # requires gene-to-transcript mappings

# --trinity_mode targets are Trinity-assembled transcripts

# or

# --gene_to_trans_map <string> file containing gene-to-transcript mappings

# (format is: gene(tab)transcript )

#########################################################################################

実行方法

１、Trinityのアセンブリにfastqを当てて定量する。Trinityのスクリプトalign_and_estimate_abundance.plを使う。

align_and_estimate_abundance.plのランではfastqを指定するが、複数条件かつレプリケートがある時は、fastqのパスを指定せずにfastqとサンプルとの関係を示したリストファイルを作り、そのファイルを"--samples_file <list>"で指定する。こうすると、サンプル関係を考慮して一度に全ての結果を出力してくれる。

リストファイルは、サンプルグループ名<tab>反復名<tab>ペアエンドR1<tab>ペアエンドR2、のタブ区切りファイルで書く（シングルエンドなら４列目は不要）。

cond_A cond_A_rep1 A_rep1_left.fq A_rep1_right.fq
cond_A cond_A_rep2 A_rep2_left.fq A_rep2_right.fq
cond_B cond_B_rep1 B_rep1_left.fq B_rep1_right.fq
cond_B cond_B_rep2 B_rep2_left.fq B_rep2_right.fq

このリストファイルを指定する。strand ppecificなら--SS_lib_typeでペアエンドの向きを指定する（FRとRF）。ここでは高速なsalmonを使う。（salmonのエラーが出た時はsalmonを最新版にアップデートしたら治る可能性がある；mamba remove salmon -y && mamba install -c bioconda -y salmon）。

align_and_estimate_abundance.pl --transcripts Trinity.fasta \
--seqType fq --samples_file list --est_method salmon \
--trinity_mode --prep_reference --thread_count 30 \
--output_dir RSEM_outdir --SS_lib_type FR

サンプルごとに出力される。

f:id:kazumaxneo:20211225180020p:plain

各ディレクトリ内にあるquant.sf（transcript level）かquant.sf.gene（gene level）をステップ２で使う。

２、結果を統合する。各ディレクトリ内にあるquant.sfを指定する。.gene_trans_mapは１のランで出来ているので、それを使う。

abundance_estimates_to_matrix.pl --est_method salmon --gene_trans_map \
Trinity.fasta.gene_trans_map --out_prefix salmon --name_sample_by_basedir \
group_A_rep_1/quant.sf group_A_rep_2/quant.sf \
group_B_rep_1/quant.sf group_B_rep_2/quant.sf

--est_method RSEM|eXpress|kallisto|salmon
--gene_trans_map the gene-to-transcript mapping
--cross_sample_norm TMM|UpperQuartile|none (default: TMM)

出力例

f:id:kazumaxneo:20211225212317p:plain

TMM正規化、TPM正規化、rawのそれぞれのカウントマトリックスが出力される。

３、得られたTPMの発現行列を使い、TPMの発現量分布を調べる。

git clone https://github.com/trinityrnaseq/trinityrnaseq.git
perl /trinityrnaseq/util/misc/count_matrix_features_given_MIN_TPM_threshold.pl \
genes_matrix.TPM.not_cross_norm | tee genes_matrix.TPM.not_cross_norm.counts_by_min_TPM

たくさんのサンプルの中で、少なくともxxxのTPM値で発現している遺伝子がいくつあるか表示される。例えば全てのサンプルでTPM１以下の転写産物を除くなどの判断に使える。

視覚化する。

R
> data = read.table("genes_matrix.TPM.not_cross_norm.counts_by_min_TPM", header=T)
> plot(data, xlim=c(-100,0), ylim=c(0,100000), t='b')

f:id:kazumaxneo:20211225230729p:plain

発現量のNx統計を計算するスクリプトも用意されている（マニュアル）。

４、転写産物をフィルタリングして低発現のものを除外する。ここではTPMが全てのサンプルで1以下を除外。

filter_low_expr_transcripts.pl --matrix Salmon.isoform.TPM.not_cross_norm --transcripts Trinity.fasta --min_expr_any 1 --gene_to_trans_map --trinity_mode

発現量値のみに基づいて転写物をフィルタリングすることは、生物学的に関連する転写物をデータから簡単に破棄することになるので、注意が必要。理想的には、発現値と機能アノテーションデータの組み合わせを考慮したフィルタリングが必要（マニュアルより）。

引用

De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis

Brian J Haas, Alexie Papanicolaou, Moran Yassour, Manfred Grabherr, Philip D Blood, Joshua Bowden, Matthew Brian Couger, David Eccles, Bo Li, Matthias Lieber, Matthew D MacManes, Michael Ott, Joshua Orvis, Nathalie Pochet, Francesco Strozzi, Nathan Weeks, Rick Westerman, Thomas William, Colin N Dewey, Robert Henschel, Richard D LeDuc, Nir Friedman , Aviv Regev

Nat Protoc. 2013 Aug;8(8):1494-512