macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

Biological replicatesの品質を調べるためのTrinityのPtRスクリプト

 

Trinityに付属するスクリプトPtRは、生物学的複製が十分に相関していることを確認し、またサンプル間の関係を調査するためのユーティリティツールである。Trinityのabundance_estimates_to_matrix.plなどを使って得た発現量の行列ファイルを使うTrinityのマニュアルに習い、使い方を確認しておく。

 

 

 

インストール

ubuntu18.04でtrinityの仮想環境を作ってテストした。Rのバージョンは4.1.1。

依存

  • R

Github

mamba create -n trinity python=3.8
conda activate trinity
mamba install -c bioconda -y trinity

> PtR 

#################################################################################### 

#

#######################

# Inputs and Outputs: #

#######################

#

#  --matrix <string>        matrix.RAW.normalized.FPKM

#

#  Optional:

#

#  Sample groupings:

#

#  --samples <string>      tab-delimited text file indicating biological replicate relationships.

#                                   ex.

#                                        cond_A    cond_A_rep1

#                                        cond_A    cond_A_rep2

#                                        cond_B    cond_B_rep1

#                                        cond_B    cond_B_rep2

#

#  --gene_factors <string>   tab-delimited file containing gene-to-factor relationships.

#                               ex.

#                                    liver_enriched <tab> gene1

#                                    heart_enriched <tab> gene2

#                                    ...

#                            (use of this data in plotting is noted for corresponding plotting options)

#

#

#  --output <string>        prefix for output file (default: "${matrix_file}.heatmap")

#

#  --save                   save R session (as .RData file)

#  --no_reuse               do not reuse any existing .RData file on initial loading

#

#####################

#  Plotting Actions #

#####################

#

#  --compare_replicates        provide scatter, MA, QQ, and correlation plots to compare replicates.

#

#   

#

#  --barplot_sum_counts        generate a barplot that sums frag counts per replicate across all samples.

#

#  --boxplot_log2_dist <float>        generate a boxplot showing the log2 dist of counts where counts >= min fpkm

#

#  --sample_cor_matrix         generate a sample correlation matrix plot

#    --sample_cor_scale_limits <string>    ex. "-0.2,0.6"

#    --sample_cor_sum_gene_factor_expr <factor=string>    instead of plotting the correlation value, plot the sum of expr according to gene factor

#                                                         requires --gene_factors 

#

#  --sample_cor_subset_matrix <string>  plot the sample correlation matrix, but create a disjoint set for rows,cols.

#                                       The subset of the samples to provide as the columns is provided as parameter.

#

#  --gene_cor_matrix           generate a gene-level correlation matrix plot

#

#  --indiv_gene_cor <string>   generate a correlation matrix and heatmaps for '--top_cor_gene_count' to specified genes (comma-delimited list)

#      --top_cor_gene_count <int>   (requires '--indiv_gene_cor with gene identifier specified')

#      --min_gene_cor_val <float>   (requires '--indiv_gene_cor with gene identifier specified')

#

#  --heatmap                   genes vs. samples heatmap plot

#      --heatmap_scale_limits "<int,int>"  cap scale intensity to low,high  (ie.  "-5,5")

#      --heatmap_colorscheme <string>  default is 'purple,black,yellow'

#                                      a popular alternative is 'green,black,red'

#                                      Specify a two-color gradient like so: "black,yellow".

#

#     # sample (column) labeling order

#      --lexical_column_ordering        order samples by column name lexical order.

#      --specified_column_ordering <string>  comma-delimited list of column names (must match matrix exactly!)

#      --order_columns_by_samples_file  order the columns in the heatmap according to replicate name ordering in the samples file.

#

#     # gene (row) labeling order

#      --order_by_gene_factor           order the genes by their factor (given --gene_factors)

#

#  --gene_heatmaps <string>    generate heatmaps for just one or more specified genes

#                              Requires a comma-delimited list of gene identifiers.

#                              Plots one heatmap containing all specified genes, then separate heatmaps for each gene.

#                                 if --gene_factors set, will include factor annotations as color panel.

#                                 else if --prin_comp set, will include include principal component color panel.

#

#  --prin_comp <int>           generate principal components, include <int> top components in heatmap  

#      --add_prin_comp_heatmaps <int>  draw heatmaps for the top <int> features at each end of the prin. comp. axis.

#                                      (requires '--prin_comp') 

#      --add_top_loadings_pc_heatmap <int>  draw a heatmap containing the <int> top feature loadings across all PCs.

#      --R_prin_comp_method <string>        options: princomp, prcomp (default: prcomp)

#

#  --mean_vs_sd               expression variability plot. (highlight specific genes by category via --gene_factors )

#

#  --var_vs_count_hist <vartype=string>        create histogram of counts of samples having feature expressed within a given expression bin.

#                                              vartype can be any of 'sd|var|cv|fano'

#      --count_hist_num_bins <int>  number of bins to distribute counts in the histogram (default: 10)

#      --count_hist_max_expr <float>  maximum value for the expression histogram (default: max(data))

#      --count_hist_convert_percentages       convert the histogram counts to percentage values.

#

#

#  --per_gene_plots                   plot each gene as a separate expression plot (barplot or lineplot)

#    --per_gene_plot_width <float>     default: 2.5

#    --per_gene_plot_height <float>    default: 2.5

#    --per_gene_plots_per_row <int>   default: 1

#    --per_gene_plots_per_col <int>   default: 2

#    --per_gene_plots_incl_vioplot    include violin plots to show distribution of rep vals

#

########################################################

#  Data Filtering, in order of operation below:  #########################################################

#

#

## Column filters:

#

#  --restrict_samples <string>   comma-delimited list of samples to restrict to (comma-delim list)

#

#  --top_rows <int>         only include the top number of rows in the matrix, as ordered.

#

#  --min_colSums <float>      min number of fragments, default: 0

#

#  --min_expressed_genes <int>           minimum number of genes (rows) for a column (replicate) having at least '--min_gene_expr_val'

#       --min_gene_expr_val <float>   a gene must be at least this value expressed across all samples.  (default: 1)

#

#

## Row Filters:

#

#  --min_rowSums <float>      min number of fragments, default: 0

#

#  --gene_grep <string>     grep on string to restrict to genes

#

#  --min_across_ALL_samples_gene_expr_val <int>   a gene must have this minimum expression value across ALL samples to be retained.

#

#  --min_across_ANY_samples_gene_expr_val <int>   a gene must have at least this expression value across ANY single sample to be retained.

#

#  --min_gene_prevalence <int>   gene must be found expressed in at least this number of columns

#       --min_gene_expr_val <float>   a gene must be at least this value expressed across all samples.  (default: 1)

#

#  --minValAltNA <float>    minimum cell value after above transformations, otherwise convert to NA

#

#  --top_genes <int>        use only the top number of most highly expressed transcripts

#

#  --top_variable_genes <int>      Restrict to the those genes with highest coeff. of variability across samples (use median of replicates)

#

#      --var_gene_method <string>   method for ranking top variable genes ( 'coeffvar|anova', default: 'anova' )

#           --anova_maxFDR <float>    if anova chose, require FDR value <= anova_maxFDR  (default: 0.05)

#            or

#           --anova_maxP <float>    if set, over-rides anova_maxQ  (default, off, uses --anova_maxQ)

#

#  --top_variable_via_stdev_and_mean_expr    perform filtering based on the stdev vs. mean expression plot.

#      Requires both:               (note, if you used --log2 and/or --Zscale, settings below should use those transformed values)

#         --min_stdev_expr <float>       minimum standard deviation in expression

#         --min_mean_expr  <float>       minimum mean expression value 

#

######################################

#  Data transformations:             #

######################################

#

#  --CPM                    convert to counts per million (uses sum of totals before filtering)

#  --CPK                    convert to counts per thousand

#

#  --binary                 all values > 0 are set to 1.  All values < 0 are set to zero.

#

#  --log2

#

#  --center_rows            subtract row mean from each data point. (only used under '--heatmap' )

#

#  --Zscale_rows            Z-scale the values across the rows (genes)  

#

#########################

#  Clustering methods:  #

#########################

#

#  --gene_dist <string>        Setting used for --heatmap (samples vs. genes)

#                                  Options: euclidean, gene_cor

#                                           maximum, manhattan, canberra, binary, minkowski

#                                  (default: 'euclidean')  Note: if using 'gene_cor', set method using '--gene_cor' below.

#

#

#  --sample_dist <string>      Setting used for --heatmap (samples vs. genes)

#                                  Options: euclidean, sample_cor

#                                           maximum, manhattan, canberra, binary, minkowski

#                                  (default: 'euclidean')  Note: if using 'sample_cor', set method using '--sample_cor' below.

#

#

#  --gene_clust <string>       ward, single, complete, average, mcquitty, median, centroid, none (default: complete)

#  --sample_clust <string>     ward, single, complete, average, mcquitty, median, centroid, none (default: complete)

#

#  --gene_cor <string>             Options: pearson, spearman  (default: pearson)

#  --sample_cor <string>           Options: pearson, spearman  (default: pearson)

#

####################

#  Image settings: #

####################

#

#

#  --imgfmt <string>           image type (pdf,svg) with default: pdf

#

#  --img_width <int>           image width

#  --img_height <int>          image height

#

################

# Misc. params #

################

#

#  --write_intermediate_data_tables         writes out the data table after each transformation.

#

#  --show_pipeline_flowchart                describe order of events and exit.

#

####################################################################################

 

 

実行方法

1、昨日紹介した方法で発現行列のファイルを得る。

 

2、PtRのラン

PtRのランには、サンプルとの関係を示したリストファイルが必要。そのファイルを"--samples <list>"で指定する。PtRはリストファイルからサンプルのreplicatesを判断してRNAseqのデータを比較する。

リストファイルは、サンプルグループ名<tab>反復名のタブ区切りファイル。

cond_A    cond_A_rep1
cond_A    cond_A_rep2
cond_B    cond_B_rep1
cond_B    cond_B_rep2

このリストファイルを指定する。replicates比較なら”--compare_replicates”を付ける。Count Per Milion (CPM) に変換してから比較するなら”--CPM”を付ける。log2で比較するなら ”--log2”を付ける。 

/PtR --matrix counts.matrix --samples list --log2 --CPM \
--min_rowSums 10 --compare_replicates --barplot_sum_counts
  • --compare_replicates    provide scatter, MA, QQ, and correlation plots to compare replicates.
  • --CPM    convert to counts per million (uses sum of totals before filtering)
  • --min_rowSums    min number of fragments, default: 0

出力例

f:id:kazumaxneo:20211226130542p:plain

group_A.rep_compare.pdf(同じ条件のreplicates間での比較)

f:id:kazumaxneo:20211226130402p:plain

注;ここでは複製ではないサンプルを使用

 

salmon.gene.counts.matrix.barplot_sum_counts.pdf

f:id:kazumaxneo:20211226130736p:plain

"--barplot_sum_counts”を付けたときに出力される全サンプルのバープロット)

 

続いて全サンプルを比較するために相関行列のヒートマップを出力。”--compare_replicates”を外し、”--sample_cor_matrix”を指定する。

PtR --matrix counts.matrix --samples list --log2 --CPM \
--min_rowSums 10 --sample_cor_matrix
  • --sample_cor_matrix      generate a sample correlation matrix plot

全サンプルのペアワイズ相関行列のヒートマップが出力される。

 

複製間の関係を探るため主成分分析 (PCA) を行う。

PtR --matrix counts.matrix --samples list --log2 --CPM \
--min_rowSums 10 --center_rows --prin_comp 3
  • --prin_comp <int>  generate principal components, include <int> top components in heatmap

 

他にもヒートマップを描く”--heatmap”などがあります。オプションを確認して下さい。

引用

De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis

Brian J Haas, Alexie Papanicolaou, Moran Yassour, Manfred Grabherr, Philip D Blood, Joshua Bowden, Matthew Brian Couger, David Eccles, Bo Li, Matthias Lieber, Matthew D MacManes, Michael Ott, Joshua Orvis, Nathalie Pochet, Francesco Strozzi, Nathan Weeks, Rick Westerman, Thomas William, Colin N Dewey, Robert Henschel, Richard D LeDuc, Nir Friedman , Aviv Regev

Nat Protoc. 2013 Aug;8(8):1494-512