Trinity's filter_low_expr_transcripts.pl script: selecting the dominant isoforms from a transcriptome assembly

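The filter_low_expr_transcripts.pl script bundled with Trinity takes the output of align_and_estimate_abundance.pl as input (an expression matrix obtained by aligning the RNA-Seq reads to the Trinity transcripts with the Bowtie aligner and then estimating abundance probabilistically with RSEM) and selects the assembled sequences of the dominant, highly expressed isoforms. Low-expression isoforms are often not merely unnecessary but an outright hindrance in de novo transcriptome analysis, so the script is a useful one. Only the basic usage is covered here.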
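To make the idea of "keeping the dominant isoform" concrete, here is a minimal sketch in shell and awk. It is not the script's actual code. It assumes a single-sample, two-column table (transcript ID and TPM) with a header line, and Trinity-style IDs in which the gene ID is the transcript ID minus the trailing _i<N>. The function name top_isoform_per_gene is hypothetical.

```shell
# Print the ID of the highest-TPM isoform for each gene.
# Input: tab-separated file with a header line, then transcript_id<TAB>TPM.
# Assumption: Trinity-style IDs, gene ID = transcript ID without the trailing _i<N>.
top_isoform_per_gene() {
    awk -F'\t' 'NR > 1 {
        gene = $1
        sub(/_i[0-9]+$/, "", gene)          # derive the gene ID from the transcript ID
        if (!(gene in best) || $2 + 0 > best[gene]) {
            best[gene] = $2 + 0             # highest TPM seen so far for this gene
            id[gene] = $1                   # isoform that carries it
        }
    }
    END { for (g in id) print id[g] }' "$1" | sort
}

# Example (the file name is a placeholder):
# top_isoform_per_gene isoforms.TPM.tsv
```

On ties the first isoform encountered is kept; how the real script breaks ties or combines several samples is not covered by this sketch.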

Installation

Tested in a conda virtual environment on Ubuntu 18.04 LTS.

GitHub

conda create -n trinity python=3.8
conda activate trinity
conda install -c bioconda -y trinity

# RSEM is also needed
conda install -c bioconda rsem

#or clone repository
git clone https://github.com/trinityrnaseq/trinityrnaseq.git
cd trinityrnaseq/util/

> filter_low_expr_transcripts.pl

##########################################################################################
#
#  --matrix|m <string>            expression matrix (TPM or FPKM, *not* raw counts)
#  --transcripts|t <string>       transcripts fasta file (eg. Trinity.fasta)
#
#  # expression level filter:
#     --min_expr_any <float>      minimum expression level required across any sample (default: 0)
#
#  # Isoform-level filtering
#     --min_pct_dom_iso <int>     minimum percent of dominant isoform expression (default: 0)
#         or
#     --highest_iso_only          only retain the most highly expressed isoform per gene (default: off)
#                                 (mutually exclusive with --min_pct_dom_iso param)
#
#  # requires gene-to-transcript mappings
#     --trinity_mode              targets are Trinity-assembled transcripts
#         or
#     --gene_to_trans_map <string>   file containing gene-to-transcript mappings
#                                    (format is: gene(tab)transcript )
#
#########################################################################################
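The --min_pct_dom_iso option is the softer alternative to --highest_iso_only. A rough sketch of how such a threshold could work is given below, under my own reading of the help text: an isoform is kept when its expression is at least the given percentage of the dominant isoform of the same gene. The function name filter_by_pct_dom_iso, the two-column single-sample input, and the handling of genes with zero expression are all assumptions, not the script's actual behaviour.

```shell
# Print isoform IDs whose TPM is at least MIN_PCT percent of the
# dominant (highest-TPM) isoform of the same gene.
# Usage: filter_by_pct_dom_iso <two-column TSV with header> <MIN_PCT>
# Assumption: Trinity-style IDs, gene ID = transcript ID without the trailing _i<N>.
filter_by_pct_dom_iso() {
    awk -F'\t' -v min_pct="$2" '
    NR > 1 {
        gene = $1
        sub(/_i[0-9]+$/, "", gene)
        tpm[$1] = $2 + 0
        gene_of[$1] = gene
        # track the dominant isoform expression per gene
        if (!(gene in dom) || tpm[$1] > dom[gene]) dom[gene] = tpm[$1]
    }
    END {
        for (t in tpm) {
            d = dom[gene_of[t]]
            # genes whose dominant isoform has zero expression are dropped in this sketch
            if (d > 0 && 100 * tpm[t] / d >= min_pct) print t
        }
    }' "$1" | sort
}

# Example (the file name is a placeholder):
# filter_by_pct_dom_iso isoforms.TPM.tsv 20
```

With a threshold of 100, this sketch behaves much like the highest-isoform-only selection, apart from ties.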

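Usage

Pass the expression file obtained with align_and_estimate_abundance.pl via --matrix. As the help text states, the values must be TPM or FPKM, not raw read counts.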

filter_low_expr_transcripts.pl --transcripts Trinity.fasta \
 --highest_iso_only \
 --trinity_mode \
 --matrix RSEM.genes.results \
 > output.fasta

--matrix|m <string> expression matrix (TPM or FPKM, *not* raw counts)
--highest_iso_only only retain the most highly expressed isoform per gene (default: off)

For a Trinity assembly, --gene_to_trans_map is not needed, since --trinity_mode takes its place.
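Because --matrix expects TPM or FPKM values keyed by transcript, it can help to build that table explicitly. The following is a minimal sketch for a single sample, assuming RSEM wrote an isoform-level result file with the usual transcript_id and TPM header columns. The function name make_tpm_matrix and the file names are hypothetical, and whether the resulting two-column table is accepted as-is should be checked against your Trinity version. For several samples, Trinity's own abundance_estimates_to_matrix.pl is the usual route.

```shell
# Write a two-column table (transcript_id, TPM) from an RSEM isoform-level result file.
# Columns are looked up by header name rather than by position.
make_tpm_matrix() {
    awk -F'\t' -v OFS='\t' '
    NR == 1 {
        for (i = 1; i <= NF; i++) col[$i] = i
        # stop if the expected header columns are absent
        if (!("transcript_id" in col) || !("TPM" in col)) exit 1
    }
    { print $(col["transcript_id"]), $(col["TPM"]) }' "$1"
}

# Example (file names are placeholders):
# make_tpm_matrix RSEM.isoforms.results > isoforms.TPM.tsv
```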

When I tried this with --highest_iso_only on plant RNA-seq data, the number of contigs dropped from 91,400 to 31,300, and the number of duplicated BUSCOs fell from 2,548 (55.4%) to 61 (1.3%).

Citation
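A quick way to check the reduction in contig number on your own data is to count the FASTA header lines before and after filtering. This is a minimal sketch; the function name count_contigs is hypothetical, and the file names are the ones used in the command above.

```shell
# Count the sequences in a FASTA file by counting header lines (those starting with ">").
count_contigs() {
    grep -c '^>' "$1"
}

# Example:
# count_contigs Trinity.fasta    # before filtering
# count_contigs output.fasta     # after filtering
```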

Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data

Manfred G. Grabherr, Brian J. Haas, Moran Yassour, Joshua Z. Levin, Dawn A. Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica di Palma, Bruce W. Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman, Aviv Regev

Nat Biotechnol. 2011 Jul; 29(7): 644–652

References