2021 1/9 title corrected; 1/15 commands and explanations added; 4/27 benchmark paper added
2021 10/8 added a note on the options for gzipped fastq
2024/12/11 note on the strandedness error (*3)
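RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of a sequenced genome, where it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, the read length, and whether reads come from one or both ends of the cDNA fragments.
This study presents RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files, and it can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without a sequenced genome. On simulated and real data sets, RSEM performs better than quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to use ambiguously mapped reads effectively, the authors show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within a single gene may be improved by using paired-end reads, depending on the number of possible splice forms for each gene.
RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. Because it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. It also provides valuable guidance for the cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.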
Installation
#bioconda
mamba create -n rsem -y python=3.8
conda activate rsem
mamba install -c bioconda rsem -y
> rsem-prepare-reference -h
NAME
rsem-prepare-reference - Prepare transcript references for RSEM and
optionally build BOWTIE/BOWTIE2/STAR/HISAT2(transcriptome) indices.
SYNOPSIS
rsem-prepare-reference [options] reference_fasta_file(s) reference_name
ARGUMENTS
reference_fasta_file(s)
Either a comma-separated list of Multi-FASTA formatted files OR a
directory name. If a directory name is specified, RSEM will read all
files with suffix ".fa" or ".fasta" in this directory. The files
should contain either the sequences of transcripts or an entire
genome, depending on whether the '--gtf' option is used.
reference_name
The name of the reference used. RSEM will generate several
reference-related files that are prefixed by this name. This name can
contain path information (e.g. '/ref/mm9').
OPTIONS
--gtf <file>
If this option is on, RSEM assumes that 'reference_fasta_file(s)'
contains the sequence of a genome, and will extract transcript
reference sequences using the gene annotations specified in <file>,
which should be in GTF format.
If this and '--gff3' options are off, RSEM will assume
'reference_fasta_file(s)' contains the reference transcripts. In this
case, RSEM assumes that name of each sequence in the Multi-FASTA files
is its transcript_id.
(Default: off)
--gff3 <file>
The annotation file is in GFF3 format instead of GTF format. RSEM will
first convert it to GTF format with the file name
'reference_name.gtf'. Please make sure that 'reference_name.gtf' does
not exist. (Default: off)
--gff3-RNA-patterns <pattern>
<pattern> is a comma-separated list of transcript categories, e.g.
"mRNA,rRNA". Only transcripts that match the <pattern> will be
extracted. (Default: "mRNA")
--gff3-genes-as-transcripts
This option is designed for untypical organisms, such as viruses,
whose GFF3 files only contain genes. RSEM will assume each gene as a
unique transcript when it converts the GFF3 file into GTF format.
--trusted-sources <sources>
<sources> is a comma-separated list of trusted sources, e.g.
"ENSEMBL,HAVANA". Only transcripts coming from these sources will be
extracted. If this option is off, all sources are accepted. (Default:
off)
--transcript-to-gene-map <file>
Use information from <file> to map from transcript (isoform) ids to
gene ids. Each line of <file> should be of the form:
gene_id transcript_id
with the two fields separated by a tab character.
If you are using a GTF file for the "UCSC Genes" gene set from the
UCSC Genome Browser, then the "knownIsoforms.txt" file (obtained from
the "Downloads" section of the UCSC Genome Browser site) is of this
format.
If this option is off, then the mapping of isoforms to genes depends
on whether the '--gtf' option is specified. If '--gtf' is specified,
then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF
file. Otherwise, RSEM assumes that each sequence in the reference
sequence files is a separate gene.
(Default: off)
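The two-column layout expected by '--transcript-to-gene-map' (gene_id, a tab, then transcript_id) can be sketched with a toy example. This is only an illustration: the file names and the IDs G1/G2/T1/T2/T3 are made up, and the awk parsing assumes GTF attributes written exactly as gene_id "..." and transcript_id "...". When '--gtf' is given, RSEM reads these attributes itself, so a map like this mainly matters for transcript FASTA input.

```shell
# Toy GTF (hypothetical IDs): gene G1 has transcripts T1 and T2, gene G2 has T3.
{
  printf 'chr1\ttoy\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
  printf 'chr1\ttoy\texon\t1\t150\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
  printf 'chr1\ttoy\texon\t201\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
  printf 'chr2\ttoy\texon\t1\t80\t.\t-\t.\tgene_id "G2"; transcript_id "T3";\n'
} > toy_annotation.gtf

# Emit one "gene_id<TAB>transcript_id" line per distinct pair found on exon records.
awk -F '\t' '
  $3 == "exon" {
    g = ""; t = ""
    if (match($9, /gene_id "[^"]+"/))       g = substr($9, RSTART + 9,  RLENGTH - 10)
    if (match($9, /transcript_id "[^"]+"/)) t = substr($9, RSTART + 15, RLENGTH - 16)
    if (g != "" && t != "" && !seen[g "\t" t]++) print g "\t" t
  }
' toy_annotation.gtf > toy_transcript_to_gene_map.txt

cat toy_transcript_to_gene_map.txt
```

The resulting file could then be passed as '--transcript-to-gene-map toy_transcript_to_gene_map.txt'.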
--allele-to-gene-map <file>
Use information from <file> to provide gene_id and transcript_id
information for each allele-specific transcript. Each line of <file>
should be of the form:
gene_id transcript_id allele_id
with the fields separated by a tab character.
This option is designed for quantifying allele-specific expression. It
is only valid if '--gtf' option is not specified. allele_id should be
the sequence names presented in the Multi-FASTA-formatted files.
(Default: off)
--polyA
Add poly(A) tails to the end of all reference isoforms. The length of
poly(A) tail added is specified by '--polyA-length' option. STAR
aligner users may not want to use this option. (Default: do not add
poly(A) tail to any of the isoforms)
--polyA-length <int>
The length of the poly(A) tails to be added. (Default: 125)
--no-polyA-subset <file>
Only meaningful if '--polyA' is specified. Do not add poly(A) tails to
those transcripts listed in <file>. <file> is a file containing a list
of transcript_ids. (Default: off)
--bowtie
Build Bowtie indices. (Default: off)
--bowtie-path <path>
The path to the Bowtie executables. (Default: the path to Bowtie
executables is assumed to be in the user's PATH environment variable)
--bowtie2
Build Bowtie 2 indices. (Default: off)
--bowtie2-path <path>
The path to the Bowtie 2 executables. (Default: the path to Bowtie 2
executables is assumed to be in the user's PATH environment variable)
--star
Build STAR indices. (Default: off)
--star-path <path>
The path to STAR's executable. (Default: the path to STAR executable
is assumed to be in user's PATH environment variable)
--star-sjdboverhang <int>
Length of the genomic sequence around annotated junction. It is only
used for STAR to build splice junctions database and not needed for
Bowtie or Bowtie2. It will be passed as the --sjdbOverhang option to
STAR. According to STAR's manual, its ideal value is
max(ReadLength)-1, e.g. for 2x101 paired-end reads, the ideal value is
101-1=100. In most cases, the default value of 100 will work as well
as the ideal value. (Default: 100)
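The rule of thumb above, max(ReadLength)-1, can be computed from a read file before choosing '--star-sjdboverhang'. The following is a minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record; the file name toy_overhang.fastq and its reads are made up for illustration.

```shell
# Toy FASTQ with reads of length 8 and 10 (hypothetical data).
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' > toy_overhang.fastq

# Line 2 of every 4-line record is the sequence; track the longest one and subtract 1.
overhang=$(awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
                END { print max - 1 }' toy_overhang.fastq)

echo "suggested --star-sjdboverhang: $overhang"
```

As the help text notes, the default of 100 is usually adequate, so this check matters mostly when reads are much shorter or longer than about 101 bases.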
--hisat2-hca
Build HISAT2 indices on the transcriptome according to Human Cell
Atlas (HCA) SMART-Seq2 pipeline. (Default: off)
--hisat2-path <path>
The path to the HISAT2 executables. (Default: the path to HISAT2
executables is assumed to be in the user's PATH environment variable)
-p/--num-threads <int>
Number of threads to use for building STAR's genome indices. (Default:
1)
-q/--quiet
Suppress the output of logging information. (Default: off)
-h/--help
Show help information.
PRIOR-ENHANCED RSEM OPTIONS
--prep-pRSEM
A Boolean indicating whether to prepare reference files for pRSEM,
including building Bowtie indices for a genome and selecting training
set isoforms. The index files will be used for aligning ChIP-seq reads
in prior-enhanced RSEM and the training set isoforms will be used for
learning prior. A path to Bowtie executables and a mappability file in
bigWig format are required when this option is on. Currently, Bowtie2
is not supported for prior-enhanced RSEM. (Default: off)
--mappability-bigwig-file <string>
Full path to a whole-genome mappability file in bigWig format. This
file is required for running prior-enhanced RSEM. It is used for
selecting a training set of isoforms for prior-learning. This file can
be either downloaded from UCSC Genome Browser or generated by GEM
(Derrien et al., 2012, PLoS One). (Default: "")
DESCRIPTION
This program extracts/preprocesses the reference sequences for RSEM and
prior-enhanced RSEM. It can optionally build Bowtie indices (with
'--bowtie' option) and/or Bowtie 2 indices (with '--bowtie2' option) using
their default parameters. It can also optionally build STAR indices (with
'--star' option) using parameters from ENCODE3's STAR-RSEM pipeline. For
prior-enhanced RSEM, it can build Bowtie genomic indices and select
training set isoforms (with options '--prep-pRSEM' and
'--mappability-bigwig-file <string>'). If an alternative aligner is to be
used, indices for that particular aligner can be built from either
'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (see OUTPUT for
details). This program is used in conjunction with the
'rsem-calculate-expression' program.
OUTPUT
This program will generate 'reference_name.grp', 'reference_name.ti',
'reference_name.transcripts.fa', 'reference_name.seq',
'reference_name.chrlist' (if '--gtf' is on), 'reference_name.idx.fa',
'reference_name.n2g.idx.fa', optional Bowtie/Bowtie 2 index files, and
optional STAR index files.
'reference_name.grp', 'reference_name.ti', 'reference_name.seq', and
'reference_name.chrlist' are used by RSEM internally.
'reference_name.transcripts.fa' contains the extracted reference
transcripts in Multi-FASTA format. Poly(A) tails are not added and it may
contain lower case bases in its sequences if the corresponding genomic
regions are soft-masked.
'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' are used by
aligners to build their own indices. In these two files, all sequence
bases are converted into upper case. In addition, poly(A) tails are added
if '--polyA' option is set. The only difference between
'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' is that
'reference_name.n2g.idx.fa' in addition converts all 'N' characters to 'G'
characters. This conversion is in particular desired for aligners (e.g.
Bowtie) that do not allow reads to overlap with 'N' characters in the
reference sequences. Otherwise, 'reference_name.idx.fa' should be used to
build the aligner's index files. RSEM uses 'reference_name.idx.fa' to
build Bowtie 2 indices and 'reference_name.n2g.idx.fa' to build Bowtie
indices. For visualizing the transcript-coordinate-based BAM files
generated by RSEM in IGV, 'reference_name.idx.fa' should be imported as a
"genome" (see Visualization section in README.md for details).
If the whole genome is indexed for prior-enhanced RSEM, all the index
files will be generated with prefix as 'reference_name_prsem'. Selected
isoforms for training set are listed in the file
'reference_name_prsem.training_tr_crd'
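The difference between 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' described above can be reproduced on a toy FASTA. This sketch only mimics the two documented transformations (upper-casing, then N to G) with awk; it is not RSEM's own code, poly(A) tails are not handled, and the file names are made up.

```shell
# Toy transcript FASTA with soft-masked (lower-case) bases and N characters.
printf '>tx1\nacgtNNac\n>tx2\nGGnnTT\n' > toy.transcripts.fa

# Step 1: upper-case every sequence line, leaving header lines untouched.
awk '/^>/ { print; next } { print toupper($0) }' toy.transcripts.fa > toy.idx.fa

# Step 2: additionally replace every N with G, for aligners that reject N in references.
awk '/^>/ { print; next } { gsub(/N/, "G"); print }' toy.idx.fa > toy.n2g.idx.fa

cat toy.idx.fa toy.n2g.idx.fa
```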
EXAMPLES
1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version
of the mouse genome. We have downloaded the UCSC Genes transcript
annotations in GTF format (as mm9.gtf) using the Table Browser and the
knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all
chromosome files for mm9 in the directory '/data/mm9'. We want to put the
generated reference files under '/ref' with name 'mouse_0'. We do not add
any poly(A) tails. Please note that GTF files generated from UCSC's Table
Browser do not contain isoform-gene relationship information. For the UCSC
Genes annotation, this information can be obtained from the
knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie
executables are found in '/sw/bowtie'.
There are two ways to write the command:
rsem-prepare-reference --gtf mm9.gtf \
--transcript-to-gene-map knownIsoforms.txt \
--bowtie \
--bowtie-path /sw/bowtie \
/data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
/ref/mouse_0
OR
rsem-prepare-reference --gtf mm9.gtf \
--transcript-to-gene-map knownIsoforms.txt \
--bowtie \
--bowtie-path /sw/bowtie \
/data/mm9 \
/ref/mouse_0
2) Suppose we also want to build Bowtie 2 indices in the above example and
Bowtie 2 executables are found in '/sw/bowtie2', the command will be:
rsem-prepare-reference --gtf mm9.gtf \
--transcript-to-gene-map knownIsoforms.txt \
--bowtie \
--bowtie-path /sw/bowtie \
--bowtie2 \
--bowtie2-path /sw/bowtie2 \
/data/mm9 \
/ref/mouse_0
3) Suppose we want to build STAR indices in the above example and save
index files under '/ref' with name 'mouse_0'. Assuming STAR executable is
'/sw/STAR', the command will be:
rsem-prepare-reference --gtf mm9.gtf \
--transcript-to-gene-map knownIsoforms.txt \
--star \
--star-path /sw/STAR \
-p 8 \
/data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
/ref/mouse_0
OR
rsem-prepare-reference --gtf mm9.gtf \
--transcript-to-gene-map knownIsoforms.txt \
--star \
--star-path /sw/STAR \
-p 8 \
/data/mm9 \
/ref/mouse_0
STAR genome index files will be saved under '/ref/'.
4) Suppose we want to prepare references for prior-enhanced RSEM in the
above example. In this scenario, both STAR and Bowtie are required to
build genomic indices - STAR for RNA-seq reads and Bowtie for ChIP-seq
reads. Assuming their executables are under '/sw/STAR' and '/sw/Bowtie',
respectively. Also, assuming the mappability file for mouse genome is
'/data/mm9.bigWig'. The command will be:
rsem-prepare-reference --gtf mm9.gtf \
--transcript-to-gene-map knownIsoforms.txt \
--star \
--star-path /sw/STAR \
-p 8 \
--prep-pRSEM \
--bowtie-path /sw/Bowtie \
--mappability-bigwig-file /data/mm9.bigWig \
/data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
/ref/mouse_0
OR
rsem-prepare-reference --gtf mm9.gtf \
--transcript-to-gene-map knownIsoforms.txt \
--star \
--star-path /sw/STAR \
-p 8 \
--prep-pRSEM \
--bowtie-path /sw/Bowtie \
--mappability-bigwig-file /data/mm9.bigWig \
/data/mm9 \
/ref/mouse_0
Both STAR and Bowtie's index files will be saved under '/ref/'. Bowtie
files will have name prefix 'mouse_0_prsem'
5) Suppose we only have transcripts from EST tags stored in 'mm9.fasta'
and isoform-gene information stored in 'mapping.txt'. We want to add 125bp
long poly(A) tails to all transcripts. The reference_name is set as
'mouse_125'. In addition, we do not want to build Bowtie/Bowtie 2 indices,
and will use an alternative aligner to align reads against either
'mouse_125.idx.fa' or 'mouse_125.n2g.idx.fa':
rsem-prepare-reference --transcript-to-gene-map mapping.txt \
--polyA \
mm9.fasta \
mouse_125
> rsem-calculate-expression -h
NAME
rsem-calculate-expression - Estimate gene and isoform expression from
RNA-Seq data.
SYNOPSIS
rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name
rsem-calculate-expression [options] --paired-end upstream_read_file(s) downstream_read_file(s) reference_name sample_name
rsem-calculate-expression [options] --alignments [--paired-end] input reference_name sample_name
ARGUMENTS
upstream_read_file(s)
Comma-separated list of files containing single-end reads or upstream
reads for paired-end data. By default, these files are assumed to be
in FASTQ format. If the --no-qualities option is specified, then FASTA
format is expected.
downstream_read_file(s)
Comma-separated list of files containing downstream reads which are
paired with the upstream reads. By default, these files are assumed to
be in FASTQ format. If the --no-qualities option is specified, then
FASTA format is expected.
input
SAM/BAM/CRAM formatted input file. If "-" is specified for the
filename, the input is instead assumed to come from standard input.
RSEM requires all alignments of the same read group together. For
paired-end reads, RSEM also requires the two mates of any alignment be
adjacent. In addition, RSEM does not allow the SEQ and QUAL fields to
be empty. See Description section for how to make input file obey
RSEM's requirements.
reference_name
The name of the reference used. The user must have run
'rsem-prepare-reference' with this reference_name before running this
program.
sample_name
The name of the sample analyzed. All output files are prefixed by this
name (e.g., sample_name.genes.results)
BASIC OPTIONS
--paired-end
Input reads are paired-end reads. (Default: off)
--no-qualities
Input reads do not contain quality scores. (Default: off)
--strandedness <none|forward|reverse>
This option defines the strandedness of the RNA-Seq reads. It
recognizes three values: 'none', 'forward', and 'reverse'. 'none'
refers to non-strand-specific protocols. 'forward' means all
(upstream) reads are derived from the forward strand. 'reverse' means
all (upstream) reads are derived from the reverse strand. If
'forward'/'reverse' is set, the '--norc'/'--nofw' Bowtie/Bowtie 2
option will also be enabled to avoid aligning reads to the opposite
strand. For Illumina TruSeq Stranded protocols, please use 'reverse'.
(Default: 'none')
-p/--num-threads <int>
Number of threads to use. Both Bowtie/Bowtie2, expression estimation
and 'samtools sort' will use this many threads. (Default: 1)
--alignments
Input file contains alignments in SAM/BAM/CRAM format. The exact file
format will be determined automatically. (Default: off)
--fai <file>
If the header section of input alignment file does not contain
reference sequence information, this option should be turned on.
<file> is a FAI format file containing each reference sequence's name
and length. Please refer to the SAM official website for the details
of FAI format. (Default: off)
--bowtie2
Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM
does not handle indel, local and discordant alignments, the Bowtie2
parameters are set in a way to avoid those alignments. In particular,
we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1
--score-min L,0,-0.1' by default. The last parameter of '--score-min',
'-0.1', is the negative of maximum mismatch rate. This rate can be set
by option '--bowtie2-mismatch-rate'. If reads are paired-end, we
additionally use options '--no-mixed' and '--no-discordant'. (Default:
off)
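The relationship between '--bowtie2-mismatch-rate' and the '--score-min L,0,-0.1' setting can be worked through as arithmetic. With a linear function L,0,-r the minimum score for a read of length n is 0 - r*n, and with '--mp 1,1' each mismatch costs one point, so roughly floor(r*n) mismatches are tolerated. The helper below is a hypothetical illustration of that calculation, not part of RSEM or Bowtie 2.

```shell
# max_mismatches READ_LENGTH MISMATCH_RATE
# Prints the approximate number of mismatches allowed under --score-min L,0,-RATE and --mp 1,1.
max_mismatches() {
  awk -v len="$1" -v rate="$2" 'BEGIN { printf "%d\n", int(len * rate + 1e-9) }'
}

max_mismatches 100 0.1   # 100 bp read at the default rate
max_mismatches 75 0.1    # 75 bp read at the default rate
```

Raising the rate loosens the filter in proportion to read length, which is why it is expressed as a rate and not as a fixed mismatch count.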
--star
Use STAR to align reads. Alignment parameters are from ENCODE3's
STAR-RSEM pipeline. To save computational time and memory resources,
STAR's Output BAM file is unsorted. It is stored in RSEM's temporary
directory with name as 'sample_name.bam'. Each STAR job will have its
own private copy of the genome in memory. (Default: off)
--hisat2-hca
Use HISAT2 to align reads to the transcriptome according to Human Cell
Atlas SMART-Seq2 pipeline. In particular, we use HISAT parameters "-k
10 --secondary --rg-id=$sampleToken --rg SM:$sampleToken --rg
LB:$sampleToken --rg PL:ILLUMINA --rg PU:$sampleToken --new-summary
--summary-file $sampleName.log --met-file $sampleName.hisat2.met.txt
--met 5 --mp 1,1 --np 1 --score-min L,0,-0.1 --rdg 99999999,99999999
--rfg 99999999,99999999 --no-spliced-alignment --no-softclip --seed
12345". If inputs are paired-end reads, we additionally use parameters
"--no-mixed --no-discordant". (Default: off)
--append-names
If gene_name/transcript_name is available, append it to the end of
gene_id/transcript_id (separated by '_') in files
'sample_name.isoforms.results' and 'sample_name.genes.results'.
(Default: off)
--seed <uint32>
Set the seed for the random number generators used in calculating
posterior mean estimates and credibility intervals. The seed must be a
non-negative 32 bit integer. (Default: off)
--single-cell-prior
By default, RSEM uses Dirichlet(1) as the prior to calculate posterior
mean estimates and credibility intervals. However, far fewer genes are
expressed in single cell RNA-Seq data. Thus, if you want to compute
posterior mean estimates and/or credibility intervals and you have
single-cell RNA-Seq data, you are recommended to turn on this option.
Then RSEM will use Dirichlet(0.1) as the prior which encourages the
sparsity of the expression levels. (Default: off)
--calc-pme
Run RSEM's collapsed Gibbs sampler to calculate posterior mean
estimates. (Default: off)
--calc-ci
Calculate 95% credibility intervals and posterior mean estimates. The
credibility level can be changed by setting '--ci-credibility-level'.
(Default: off)
-q/--quiet
Suppress the output of logging information. (Default: off)
-h/--help
Show help information.
--version
Show version information.
OUTPUT OPTIONS
--sort-bam-by-read-name
Sort BAM file aligned under transcript coordinate by read name.
Setting this option on will produce deterministic maximum likelihood
estimations from independent runs. Note that sorting will take long
time and lots of memory. (Default: off)
--no-bam-output
Do not output any BAM file. (Default: off)
--sampling-for-bam
When RSEM generates a BAM file, instead of outputting all alignments a
read has with their posterior probabilities, one alignment is sampled
according to the posterior probabilities. The sampling procedure
includes the alignment to the "noise" transcript, which does not
appear in the BAM file. Only the sampled alignment has a weight of 1.
All other alignments have weight 0. If the "noise" transcript is
sampled, all alignments appeared in the BAM file should have weight 0.
(Default: off)
--output-genome-bam
Generate a BAM file, 'sample_name.genome.bam', with alignments mapped
to genomic coordinates and annotated with their posterior
probabilities. In addition, RSEM will call samtools (included in RSEM
package) to sort and index the bam file.
'sample_name.genome.sorted.bam' and
'sample_name.genome.sorted.bam.bai' will be generated. (Default: off)
--sort-bam-by-coordinate
Sort RSEM generated transcript and genome BAM files by coordinates and
build associated indices. (Default: off)
--sort-bam-memory-per-thread <string>
Set the maximum memory per thread that can be used by 'samtools sort'.
<string> represents the memory and accepts suffixes 'K/M/G'. RSEM will
pass <string> to the '-m' option of 'samtools sort'. Note that the
default used here is different from the default used by samtools.
(Default: 1G)
ALIGNER OPTIONS
--seed-length <int>
Seed length used by the read aligner. Providing the correct value is
important for RSEM. If RSEM runs Bowtie, it uses this value for
Bowtie's seed length parameter. Any read with its or at least one of
its mates' (for paired-end reads) length less than this value will be
ignored. If poly(A) tails are not added to the references, the minimum
allowed value is 5, otherwise, the minimum allowed value is 25. Note
that this script will only check if the value >= 5 and give a warning
message if the value < 25 but >= 5. (Default: 25)
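Because reads shorter than '--seed-length' are ignored, it can be useful to count how many reads would be dropped before running RSEM. The following is a minimal sketch, assuming an uncompressed four-line-per-record FASTQ; the file name toy_seedlen.fastq and its two reads (30 and 20 bases) are made up.

```shell
# Toy FASTQ: r1 is 30 bases long, r2 is 20 bases long (hypothetical data).
{
  printf '@r1\nACGTACGTACGTACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n'
  printf '@r2\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n'
} > toy_seedlen.fastq

# Count sequence lines shorter than the seed length (25 is the documented default).
short_reads=$(awk -v min=25 'NR % 4 == 2 && length($0) < min { n++ }
                             END { print n + 0 }' toy_seedlen.fastq)

echo "reads shorter than 25 bases: $short_reads"
```

For paired-end data the same check would have to be applied to both mate files, since a pair is ignored if either mate is too short.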
--phred33-quals
Input quality scores are encoded as Phred+33. This option is used by
Bowtie, Bowtie 2 and HISAT2. (Default: on)
--phred64-quals
Input quality scores are encoded as Phred+64 (default for GA Pipeline
ver. >= 1.3). This option is used by Bowtie, Bowtie 2 and HISAT2.
(Default: off)
--solexa-quals
Input quality scores are solexa encoded (from GA Pipeline ver. < 1.3).
This option is used by Bowtie, Bowtie 2 and HISAT2. (Default: off)
--bowtie-path <path>
The path to the Bowtie executables. (Default: the path to the Bowtie
executables is assumed to be in the user's PATH environment variable)
--bowtie-n <int>
(Bowtie parameter) max # of mismatches in the seed. (Range: 0-3,
Default: 2)
--bowtie-e <int>
(Bowtie parameter) max sum of mismatch quality scores across the
alignment. (Default: 99999999)
--bowtie-m <int>
(Bowtie parameter) suppress all alignments for a read if > <int> valid
alignments exist. (Default: 200)
--bowtie-chunkmbs <int>
(Bowtie parameter) memory allocated for best first alignment
calculation (Default: 0 - use Bowtie's default)
--bowtie2-path <path>
(Bowtie 2 parameter) The path to the Bowtie 2 executables. (Default:
the path to the Bowtie 2 executables is assumed to be in the user's
PATH environment variable)
--bowtie2-mismatch-rate <double>
(Bowtie 2 parameter) The maximum mismatch rate allowed. (Default: 0.1)
--bowtie2-k <int>
(Bowtie 2 parameter) Find up to <int> alignments per read. (Default:
200)
--bowtie2-sensitivity-level <string>
(Bowtie 2 parameter) Set Bowtie 2's preset options in --end-to-end
mode. This option controls how hard Bowtie 2 tries to find alignments.
<string> must be one of "very_fast", "fast", "sensitive" and
"very_sensitive". The four candidates correspond to Bowtie 2's
"--very-fast", "--fast", "--sensitive" and "--very-sensitive" options.
(Default: "sensitive" - use Bowtie 2's default)
--star-path <path>
The path to STAR's executable. (Default: the path to STAR executable
is assumed to be in user's PATH environment variable)
--star-gzipped-read-file
(STAR parameter) Input read file(s) is compressed by gzip. (Default:
off)
--star-bzipped-read-file
(STAR parameter) Input read file(s) is compressed by bzip2. (Default:
off)
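Choosing between '--star-gzipped-read-file' and '--star-bzipped-read-file' can be automated by looking at the first bytes of the read file. The helper below is a hypothetical sketch: it relies on the standard gzip (1f 8b) and bzip2 ("BZh") magic bytes, not on the file extension, and the file names are made up. These flags only apply when '--star' is used.

```shell
# Print the STAR-related RSEM flag matching a read file's compression, judged by magic bytes.
star_compression_flag() {
  magic=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
  case "$magic" in
    1f8b*)  echo "--star-gzipped-read-file" ;;
    425a68) echo "--star-bzipped-read-file" ;;
    *)      echo "" ;;
  esac
}

# Toy read file in plain and gzip-compressed form.
printf '@r1\nACGT\n+\nIIII\n' > toy_reads.fastq
gzip -c toy_reads.fastq > toy_reads.fastq.gz

star_compression_flag toy_reads.fastq.gz   # prints the gzip flag
star_compression_flag toy_reads.fastq      # prints an empty line
```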
--star-output-genome-bam
(STAR parameter) Save the BAM file from STAR alignment under genomic
coordinate to 'sample_name.STAR.genome.bam'. This file is NOT sorted
by genomic coordinate. In this file, according to STAR's manual,
'paired ends of an alignment are always adjacent, and multiple
alignments of a read are adjacent as well'. (Default: off)
--hisat2-path <path>
The path to HISAT2's executable. (Default: the path to HISAT2
executable is assumed to be in user's PATH environment variable)
ADVANCED OPTIONS
--tag <string>
The name of the optional field used in the SAM input for identifying a
read with too many valid alignments. The field should have the format
<tagName>:i:<value>, where a <value> bigger than 0 indicates a read
with too many alignments. (Default: "")
--fragment-length-min <int>
Minimum read/insert length allowed. This is also the value for the
Bowtie/Bowtie2 -I option. (Default: 1)
--fragment-length-max <int>
Maximum read/insert length allowed. This is also the value for the
Bowtie/Bowtie 2 -X option. (Default: 1000)
--fragment-length-mean <double>
(single-end data only) The mean of the fragment length distribution,
which is assumed to be a Gaussian. (Default: -1, which disables use of
the fragment length distribution)
--fragment-length-sd <double>
(single-end data only) The standard deviation of the fragment length
distribution, which is assumed to be a Gaussian. (Default: 0, which
assumes that all fragments are of the same length, given by the
rounded value of --fragment-length-mean)
--estimate-rspd
Set this option if you want to estimate the read start position
distribution (RSPD) from data. Otherwise, RSEM will use a uniform
RSPD. (Default: off)
--num-rspd-bins <int>
Number of bins in the RSPD. Only relevant when '--estimate-rspd' is
specified. Use of the default setting is recommended. (Default: 20)
--gibbs-burnin <int>
The number of burn-in rounds for RSEM's Gibbs sampler. Each round
passes over the entire data set once. If RSEM can use multiple
threads, multiple Gibbs samplers will start at the same time and all
samplers share the same burn-in number. (Default: 200)
--gibbs-number-of-samples <int>
The total number of count vectors RSEM will collect from its Gibbs
samplers. (Default: 1000)
--gibbs-sampling-gap <int>
The number of rounds between two successive count vectors RSEM collects.
If the count vector after round N is collected, the count vector after
round N + <int> will also be collected. (Default: 1)
--ci-credibility-level <double>
The credibility level for credibility intervals. (Default: 0.95)
--ci-memory <int>
Maximum size (in memory, MB) of the auxiliary buffer used for
computing credibility intervals (CI). (Default: 1024)
--ci-number-of-samples-per-count-vector <int>
The number of read generating probability vectors sampled per sampled
count vector. The credibility intervals are calculated by first sampling
P(C | D) and then sampling P(Theta | C) for each sampled count vector.
This option controls how many Theta vectors are sampled per sampled
count vector. (Default: 50)
--keep-intermediate-files
Keep temporary files generated by RSEM. RSEM creates a temporary
directory, 'sample_name.temp', into which it puts all intermediate
output files. If this directory already exists, RSEM overwrites all
files generated by previous RSEM runs inside of it. By default, after
RSEM finishes, the temporary directory is deleted. Set this option to
prevent the deletion of this directory and the intermediate files
inside of it. (Default: off)
--temporary-folder <string>
Set where to put the temporary files generated by RSEM. If the folder
specified does not exist, RSEM will try to create it. (Default:
sample_name.temp)
--time
Output time consumed by each step of RSEM to 'sample_name.time'.
(Default: off)
PRIOR-ENHANCED RSEM OPTIONS
--run-pRSEM
Running prior-enhanced RSEM (pRSEM). Prior parameters, i.e. isoform's
initial pseudo-count for RSEM's Gibbs sampling, will be learned from
input RNA-seq data and an external data set. When pRSEM needs and only
needs ChIP-seq peak information to partition isoforms (e.g. in pRSEM's
default partition model), either ChIP-seq peak file (with the
'--chipseq-peak-file' option) or ChIP-seq FASTQ files for target and
input and the path for Bowtie executables are required (with the
'--chipseq-target-read-files <string>', '--chipseq-control-read-files
<string>', and '--bowtie-path <path>' options), otherwise, ChIP-seq
FASTQ files for target and control and the path to Bowtie executables
are required. (Default: off)
--chipseq-peak-file <string>
Full path to a ChIP-seq peak file in ENCODE's narrowPeak, i.e. BED6+4,
format. This file is used when running prior-enhanced RSEM in the
default two-partition model. It partitions isoforms by whether they
have ChIP-seq overlapping with their transcription start site region
or not. Each partition will have its own prior parameter learned from
a training set. This file can be either gzipped or ungzipped.
(Default: "")
--chipseq-target-read-files <string>
Comma-separated full path of FASTQ read file(s) for ChIP-seq target.
This option is used when running prior-enhanced RSEM. It provides
information to calculate ChIP-seq peaks and signals. The file(s) can
be either ungzipped or gzipped with a suffix '.gz' or '.gzip'. The
options '--bowtie-path <path>' and '--chipseq-control-read-files
<string>' must be defined when this option is specified. (Default: "")
--chipseq-control-read-files <string>
Comma-separated full path of FASTQ read file(s) for ChIP-seq control.
This option is used when running prior-enhanced RSEM. It provides
information to call ChIP-seq peaks. The file(s) can be either
ungzipped or gzipped with a suffix '.gz' or '.gzip'. The options
'--bowtie-path <path>' and '--chipseq-target-read-files <string>' must
be defined when this option is specified. (Default: "")
--chipseq-read-files-multi-targets <string>
Comma-separated full path of FASTQ read files for multiple ChIP-seq
targets. This option is used when running prior-enhanced RSEM, where
prior is learned from multiple complementary data sets. It provides
information to calculate ChIP-seq signals. All files can be either
ungzipped or gzipped with a suffix '.gz' or '.gzip'. When this option
is specified, the option '--bowtie-path <path>' must be defined and
the option '--partition-model <string>' will be set to 'cmb_lgt'
automatically. (Default: "")
--chipseq-bed-files-multi-targets <string>
Comma-separated full path of BED files for multiple ChIP-seq targets.
This option is used when running prior-enhanced RSEM, where prior is
learned from multiple complementary data sets. It provides information
of ChIP-seq signals and must have at least the first six BED columns.
All files can be either ungzipped or gzipped with a suffix '.gz' or
'.gzip'. When this option is specified, the option '--partition-model
<string>' will be set to 'cmb_lgt' automatically. (Default: "")
--cap-stacked-chipseq-reads
Keep a maximum number of ChIP-seq reads that aligned to the same
genomic interval. This option is used when running prior-enhanced
RSEM, where prior is learned from multiple complementary data sets.
This option is only in use when either
'--chipseq-read-files-multi-targets <string>' or
'--chipseq-bed-files-multi-targets <string>' is specified. (Default:
off)
--n-max-stacked-chipseq-reads <int>
The maximum number of stacked ChIP-seq reads to keep. This option is
used when running prior-enhanced RSEM, where prior is learned from
multiple complementary data sets. This option is only in use when the
option '--cap-stacked-chipseq-reads' is set. (Default: 5)
--partition-model <string>
A keyword to specify the partition model used by prior-enhanced RSEM.
It must be one of the following keywords:
- pk
Partitioned by whether an isoform has a ChIP-seq peak overlapping
with its transcription start site (TSS) region. The TSS region is
defined as [TSS-500bp, TSS+500bp]. For simplicity, we refer to this
type of peak as 'TSS peak' when explaining other keywords.
- pk_lgtnopk
First partitioned by TSS peak. Then, for isoforms in the 'no TSS
peak' set, a logistic model is employed to further classify them
into two partitions.
- lm3, lm4, lm5, or lm6
Based on their ChIP-seq signals, isoforms are classified into 3, 4,
5, or 6 partitions by a linear regression model.
- nopk_lm2pk, nopk_lm3pk, nopk_lm4pk, or
nopk_lm5pk
First partitioned by TSS peak. Then, for isoforms in the 'with TSS
peak' set, a linear regression model is employed to further classify
them into 2, 3, 4, or 5 partitions.
- pk_lm2nopk, pk_lm3nopk, pk_lm4nopk, or
pk_lm5nopk
First partitioned by TSS peak. Then, for isoforms in the 'no TSS
peak' set, a linear regression model is employed to further classify
them into 2, 3, 4, or 5 partitions.
- cmb_lgt
Using a logistic regression to combine TSS signals from multiple
complementary data sets and partition training set isoforms into
'expressed' and 'not expressed'. This partition model is only in use
when either '--chipseq-read-files-multi-targets <string>' or
'--chipseq-bed-files-multi-targets <string>' is specified.
Parameters for all the above models are learned from a training set.
For detailed explanations, please see prior-enhanced RSEM's paper.
(Default: 'pk')
DEPRECATED OPTIONS
The options in this section are deprecated. They are here only for
compatibility reasons and may be removed in future releases.
--sam
Inputs are alignments in SAM format. (Default: off)
--bam
Inputs are alignments in BAM format. (Default: off)
--strand-specific
Equivalent to '--strandedness forward'. (Default: off)
--forward-prob <double>
Probability of generating a read from the forward strand of a
transcript. Set to 1 for a strand-specific protocol where all
(upstream) reads are derived from the forward strand, 0 for a
strand-specific protocol where all (upstream) read are derived from
the reverse strand, or 0.5 for a non-strand-specific protocol.
(Default: off)
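The relation between the deprecated '--forward-prob' values and the '--strandedness' keywords described above can be sketched as a small lookup. This is only an illustration in Python; the names `FORWARD_PROB` and `forward_prob_for` are hypothetical and not part of RSEM.

```python
# Hypothetical helper (not part of RSEM): maps a '--strandedness' keyword to
# the equivalent deprecated '--forward-prob' value described in the help text.
FORWARD_PROB = {
    "none": 0.5,     # non-strand-specific protocol
    "forward": 1.0,  # all (upstream) reads come from the forward strand
    "reverse": 0.0,  # all (upstream) reads come from the reverse strand
}


def forward_prob_for(strandedness):
    """Return the forward-strand probability for a strandedness keyword."""
    try:
        return FORWARD_PROB[strandedness]
    except KeyError:
        raise ValueError(f"unknown strandedness: {strandedness!r}") from None


# The deprecated '--strand-specific' flag corresponds to 'forward'.
print(forward_prob_for("forward"))
```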
DESCRIPTION
In its default mode, this program aligns input reads against a reference
transcriptome with Bowtie and calculates expression values using the
alignments. RSEM assumes the data are single-end reads with quality
scores, unless the '--paired-end' or '--no-qualities' options are
specified. Alternatively, users can use STAR to align reads using the
'--star' option. RSEM has provided options in 'rsem-prepare-reference' to
prepare STAR's genome indices. Users may use an alternative aligner by
specifying '--alignments', and providing an alignment file in SAM/BAM/CRAM
format. However, users should make sure that they align against the
indices generated by 'rsem-prepare-reference' and the alignment file
satisfies the requirements mentioned in ARGUMENTS section.
One simple way to make the alignment file satisfying RSEM's requirements
is to use the 'convert-sam-for-rsem' script. This script accepts
SAM/BAM/CRAM files as input and outputs a BAM file. For example, type the
following command to convert a SAM file, 'input.sam', to a ready-for-use
BAM file, 'input_for_rsem.bam':
convert-sam-for-rsem input.sam input_for_rsem
For details, please refer to 'convert-sam-for-rsem's documentation page.
NOTES
1. Users must run 'rsem-prepare-reference' with the appropriate reference
before using this program.
2. For single-end data, it is strongly recommended that the user provide
the fragment length distribution parameters (--fragment-length-mean and
--fragment-length-sd). For paired-end data, RSEM will automatically learn
a fragment length distribution from the data.
3. Some aligner parameters have default values different from their
original settings.
4. With the '--calc-pme' option, posterior mean estimates will be
calculated in addition to maximum likelihood estimates.
5. With the '--calc-ci' option, 95% credibility intervals and posterior
mean estimates will be calculated in addition to maximum likelihood
estimates.
6. The temporary directory and all intermediate files will be removed when
RSEM finishes unless '--keep-intermediate-files' is specified.
With the '--run-pRSEM' option and associated options (see section
'PRIOR-ENHANCED RSEM OPTIONS' above for details), prior-enhanced RSEM will
be running. Prior parameters will be learned from supplied external data
set(s) and assigned as initial pseudo-counts for isoforms in the
corresponding partition for Gibbs sampling.
OUTPUT
sample_name.isoforms.results
File containing isoform level expression estimates. The first line
contains column names separated by the tab character. The format of
each line in the rest of this file is:
transcript_id gene_id length effective_length expected_count TPM FPKM
IsoPct [posterior_mean_count posterior_standard_deviation_of_count
pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound
TPM_ci_upper_bound TPM_coefficient_of_quartile_variation
FPKM_ci_lower_bound FPKM_ci_upper_bound
FPKM_coefficient_of_quartile_variation]
Fields are separated by the tab character. Fields within "[]" are
optional. They will not be presented if neither '--calc-pme' nor
'--calc-ci' is set.
'transcript_id' is the transcript name of this transcript. 'gene_id'
is the gene name of the gene which this transcript belongs to (denote
this gene as its parent gene). If no gene information is provided,
'gene_id' and 'transcript_id' are the same.
'length' is this transcript's sequence length (poly(A) tail is not
counted). 'effective_length' counts only the positions that can
generate a valid fragment. If no poly(A) tail is added,
'effective_length' is equal to transcript length - mean fragment
length + 1. If a transcript's effective length is less than 1, both
its effective length and its abundance estimates are set to
0.
'expected_count' is the sum, over all reads, of the posterior probability
that each read comes from this transcript. Because 1) each read
aligning to this transcript has a probability of being generated from
background noise; 2) RSEM may filter some alignable low quality reads,
the sum of expected counts for all transcripts is generally less than
the total number of reads aligned.
'TPM' stands for Transcripts Per Million. It is a relative measure of
transcript abundance. The sum of all transcripts' TPM is 1 million.
'FPKM' stands for Fragments Per Kilobase of transcript per Million
mapped reads. It is another relative measure of transcript abundance.
If we define l_bar to be the mean transcript length in a sample, which
can be calculated as
l_bar = \sum_i TPM_i / 10^6 * effective_length_i (i goes through every
transcript),
the following equation holds:
FPKM_i = 10^3 / l_bar * TPM_i.
We can see that the sum of FPKM is not a constant across samples.
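The TPM to FPKM relation above can be checked numerically. The following is a minimal Python sketch that applies the two formulas exactly as written in the help text; the function name `tpm_to_fpkm` and the three-transcript sample are hypothetical, chosen only for illustration.

```python
def tpm_to_fpkm(tpm, effective_length):
    """Convert TPM values to FPKM using the relation quoted in the help text.

    l_bar  = sum_i TPM_i / 10^6 * effective_length_i
    FPKM_i = 10^3 / l_bar * TPM_i
    """
    if len(tpm) != len(effective_length):
        raise ValueError("tpm and effective_length must have the same length")
    # Mean transcript length of the sample, weighted by relative abundance.
    l_bar = sum(t / 1e6 * l for t, l in zip(tpm, effective_length))
    if l_bar <= 0:
        raise ValueError("mean transcript length must be positive")
    return [1e3 / l_bar * t for t in tpm]


# Hypothetical sample: TPM sums to one million, as stated for RSEM output.
tpm = [500000.0, 300000.0, 200000.0]
eff_len = [1000.0, 2000.0, 500.0]
fpkm = tpm_to_fpkm(tpm, eff_len)
print(fpkm, sum(fpkm))
```

Because l_bar depends on which transcripts are abundant in a given sample, the FPKM total changes from sample to sample while the TPM total does not.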
'IsoPct' stands for isoform percentage. It is the percentage of this
transcript's abundance over its parent gene's abundance. If its parent
gene has only one isoform or the gene information is not provided,
this field will be set to 100.
'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean
estimates calculated by RSEM's Gibbs sampler.
'posterior_standard_deviation_of_count' is the posterior standard
deviation of counts. 'IsoPct_from_pme_TPM' is the isoform percentage
calculated from 'pme_TPM' values.
'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and
'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95%
credibility intervals for TPM and FPKM values. The bounds are
inclusive (i.e. [l, u]).
'TPM_coefficient_of_quartile_variation' and
'FPKM_coefficient_of_quartile_variation' are coefficients of quartile
variation (CQV) for TPM and FPKM values. CQV is a robust way of
measuring the ratio between the standard deviation and the mean. It is
defined as
CQV := (Q3 - Q1) / (Q3 + Q1),
where Q1 and Q3 are the first and third quartiles.
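As an illustration of the CQV definition above, here is a minimal Python sketch. It uses `statistics.quantiles` with the "inclusive" method to obtain Q1 and Q3; that quartile convention is an assumption of this sketch, and RSEM may compute quartiles from its Gibbs samples differently. The function name `cqv` is hypothetical.

```python
import statistics


def cqv(samples):
    """Coefficient of quartile variation: (Q3 - Q1) / (Q3 + Q1).

    Needs at least two values, because statistics.quantiles requires that.
    """
    # Quartile convention ('inclusive') is an assumption of this sketch.
    q1, _median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    if q1 + q3 == 0:
        # All-zero quartiles: report 0 rather than dividing by zero.
        return 0.0
    return (q3 - q1) / (q3 + q1)


# Hypothetical TPM samples for one transcript.
print(cqv([1.0, 2.0, 3.0, 4.0, 5.0]))
```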
sample_name.genes.results
File containing gene level expression estimates. The first line
contains column names separated by the tab character. The format of
each line in the rest of this file is:
gene_id transcript_id(s) length effective_length expected_count TPM
FPKM [posterior_mean_count posterior_standard_deviation_of_count
pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound
TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound
FPKM_ci_upper_bound FPKM_coefficient_of_quartile_variation]
Fields are separated by the tab character. Fields within "[]" are
optional. They will not be presented if neither '--calc-pme' nor
'--calc-ci' is set.
'transcript_id(s)' is a comma-separated list of transcript_ids
belonging to this gene. If no gene information is provided, 'gene_id'
and 'transcript_id(s)' are identical (the 'transcript_id').
A gene's 'length' and 'effective_length' are defined as the weighted
average of its transcripts' lengths and effective lengths (weighted by
'IsoPct'). A gene's abundance estimates are just the sum of its
transcripts' abundance estimates.
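The gene-level definitions above (IsoPct-weighted lengths, summed abundances) can be sketched in Python as follows. The function name `gene_level_summary`, the dictionary layout, and the equal-weight fallback for genes with no expressed isoform are assumptions of this sketch, not RSEM's implementation.

```python
def gene_level_summary(isoforms):
    """Aggregate isoform rows into gene-level length and abundance.

    Each row is a dict with 'length', 'effective_length', 'IsoPct' and 'TPM'.
    """
    if not isoforms:
        raise ValueError("a gene needs at least one isoform")
    total_pct = sum(iso["IsoPct"] for iso in isoforms)
    if total_pct > 0:
        weights = [iso["IsoPct"] / total_pct for iso in isoforms]
    else:
        # Assumption: use equal weights when no isoform is expressed.
        weights = [1.0 / len(isoforms)] * len(isoforms)
    return {
        "length": sum(w * iso["length"] for w, iso in zip(weights, isoforms)),
        "effective_length": sum(
            w * iso["effective_length"] for w, iso in zip(weights, isoforms)
        ),
        # Abundance is simply the sum over the gene's transcripts.
        "TPM": sum(iso["TPM"] for iso in isoforms),
    }


# Hypothetical gene with two isoforms.
gene = gene_level_summary([
    {"length": 1000, "effective_length": 800, "IsoPct": 75.0, "TPM": 30.0},
    {"length": 2000, "effective_length": 1800, "IsoPct": 25.0, "TPM": 10.0},
])
print(gene)
```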
sample_name.alleles.results
Only generated when the RSEM references are built with allele-specific
transcripts.
This file contains allele level expression estimates for
allele-specific expression calculation. The first line contains column
names separated by the tab character. The format of each line in the
rest of this file is:
allele_id transcript_id gene_id length effective_length expected_count
TPM FPKM AlleleIsoPct AlleleGenePct [posterior_mean_count
posterior_standard_deviation_of_count pme_TPM pme_FPKM
AlleleIsoPct_from_pme_TPM AlleleGenePct_from_pme_TPM
TPM_ci_lower_bound TPM_ci_upper_bound
TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound
FPKM_ci_upper_bound FPKM_coefficient_of_quartile_variation]
Fields are separated by the tab character. Fields within "[]" are
optional. They will not be presented if neither '--calc-pme' nor
'--calc-ci' is set.
'allele_id' is the allele-specific name of this allele-specific
transcript.
'AlleleIsoPct' stands for allele-specific percentage on isoform level.
It is the percentage of this allele-specific transcript's abundance
over its parent transcript's abundance. If its parent transcript has
only one allele variant form, this field will be set to 100.
'AlleleGenePct' stands for allele-specific percentage on gene level.
It is the percentage of this allele-specific transcript's abundance
over its parent gene's abundance.
'AlleleIsoPct_from_pme_TPM' and 'AlleleGenePct_from_pme_TPM' have
similar meanings. They are calculated based on posterior mean
estimates.
Please note that if this file is present, the fields 'length' and
'effective_length' in 'sample_name.isoforms.results' should be
interpreted similarly as the corresponding definitions in
'sample_name.genes.results'.
sample_name.transcript.bam
Only generated when --no-bam-output is not specified.
'sample_name.transcript.bam' is a BAM-formatted file of read
alignments in transcript coordinates. The MAPQ field of each alignment
is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the
posterior probability of that alignment being the true mapping of a
read. In addition, RSEM pads a new tag ZW:f:value, where value is a
single precision floating number representing the posterior
probability. Because this file contains all alignment lines produced
by bowtie or user-specified aligners, it can also be used as a
replacement of the aligner generated BAM/SAM file.
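The MAPQ formula above is easy to evaluate by hand. A minimal Python sketch follows; the function name `rsem_mapq` is hypothetical, and the explicit handling of w = 1 (where log10(0) is undefined) is an assumption that simply applies the cap of 100.

```python
import math


def rsem_mapq(w):
    """MAPQ = min(100, floor(-10 * log10(1 - w) + 0.5)) for posterior w."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("posterior probability must be within [0, 1]")
    if w >= 1.0:
        # log10(0) is undefined; the cap of 100 applies (sketch assumption).
        return 100
    return min(100, math.floor(-10.0 * math.log10(1.0 - w) + 0.5))


# A read split evenly between two transcripts gets a low MAPQ,
# while a nearly unambiguous one gets a high MAPQ.
for w in (0.0, 0.5, 0.99, 1.0):
    print(w, rsem_mapq(w))
```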
sample_name.transcript.sorted.bam and
sample_name.transcript.sorted.bam.bai
Only generated when --no-bam-output is not specified and
--sort-bam-by-coordinate is specified.
'sample_name.transcript.sorted.bam' and
'sample_name.transcript.sorted.bam.bai' are the sorted BAM file and
indices generated by samtools (included in RSEM package).
sample_name.genome.bam
Only generated when --no-bam-output is not specified and
--output-genome-bam is specified.
'sample_name.genome.bam' is a BAM-formatted file of read alignments in
genomic coordinates. Alignments of reads that have identical genomic
coordinates (i.e., alignments to different isoforms that share the
same genomic region) are collapsed into one alignment. The MAPQ field
of each alignment is set to min(100, floor(-10 * log10(1.0 - w) +
0.5)), where w is the posterior probability of that alignment being
the true mapping of a read. In addition, RSEM pads a new tag
ZW:f:value, where value is a single precision floating number
representing the posterior probability. If an alignment is spliced, a
XS:A:value tag is also added, where value is either '+' or '-'
indicating the strand of the transcript it aligns to.
sample_name.genome.sorted.bam and
sample_name.genome.sorted.bam.bai
Only generated when --no-bam-output is not specified, and
--sort-bam-by-coordinate and --output-genome-bam are specified.
'sample_name.genome.sorted.bam' and
'sample_name.genome.sorted.bam.bai' are the sorted BAM file and
indices generated by samtools (included in RSEM package).
sample_name.time
Only generated when --time is specified.
It contains time (in seconds) consumed by aligning reads, estimating
expression levels and calculating credibility intervals.
sample_name.log
Only generated when --alignments is not specified.
It captures alignment statistics outputted from the user-specified
aligner.
sample_name.stat
This is a folder instead of a file. All model related statistics are
stored in this folder. 'rsem-plot-model' can generate plots using
this folder.
'sample_name.stat/sample_name.cnt' contains alignment statistics. The
format and meanings of each field are described in
'cnt_file_description.txt' under RSEM directory.
'sample_name.stat/sample_name.model' stores RNA-Seq model parameters
learned from the data. The format and meanings of each field of this
file are described in 'model_file_description.txt' under RSEM
directory.
The following four output files will be generated only by
prior-enhanced RSEM
- 'sample_name.stat/sample_name_prsem.all_tr_features'
It stores isoform features for deriving and assigning pRSEM prior.
The first line is a header and the rest is one isoform per line. The
description for each column is:
* trid: transcript ID from input annotation
* geneid: gene ID from input annotation
* chrom: isoform's chromosome name
* strand: isoform's strand name
* start: isoform's end with the lowest genomic loci
* end: isoform's end with the highest genomic loci
* tss_mpp: average mappability of [TSS-500bp, TSS+500bp], where TSS
is isoform's transcription start site, i.e. 5'-end
* body_mpp: average mappability of (TSS+500bp, TES-500bp), where TES
is isoform's transcription end site, i.e. 3'-end
* tes_mpp: average mappability of [TES-500bp, TES+500bp]
* pme_count: isoform's fragment or read count from RSEM's posterior
mean estimates
* tss: isoform's TSS loci
* tss_pk: equal to 1 if isoform's [TSS-500bp, TSS+500bp] region
overlaps with a RNA Pol II peak; 0 otherwise
* is_training: equal to 1 if isoform is in the training set where
Pol II prior is learned; 0 otherwise
- 'sample_name.stat/sample_name_prsem.all_tr_prior'
It stores prior parameters for every isoform. This file does not
have a header. Each line contains a prior parameter and an isoform's
transcript ID delimited by ` # `.
- 'sample_name.stat/sample_name_uniform_prior_1.isoforms.results'
RSEM's posterior mean estimates on the isoform level with an initial
pseudo-count of one for every isoform. It is in the same format as
the 'sample_name.isoforms.results'.
- 'sample_name.stat/sample_name_uniform_prior_1.genes.results'
RSEM's posterior mean estimates on the gene level with an initial
pseudo-count of one for every isoform. It is in the same format as
the 'sample_name.genes.results'.
When learning prior from multiple external data sets in prior-enhanced
RSEM, two additional output files will be generated.
- 'sample_name.stat/sample_name.pval_LL'
It stores a p-value and a log-likelihood. The p-value indicates
whether the combination of multiple complementary data sets is
informative for RNA-seq quantification. The log-likelihood shows how
well pRSEM's Dirichlet-multinomial model fits the read counts of
partitioned training set isoforms.
- 'sample_name.stat/sample_name.lgt_mdl.RData'
It stores an R object named 'glmmdl', which is a logistic regression
model on the training set isoforms and multiple external data sets.
In addition, extra columns will be added to
'sample_name.stat/all_tr_features'
* is_expr: equal to 1 if isoform has an abundance >= 1 TPM and a
non-zero read count from RSEM's posterior mean estimates; 0
otherwise
* "$external_data_set_basename": log10 of external data's signal at
[TSS-500, TSS+500]. Signal is the number of reads aligned within
that interval and normalized to RPKM by read depth and interval
length. It will be set to -4 if no read aligned to that interval.
There are multiple columns like this one, where each represents an
external data set.
* prd_expr_prob: predicted probability from logistic regression model
on whether this isoform is expressed or not. A probability higher
than 0.5 is considered as expressed
* partition: group index, to which this isoform is partitioned
* prior: prior parameter for this isoform
EXAMPLES
Assume the path to the bowtie executables is in the user's PATH
environment variable. Reference files are under '/ref' with name
'mouse_125'.
1) '/data/mmliver.fq', single-end reads with quality scores. Quality
scores are encoded as for 'GA pipeline version >= 1.3'. We want to use 8
threads and generate a genome BAM file. In addition, we want to append
gene/transcript names to the result files:
rsem-calculate-expression --phred64-quals \
-p 8 \
--append-names \
--output-genome-bam \
/data/mmliver.fq \
/ref/mouse_125 \
mmliver_single_quals
2) '/data/mmliver_1.fq' and '/data/mmliver_2.fq', stranded paired-end
reads with quality scores. Suppose the library is prepared using TruSeq
Stranded Kit, which means the first mate should map to the reverse strand.
Quality scores are in SANGER format. We want to use 8 threads and do not
generate a genome BAM file:
rsem-calculate-expression -p 8 \
--paired-end \
--strandedness reverse \
/data/mmliver_1.fq \
/data/mmliver_2.fq \
/ref/mouse_125 \
mmliver_paired_end_quals
3) '/data/mmliver.fa', single-end reads without quality scores. We want to
use 8 threads:
rsem-calculate-expression -p 8 \
--no-qualities \
/data/mmliver.fa \
/ref/mouse_125 \
mmliver_single_without_quals
4) Data are the same as 1). This time we assume the bowtie executables are
under '/sw/bowtie'. We want to take a fragment length distribution into
consideration. We set the fragment length mean to 150 and the standard
deviation to 35. In addition to a BAM file, we also want to generate
credibility intervals. We allow RSEM to use 1GB of memory for CI
calculation:
rsem-calculate-expression --bowtie-path /sw/bowtie \
--phred64-quals \
--fragment-length-mean 150.0 \
--fragment-length-sd 35.0 \
-p 8 \
--output-genome-bam \
--calc-ci \
--ci-memory 1024 \
/data/mmliver.fq \
/ref/mouse_125 \
mmliver_single_quals
5) '/data/mmliver_paired_end_quals.bam', BAM-formatted alignments for
paired-end reads with quality scores. We want to use 8 threads:
rsem-calculate-expression --paired-end \
--alignments \
-p 8 \
/data/mmliver_paired_end_quals.bam \
/ref/mouse_125 \
mmliver_paired_end_quals
6) '/data/mmliver_1.fq.gz' and '/data/mmliver_2.fq.gz', paired-end reads
with quality scores and read files are compressed by gzip. We want to use
STAR to align reads and assume STAR executable is '/sw/STAR'. Suppose we
want to use 8 threads and do not generate a genome BAM file:
rsem-calculate-expression --paired-end \
--star \
--star-path /sw/STAR \
--gzipped-read-file \
-p 8 \
/data/mmliver_1.fq.gz \
/data/mmliver_2.fq.gz \
/ref/mouse_125 \
mmliver_paired_end_quals
7) In the above example, suppose we want to run prior-enhanced RSEM
instead. Assuming we want to learn priors from a ChIP-seq peak file
'/data/mmliver.narrowPeak.gz':
rsem-calculate-expression --star \
--star-path /sw/STAR \
--gzipped-read-file \
--paired-end \
--calc-pme \
--run-pRSEM \
--chipseq-peak-file /data/mmliver.narrowPeak.gz \
-p 8 \
/data/mmliver_1.fq.gz \
/data/mmliver_2.fq.gz \
/ref/mouse_125 \
mmliver_paired_end_quals
8) Similar to the example in 7), suppose we want to use the partition
model 'pk_lm2nopk' (partitioning isoforms by Pol II TSS peak first and
then partitioning 'no TSS peak' isoforms into two bins by a linear
regression model), and we want to partition isoforms by RNA Pol II's
ChIP-seq read files '/data/mmliver_PolIIRep1.fq.gz' and
'/data/mmliver_PolIIRep2.fq.gz', and the control ChIP-seq read files
'/data/mmliver_ChIPseqCtrl.fq.gz'. Also, assuming Bowtie's executables are
under '/sw/bowtie/':
rsem-calculate-expression --star \
--star-path /sw/STAR \
--gzipped-read-file \
--paired-end \
--calc-pme \
--run-pRSEM \
--chipseq-target-read-files /data/mmliver_PolIIRep1.fq.gz,/data/mmliver_PolIIRep2.fq.gz \
--chipseq-control-read-files /data/mmliver_ChIPseqCtrl.fq.gz \
--partition-model pk_lm2nopk \
--bowtie-path /sw/bowtie \
-p 8 \
/data/mmliver_1.fq.gz \
/data/mmliver_2.fq.gz \
/ref/mouse_125 \
mmliver_paired_end_quals
9) Similar to the example in 8), suppose we want to derive prior from four
histone modification ChIP-seq read data sets: '/data/H3K27Ac.fastq.gz',
'/data/H3K4me1.fastq.gz', '/data/H3K4me2.fastq.gz', and
'/data/H3K4me3.fastq.gz'. Also, assuming Bowtie's executables are under
'/sw/bowtie/':
rsem-calculate-expression --star \
--star-path /sw/STAR \
--gzipped-read-file \
--paired-end \
--calc-pme \
--run-pRSEM \
--partition-model cmb_lgt \
--chipseq-read-files-multi-targets /data/H3K27Ac.fastq.gz,/data/H3K4me1.fastq.gz,/data/H3K4me2.fastq.gz,/data/H3K4me3.fastq.gz \
--bowtie-path /sw/bowtie \
-p 8 \
/data/mmliver_1.fq.gz \
/data/mmliver_2.fq.gz \
/ref/mouse_125 \
mmliver_paired_end_quals
> rsem-generate-data-matrix
$ rsem-generate-data-matrix
Usage: rsem-generate-data-matrix sampleA.[alleles/genes/isoforms].results sampleB.[alleles/genes/isoforms].results ... > output_name.matrix
All result files should have the same file type. The 'expected_count' columns of every result file are extracted to form the data matrix.
There are many other commands as well.
> rsem-
$ rsem-
rsem-bam2readdepth
rsem-extract-reference-transcripts
rsem-get-unique
rsem-preref
rsem-scan-for-paired-end-reads
rsem-bam2wig
rsem-for-ebseq-calculate-clustering-info
rsem-gff3-to-gtf
rsem-refseq-extract-primary-assembly
rsem-simulate-reads
rsem-build-read-index
rsem-for-ebseq-find-DE
rsem-parse-alignments
rsem-run-ebseq
rsem-synthesis-reference-transcripts
rsem-calculate-credibility-intervals
rsem-for-ebseq-generate-ngvector-from-clustering-info
rsem-plot-model
rsem-run-em
rsem-tbam2gbam
rsem-calculate-expression
rsem-gen-transcript-plots
rsem-plot-transcript-wiggles
rsem-run-gibbs
rsem-control-fdr
rsem-generate-data-matrix
rsem-prepare-reference
rsem-sam-validator
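Usage
1. Indexing
Specify the GTF file, the genome FASTA, and finally the index name (this same name is reused in step 2). The bowtie2/STAR/HISAT2 index is also built during the run (files prefixed with the index name, e.g. bowtie2_index.* for the first command below). With --gff3, a GFF3 annotation can be given instead of GTF.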
#bowtie2
rsem-prepare-reference --gtf genome.gtf --bowtie2 --bowtie2-path <path>/<to>/<your>/<bowtie2-path> -p 20 genome.fa bowtie2_index
#star
rsem-prepare-reference --gtf genome.gtf --star --star-path <path>/<to>/<your>/<star-path> -p 20 genome.fa star_index
#hisat2
rsem-prepare-reference --gtf genome.gtf --hisat2-hca --hisat2-path <path>/<to>/<your>/<hisat2-path> -p 20 genome.fa hisat2_index
- --gtf <file> If this option is on, RSEM assumes that 'reference_fasta_file(s)' contains the sequence of a genome, and will extract transcript reference sequences using the gene annotations specified in <file>, which should be in GTF format. If this and '--gff3' options are off, RSEM will assume 'reference_fasta_file(s)' contains the reference transcripts. In this case, RSEM assumes that the name of each sequence in the Multi-FASTA files is its transcript_id. (Default: off)
- --gff3 The annotation file is in GFF3 format instead of GTF format. RSEM will first convert it to GTF format with the file name 'reference_name.gtf'. Please make sure that 'reference_name.gtf' does not exist. (Default: off)
- --bowtie2 Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM does not handle indel, local and discordant alignments, the Bowtie2 parameters are set in a way to avoid those alignments. In particular, we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1' by default. The last parameter of '--score-min', '-0.1', is the negative of maximum mismatch rate. This rate can be set by option '--bowtie2-mismatch-rate'. If reads are paired-end, we additionally use options '--no-mixed' and '--no-discordant'. (Default: off)
- --star Use STAR to align reads. Alignment parameters are from ENCODE3's STAR-RSEM pipeline. To save computational time and memory resources, STAR's output BAM file is unsorted. It is stored in RSEM's temporary directory with name as 'sample_name.bam'. Each STAR job will have its own private copy of the genome in memory. (Default: off)
- --hisat2-hca Use HISAT2 to align reads to the transcriptome according to Human Cell Atlas SMART-Seq2 pipeline. In particular, we use HISAT parameters "-k 10 --secondary --rg-id=$sampleToken --rg SM:$sampleToken --rg LB:$sampleToken --rg PL:ILLUMINA --rg PU:$sampleToken --new-summary --summary-file $sampleName.log --met-file $sampleName.hisat2.met.txt --met 5 --mp 1,1 --np 1 --score-min L,0,-0.1 --rdg 99999999,99999999 --rfg 99999999,99999999 --no-spliced-alignment --no-softclip --seed 12345". If inputs are paired-end reads, we additionally use parameters "--no-mixed --no-discordant". (Default: off)
- --transcript-to-gene-map <file> Use information from <file> to map from transcript (isoform) ids to gene ids. Each line of <file> should be of the form: gene_id transcript_id with the two fields separated by a tab character. If you are using a GTF file for the "UCSC Genes" gene set from the UCSC Genome Browser, then the "knownIsoforms.txt" file (obtained from the "Downloads" section of the UCSC Genome Browser site) is of this format. If this option is off, then the mapping of isoforms to genes depends on whether the '--gtf' option is specified. If '--gtf' is specified, then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF file. Otherwise, RSEM assumes that each sequence in the reference sequence files is a separate gene. (Default: off)
- --prep-pRSEM A Boolean indicating whether to prepare reference files for pRSEM, including building Bowtie indices for a genome and selecting training set isoforms. The index files will be used for aligning ChIP-seq reads in prior-enhanced RSEM and the training set isoforms will be used for learning prior. A path to Bowtie executables and a mappability file in bigWig format are required when this option is on. Currently, Bowtie2 is not supported for prior-enhanced RSEM. (Default: off)
- --mappability-bigwig-file Full path to a whole-genome mappability file in bigWig format. This file is required for running prior-enhanced RSEM. It is used for selecting a training set of isoforms for prior-learning. This file can be either downloaded from UCSC Genome Browser or generated by GEM (Derrien et al., 2012, PLoS One). (Default: "")
- --strandedness <none|forward|reverse> This option defines the strandedness of the RNA-Seq reads. It recognizes three values: 'none', 'forward', and 'reverse'. 'none' refers to non-strand-specific protocols. 'forward' means all (upstream) reads are derived from the forward strand. 'reverse' means all (upstream) reads are derived from the reverse strand. If 'forward'/'reverse' is set, the '--norc'/'--nofw' Bowtie/Bowtie 2 option will also be enabled to avoid aligning reads to the opposite strand. For Illumina TruSeq Stranded protocols, please use 'reverse'. (Default: 'none')
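Index files prefixed with the given index name are written out. Note that when using STAR you should run on a machine with plenty of memory (around 32 GB is often not enough). If the STAR index is built on a low-memory machine, the genomeParameters.txt file is sometimes not written. Without this file, the next step fails with an error.
For "--transcript-to-gene-map", see the following.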
How to get --transcript-to-gene-map <file> in RSEM?
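2. rsem-calculate-expression
After the options, specify the paired-end FASTQ files, the index name (for bowtie2, the index was named bowtie2_index in step 1, so specify "bowtie2_index"), and the output file name. Gzipped FASTQ cannot be used as is, so either decompress the files first or add the "--gzipped-read-file" option (*2). For single-end data, drop --paired-end.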
#bowtie2
rsem-calculate-expression --paired-end -p 20 --bowtie2 --bowtie2-path <path>/<to>/<your>/<bowtie2-path> \
sample1_R1.fq sample1_R2.fq bowtie2_index sample1
#star
rsem-calculate-expression --paired-end -p 20 --star --star-path <path>/<to>/<your>/<star-path> \
sample1_R1.fq sample1_R2.fq star_index sample1
#hisat2
rsem-calculate-expression --paired-end -p 20 --hisat2-hca --hisat2-path <path>/<to>/<your>/<hisat2-path> \
sample1_R1.fq sample1_R2.fq hisat2_index sample1
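sample1.genes.results, sample1.isoforms.results, and other files are output.
The BAM file is produced toward the end of the run and saved to disk. Its reference is the transcript reference, not the genome. Because reads are mapped against transcripts, take care to load the right reference when viewing the BAM in IGV or a similar viewer; with STAR it is star_index.transcripts.fa. (Seen another way, because reads are aligned to transcripts, you do not need to worry about how bowtie handles split alignments, or about settings such as the maximum split-alignment distance in hisat2.)
3. Merging multiple results (merging expected counts, *1)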
#gene level
rsem-generate-data-matrix sample*genes.results > genes.matrix
#transcripts(isoform) level
rsem-generate-data-matrix sample*isoforms.results > isoforms.matrix
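A table of expected read counts (RSEM's probabilistic read counts; raw, unnormalized values) is output. After fixing only the names in the header row, you can load it into a tool such as iDEP and get results right away. When you do, first run PCA or a similar analysis to check that the samples fall into sensible groups.
Other
You can also run the mapping separately and give the resulting BAM file to RSEM. In that case, pay attention to the alignment options, especially the settings for mapping to repeats.
For example, with STAR:
1. Indexing and mapping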
#index
STAR --runMode genomeGenerate --genomeDir STAR_index --genomeFastaFiles genome.fa --sjdbGTFfile annotation.gtf
#mapping
STAR --runThreadN 12 --genomeDir STAR_index --readFilesIn pair_R1.fq.gz pair_R2.fq.gz --readFilesCommand zcat --quantMode TranscriptomeSAM --outSAMtype BAM SortedByCoordinate --genomeLoad NoSharedMemory --outFilterMultimapNmax 1 --outFileNamePrefix sample1
2. Read count
When giving a BAM file, set the --alignments flag. No aligner is run in this mode, so an aligner flag such as --bowtie2 is not needed.
rsem-calculate-expression --alignments --paired-end -p 20 input.bam bowtie2_index sample1
- --alignments Input file contains alignments in SAM/BAM/CRAM format. The exact file format will be determined automatically. (Default: off)
Citation
RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
Bo Li, Colin N Dewey
BMC Bioinformatics. 2011 Aug 4
*1
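genes.results and isoforms.results report three kinds of values: the probabilistic (expected) count, and the normalized TPM and FPKM values. rsem-generate-data-matrix merges the expected counts. There is no option to switch columns, so to get TPM or FPKM, extract just that column from each file and paste the columns side by side.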
#TPM
cut -f 6 sample1.genes.results > sample1
cut -f 6 sample2.genes.results > sample2
cut -f 6 sample3.genes.results > sample3
paste sample1 sample2 sample3 > TPM
#FPKM
cut -f 7 sample1.genes.results > sample1
cut -f 7 sample2.genes.results > sample2
cut -f 7 sample3.genes.results > sample3
paste sample1 sample2 sample3 > FPKM
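The cut/paste approach above drops the gene_id column, so rows can only be matched by position. As an alternative, here is a minimal Python sketch that builds the same kind of table keyed by gene_id. It relies on the column names 'gene_id', 'TPM' and 'FPKM' from the header line described in the help text; the function names are hypothetical, and the sketch assumes every file was quantified against the same reference (a missing gene raises KeyError).

```python
import csv


def read_column(path, column="TPM", id_column="gene_id"):
    """Read one value column from an RSEM *.results file, keyed by ID."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[id_column]: row[column] for row in reader}


def build_matrix(paths, column="TPM", id_column="gene_id"):
    """Return rows (header first) with one value column per input file."""
    if not paths:
        raise ValueError("at least one results file is required")
    columns = [read_column(p, column, id_column) for p in paths]
    rows = [[id_column] + list(paths)]
    # Row order follows the first file; all files must share the same IDs.
    for item_id in columns[0]:
        rows.append([item_id] + [col[item_id] for col in columns])
    return rows


def write_matrix(rows, out_path):
    """Write the rows as a tab-separated table."""
    with open(out_path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
```

For example, `write_matrix(build_matrix(["sample1.genes.results", "sample2.genes.results", "sample3.genes.results"]), "TPM.matrix")` would write a TPM table; pass `column="FPKM"` for FPKM, or `id_column="transcript_id"` with isoforms.results files.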
*2
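However, some versions of rsem do not have this option. Also, rsem silently accepts gzipped fastq without it, but the results sometimes changed considerably (I do not know whether this happens only in my environment). If this is too confusing, pass decompressed fastq files, or run the alignment yourself and start rsem from the BAM.
Addendum
Benchmark paper
Seven competing pipelines were evaluated on two independent datasets. Performance was generally poor; two methods were clearly worse, and RSEM slightly outperformed the rest.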
*3
Parsed 400000 lines
Parsed 600000 lines
The GTF file might be corrupted!
Stop at line : NC_001320.1 RefSeq transcript 67390 126645 . ? . gene_id "OrsajCp102"; transcript_id "unassigned_transcript_687"; exception "trans-splicing"; gbkey "mRNA"; gene "rps12"; locus_tag "OrsajCp102"; transcript_biotype "mRNA";
Error Message: Strand is neither '+' nor '-'!
"rsem-extract-reference-transcripts hisat2_index 0 GCF_001433935.1.gtf None 0 GCF_001433935.1_IRGSP-1.0_genomic.fna" failed! Plase check if you provide correct parameters/options for the pipeline!
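This kind of error occurs when the GTF/GFF contains a transcript with no defined strand. Accepting that it introduces some error, you can avoid the failure by deleting every line of that gene model (e.g. grep -v "gene_name_XXXXXX" input.gtf > out.gtf).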
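When several transcripts are affected, finding them by hand is tedious. The following Python sketch automates the same workaround: it collects the transcript_id of every record whose strand column is neither '+' nor '-', then drops all lines belonging to those transcripts along with any other line that has an invalid strand. The function names are hypothetical, and this is only one possible way to do the filtering; like the grep approach, it discards those models from quantification.

```python
import re

# Matches a non-empty transcript_id attribute in GTF column 9.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def _fields(line):
    """Split a GTF record into columns; None for comments or short lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    return fields if len(fields) >= 9 else None


def find_unstranded_transcripts(lines):
    """Collect transcript_ids of records whose strand is not '+' or '-'."""
    bad = set()
    for line in lines:
        fields = _fields(line)
        if fields and fields[6] not in ("+", "-"):
            match = TRANSCRIPT_ID.search(fields[8])
            if match:
                bad.add(match.group(1))
    return bad


def filter_gtf(lines):
    """Drop invalid-strand records and every line of the affected transcripts."""
    lines = list(lines)
    bad = find_unstranded_transcripts(lines)
    kept = []
    for line in lines:
        fields = _fields(line)
        if fields:
            match = TRANSCRIPT_ID.search(fields[8])
            if fields[6] not in ("+", "-") or (match and match.group(1) in bad):
                continue
        kept.append(line)
    return kept
```

To apply it to a file, read the lines with `open("input.gtf")`, pass them to `filter_gtf`, and write the result to a new file before running rsem-prepare-reference.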