2020-12-25

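RSEM: accurate read counting at the transcript level

2021 1/9 Title corrected

2021 1/15 Commands and explanations added

2021 4/27 Benchmark paper added

2021 10/8 gzipped FASTQ option added

RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of a sequenced genome, because it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments: the number of reads, the read length, and whether reads come from one or both ends of cDNA fragments.

This work presents RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files, and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM performs better than quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to make effective use of ambiguously mapping reads, the authors show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.

RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. Because it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. It also provides valuable guidance for the cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.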

Installation

GitHub

#bioconda
mamba create -n rsem -y python=3.8
conda activate rsem
mamba install -c bioconda rsem -y
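
After installing, it can be worth confirming that the two programs used in this post are on PATH inside the activated environment. The following is a minimal sketch; `missing_tools` is a hypothetical helper name introduced here for illustration, not part of RSEM.

```shell
# Print the name of every given command that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The two RSEM programs whose help output is shown below.
# No output means both were found.
missing_tools rsem-prepare-reference rsem-calculate-expression
```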

> rsem-prepare-reference -h

NAME

rsem-prepare-reference - Prepare transcript references for RSEM and

optionally build BOWTIE/BOWTIE2/STAR/HISAT2(transcriptome) indices.

SYNOPSIS

rsem-prepare-reference [options] reference_fasta_file(s) reference_name

ARGUMENTS

reference_fasta_file(s)

Either a comma-separated list of Multi-FASTA formatted files OR a

directory name. If a directory name is specified, RSEM will read all

files with suffix ".fa" or ".fasta" in this directory. The files

should contain either the sequences of transcripts or an entire

genome, depending on whether the '--gtf' option is used.

reference_name

The name of the reference used. RSEM will generate several

reference-related files that are prefixed by this name. This name can

contain path information (e.g. '/ref/mm9').

OPTIONS

--gtf <file>

If this option is on, RSEM assumes that 'reference_fasta_file(s)'

contains the sequence of a genome, and will extract transcript

reference sequences using the gene annotations specified in <file>,

which should be in GTF format.

If this and '--gff3' options are off, RSEM will assume

'reference_fasta_file(s)' contains the reference transcripts. In this

case, RSEM assumes that name of each sequence in the Multi-FASTA files

is its transcript_id.

(Default: off)

--gff3 <file>

The annotation file is in GFF3 format instead of GTF format. RSEM will

first convert it to GTF format with the file name

'reference_name.gtf'. Please make sure that 'reference_name.gtf' does

not exist. (Default: off)

--gff3-RNA-patterns <pattern>

<pattern> is a comma-separated list of transcript categories, e.g.

"mRNA,rRNA". Only transcripts that match the <pattern> will be

extracted. (Default: "mRNA")

--gff3-genes-as-transcripts

This option is designed for untypical organisms, such as viruses,

whose GFF3 files only contain genes. RSEM will assume each gene as a

unique transcript when it converts the GFF3 file into GTF format.

--trusted-sources <sources>

<sources> is a comma-separated list of trusted sources, e.g.

"ENSEMBL,HAVANA". Only transcripts coming from these sources will be

extracted. If this option is off, all sources are accepted. (Default:

off)

--transcript-to-gene-map <file>

Use information from <file> to map from transcript (isoform) ids to

gene ids. Each line of <file> should be of the form:

gene_id transcript_id

with the two fields separated by a tab character.

If you are using a GTF file for the "UCSC Genes" gene set from the

UCSC Genome Browser, then the "knownIsoforms.txt" file (obtained from

the "Downloads" section of the UCSC Genome Browser site) is of this

format.

If this option is off, then the mapping of isoforms to genes depends

on whether the '--gtf' option is specified. If '--gtf' is specified,

then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF

file. Otherwise, RSEM assumes that each sequence in the reference

sequence files is a separate gene.

(Default: off)
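
When only a transcript FASTA is passed to rsem-prepare-reference but a GTF with "gene_id" and "transcript_id" attributes is available, the two-column file described for '--transcript-to-gene-map' can be derived from it. This is a rough sketch under stated assumptions: `gtf_to_gene_map` is a hypothetical helper, and the GTF is assumed to use the usual `gene_id "X"; transcript_id "Y";` attribute syntax on transcript or exon lines.

```shell
# Print "gene_id<TAB>transcript_id" once per transcript found in a GTF file.
# Assumes attributes look like: gene_id "X"; transcript_id "Y";
gtf_to_gene_map() {
    awk -F '\t' '
        $3 == "transcript" || $3 == "exon" {
            gene = ""; tx = ""
            if (match($9, /gene_id "[^"]+"/))
                gene = substr($9, RSTART + 9, RLENGTH - 10)
            if (match($9, /transcript_id "[^"]+"/))
                tx = substr($9, RSTART + 15, RLENGTH - 16)
            # report each transcript only the first time it is seen
            if (gene != "" && tx != "" && !seen[tx]++)
                print gene "\t" tx
        }
    ' "$1"
}

# Tiny made-up GTF: one gene with two isoforms.
demo_gtf=$(mktemp)
printf 'chr1\tdemo\texon\t1\t100\t.\t+\t.\tgene_id "geneA"; transcript_id "txA1";\n' > "$demo_gtf"
printf 'chr1\tdemo\texon\t200\t300\t.\t+\t.\tgene_id "geneA"; transcript_id "txA1";\n' >> "$demo_gtf"
printf 'chr1\tdemo\texon\t1\t100\t.\t+\t.\tgene_id "geneA"; transcript_id "txA2";\n' >> "$demo_gtf"
gtf_to_gene_map "$demo_gtf"
rm -f "$demo_gtf"
```

The output can be redirected to a file and passed to '--transcript-to-gene-map'.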

--allele-to-gene-map <file>

Use information from <file> to provide gene_id and transcript_id

information for each allele-specific transcript. Each line of <file>

should be of the form:

gene_id transcript_id allele_id

with the fields separated by a tab character.

This option is designed for quantifying allele-specific expression. It

is only valid if '--gtf' option is not specified. allele_id should be

the sequence names presented in the Multi-FASTA-formatted files.

(Default: off)

--polyA

Add poly(A) tails to the end of all reference isoforms. The length of

poly(A) tail added is specified by '--polyA-length' option. STAR

aligner users may not want to use this option. (Default: do not add

poly(A) tail to any of the isoforms)

--polyA-length <int>

The length of the poly(A) tails to be added. (Default: 125)

--no-polyA-subset <file>

Only meaningful if '--polyA' is specified. Do not add poly(A) tails to

those transcripts listed in <file>. <file> is a file containing a list

of transcript_ids. (Default: off)

--bowtie

Build Bowtie indices. (Default: off)

--bowtie-path <path>

The path to the Bowtie executables. (Default: the path to Bowtie

executables is assumed to be in the user's PATH environment variable)

--bowtie2

Build Bowtie 2 indices. (Default: off)

--bowtie2-path <path>

The path to the Bowtie 2 executables. (Default: the path to Bowtie 2

executables is assumed to be in the user's PATH environment variable)

--star

Build STAR indices. (Default: off)

--star-path <path>

The path to STAR's executable. (Default: the path to STAR executable

is assumed to be in user's PATH environment variable)

--star-sjdboverhang <int>

Length of the genomic sequence around annotated junction. It is only

used for STAR to build splice junctions database and not needed for

Bowtie or Bowtie2. It will be passed as the --sjdbOverhang option to

STAR. According to STAR's manual, its ideal value is

max(ReadLength)-1, e.g. for 2x101 paired-end reads, the ideal value is

101-1=100. In most cases, the default value of 100 will work as well

as the ideal value. (Default: 100)
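
The "max(ReadLength)-1" rule quoted above can be checked against real data. A minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record; `sjdb_overhang` is a hypothetical helper name.

```shell
# Print max(ReadLength) - 1 for a FASTQ file (4 lines per record assumed).
sjdb_overhang() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print max - 1 }' "$1"
}

# Made-up example with reads of length 4 and 6.
demo_fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$demo_fq"
sjdb_overhang "$demo_fq"
rm -f "$demo_fq"
```

As the help text notes, the default of 100 is expected to work in most cases, so this is mainly useful as a sanity check for unusually long reads.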

--hisat2-hca

Build HISAT2 indices on the transcriptome according to Human Cell

Atlas (HCA) SMART-Seq2 pipeline. (Default: off)

--hisat2-path <path>

The path to the HISAT2 executables. (Default: the path to HISAT2

executables is assumed to be in the user's PATH environment variable)

-p/--num-threads <int>

Number of threads to use for building STAR's genome indices. (Default: 1)

-q/--quiet

Suppress the output of logging information. (Default: off)

-h/--help

Show help information.

PRIOR-ENHANCED RSEM OPTIONS

--prep-pRSEM

A Boolean indicating whether to prepare reference files for pRSEM,

including building Bowtie indices for a genome and selecting training

set isoforms. The index files will be used for aligning ChIP-seq reads

in prior-enhanced RSEM and the training set isoforms will be used for

learning prior. A path to Bowtie executables and a mappability file in

bigWig format are required when this option is on. Currently, Bowtie2

is not supported for prior-enhanced RSEM. (Default: off)

--mappability-bigwig-file <string>

Full path to a whole-genome mappability file in bigWig format. This

file is required for running prior-enhanced RSEM. It is used for

selecting a training set of isoforms for prior-learning. This file can

be either downloaded from UCSC Genome Browser or generated by GEM

(Derrien et al., 2012, PLoS One). (Default: "")

DESCRIPTION

This program extracts/preprocesses the reference sequences for RSEM and

prior-enhanced RSEM. It can optionally build Bowtie indices (with

'--bowtie' option) and/or Bowtie 2 indices (with '--bowtie2' option) using

their default parameters. It can also optionally build STAR indices (with

'--star' option) using parameters from ENCODE3's STAR-RSEM pipeline. For

prior-enhanced RSEM, it can build Bowtie genomic indices and select

training set isoforms (with options '--prep-pRSEM' and

'--mappability-bigwig-file <string>'). If an alternative aligner is to be

used, indices for that particular aligner can be built from either

'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (see OUTPUT for

details). This program is used in conjunction with the

'rsem-calculate-expression' program.

OUTPUT

This program will generate 'reference_name.grp', 'reference_name.ti',

'reference_name.transcripts.fa', 'reference_name.seq',

'reference_name.chrlist' (if '--gtf' is on), 'reference_name.idx.fa',

'reference_name.n2g.idx.fa', optional Bowtie/Bowtie 2 index files, and

optional STAR index files.

'reference_name.grp', 'reference_name.ti', 'reference_name.seq', and

'reference_name.chrlist' are used by RSEM internally.

'reference_name.transcripts.fa' contains the extracted reference

transcripts in Multi-FASTA format. Poly(A) tails are not added and it may

contain lower case bases in its sequences if the corresponding genomic

regions are soft-masked.

'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' are used by

aligners to build their own indices. In these two files, all sequence

bases are converted into upper case. In addition, poly(A) tails are added

if '--polyA' option is set. The only difference between

'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' is that

'reference_name.n2g.idx.fa' in addition converts all 'N' characters to 'G'

characters. This conversion is in particular desired for aligners (e.g.

Bowtie) that do not allow reads to overlap with 'N' characters in the

reference sequences. Otherwise, 'reference_name.idx.fa' should be used to

build the aligner's index files. RSEM uses 'reference_name.idx.fa' to

build Bowtie 2 indices and 'reference_name.n2g.idx.fa' to build Bowtie

indices. For visualizing the transcript-coordinate-based BAM files

generated by RSEM in IGV, 'reference_name.idx.fa' should be imported as a

"genome" (see Visualization section in README.md for details).

If the whole genome is indexed for prior-enhanced RSEM, all the index

files will be generated with prefix as 'reference_name_prsem'. Selected

isoforms for training set are listed in the file

'reference_name_prsem.training_tr_crd'
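
The difference between 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' described in this OUTPUT section can be illustrated with two small filters. This is only a sketch of the transformation as the help text describes it, not RSEM's own code; poly(A) handling is ignored, and `to_idx_style` and `to_n2g_style` are hypothetical names.

```shell
# Upper-case sequence lines; leave FASTA header lines (">...") untouched.
to_idx_style() {
    awk '/^>/ { print; next } { print toupper($0) }' "$1"
}

# Same, and additionally replace every N with G.
to_n2g_style() {
    awk '/^>/ { print; next }
         { s = toupper($0); gsub(/N/, "G", s); print s }' "$1"
}

# Made-up soft-masked transcript containing ambiguous bases.
demo_fa=$(mktemp)
printf '>tx1\nacgNnt\n' > "$demo_fa"
to_idx_style "$demo_fa"
to_n2g_style "$demo_fa"
rm -f "$demo_fa"
```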

EXAMPLES

1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version

of the mouse genome. We have downloaded the UCSC Genes transcript

annotations in GTF format (as mm9.gtf) using the Table Browser and the

knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all

chromosome files for mm9 in the directory '/data/mm9'. We want to put the

generated reference files under '/ref' with name 'mouse_0'. We do not add

any poly(A) tails. Please note that GTF files generated from UCSC's Table

Browser do not contain isoform-gene relationship information. For the UCSC

Genes annotation, this information can be obtained from the

knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie

executables are found in '/sw/bowtie'.

There are two ways to write the command:

rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --bowtie \
    --bowtie-path /sw/bowtie \
    /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
    /ref/mouse_0

rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --bowtie \
    --bowtie-path /sw/bowtie \
    /data/mm9 \
    /ref/mouse_0

2) Suppose we also want to build Bowtie 2 indices in the above example and

Bowtie 2 executables are found in '/sw/bowtie2', the command will be:

rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --bowtie \
    --bowtie-path /sw/bowtie \
    --bowtie2 \
    --bowtie2-path /sw/bowtie2 \
    /data/mm9 \
    /ref/mouse_0

3) Suppose we want to build STAR indices in the above example and save

index files under '/ref' with name 'mouse_0'. Assuming STAR executable is

'/sw/STAR', the command will be:

rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --star \
    --star-path /sw/STAR \
    -p 8 \
    /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
    /ref/mouse_0

rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --star \
    --star-path /sw/STAR \
    -p 8 \
    /data/mm9 \
    /ref/mouse_0

STAR genome index files will be saved under '/ref/'.

4) Suppose we want to prepare references for prior-enhanced RSEM in the

above example. In this scenario, both STAR and Bowtie are required to

build genomic indices - STAR for RNA-seq reads and Bowtie for ChIP-seq

reads. Assuming their executables are under '/sw/STAR' and '/sw/Bowtie',

respectively. Also, assuming the mappability file for mouse genome is

'/data/mm9.bigWig'. The command will be:

rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --star \
    --star-path /sw/STAR \
    -p 8 \
    --prep-pRSEM \
    --bowtie-path /sw/Bowtie \
    --mappability-bigwig-file /data/mm9.bigWig \
    /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
    /ref/mouse_0

rsem-prepare-reference --gtf mm9.gtf \
    --transcript-to-gene-map knownIsoforms.txt \
    --star \
    --star-path /sw/STAR \
    -p 8 \
    --prep-pRSEM \
    --bowtie-path /sw/Bowtie \
    --mappability-bigwig-file /data/mm9.bigWig \
    /data/mm9 \
    /ref/mouse_0

Both STAR and Bowtie's index files will be saved under '/ref/'. Bowtie

files will have name prefix 'mouse_0_prsem'

5) Suppose we only have transcripts from EST tags stored in 'mm9.fasta'

and isoform-gene information stored in 'mapping.txt'. We want to add 125bp

long poly(A) tails to all transcripts. The reference_name is set as

'mouse_125'. In addition, we do not want to build Bowtie/Bowtie 2 indices,

and will use an alternative aligner to align reads against either

'mouse_125.idx.fa' or 'mouse_125.n2g.idx.fa':

rsem-prepare-reference --transcript-to-gene-map mapping.txt \
    --polyA \
    mm9.fasta \
    mouse_125

> rsem-calculate-expression -h

NAME

rsem-calculate-expression - Estimate gene and isoform expression from

RNA-Seq data.

SYNOPSIS

rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name

rsem-calculate-expression [options] --paired-end upstream_read_file(s) downstream_read_file(s) reference_name sample_name

rsem-calculate-expression [options] --alignments [--paired-end] input reference_name sample_name

ARGUMENTS

upstream_read_file(s)

Comma-separated list of files containing single-end reads or upstream

reads for paired-end data. By default, these files are assumed to be

in FASTQ format. If the --no-qualities option is specified, then FASTA

format is expected.

downstream_read_file(s)

Comma-separated list of files containing downstream reads which are

paired with the upstream reads. By default, these files are assumed to

be in FASTQ format. If the --no-qualities option is specified, then

FASTA format is expected.

input

SAM/BAM/CRAM formatted input file. If "-" is specified for the

filename, the input is instead assumed to come from standard input.

RSEM requires all alignments of the same read to be grouped together. For

paired-end reads, RSEM also requires the two mates of any alignment be

adjacent. In addition, RSEM does not allow the SEQ and QUAL fields to

be empty. See Description section for how to make input file obey

RSEM's requirements.
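
The adjacency requirement for paired-end input can be spot-checked before running RSEM. The sketch below is deliberately simplified and is not RSEM's own validation logic: it assumes every record is one mate of a properly paired alignment and only checks that consecutive alignment lines share a read name. `mates_adjacent` is a hypothetical helper name.

```shell
# Exit status 0 if alignment lines come in consecutive pairs with the
# same read name (QNAME, column 1); non-zero otherwise.
mates_adjacent() {
    awk -F '\t' '
        /^@/ { next }                      # skip SAM header lines
        { n++ }
        n % 2 == 1 { first = $1; next }    # remember first mate of a pair
        $1 != first { bad = 1; exit }      # second mate must share the name
        END { if (bad || n % 2) exit 1 }   # also fail on an odd record count
    ' "$1"
}

# Made-up two-column records; a real SAM line has at least 11 columns.
demo_sam=$(mktemp)
printf '@HD\tVN:1.0\nr1\t99\nr1\t147\nr2\t99\nr2\t147\n' > "$demo_sam"
if mates_adjacent "$demo_sam"; then echo "mates adjacent"; else echo "mates NOT adjacent"; fi
rm -f "$demo_sam"
```

For BAM or CRAM input, the records would first have to be converted to SAM text, for example with samtools view.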

reference_name

The name of the reference used. The user must have run

'rsem-prepare-reference' with this reference_name before running this

program.

sample_name

The name of the sample analyzed. All output files are prefixed by this

name (e.g., sample_name.genes.results)

BASIC OPTIONS

--paired-end

Input reads are paired-end reads. (Default: off)

--no-qualities

Input reads do not contain quality scores. (Default: off)

--strandedness <none|forward|reverse>

This option defines the strandedness of the RNA-Seq reads. It

recognizes three values: 'none', 'forward', and 'reverse'. 'none'

refers to non-strand-specific protocols. 'forward' means all

(upstream) reads are derived from the forward strand. 'reverse' means

all (upstream) reads are derived from the reverse strand. If

'forward'/'reverse' is set, the '--norc'/'--nofw' Bowtie/Bowtie 2

option will also be enabled to avoid aligning reads to the opposite

strand. For Illumina TruSeq Stranded protocols, please use 'reverse'.

(Default: 'none')

-p/--num-threads <int>

Number of threads to use. Both Bowtie/Bowtie2, expression estimation

and 'samtools sort' will use this many threads. (Default: 1)

--alignments

Input file contains alignments in SAM/BAM/CRAM format. The exact file

format will be determined automatically. (Default: off)

--fai <file>

If the header section of input alignment file does not contain

reference sequence information, this option should be turned on.

<file> is a FAI format file containing each reference sequence's name

and length. Please refer to the SAM official website for the details

of FAI format. (Default: off)

--bowtie2

Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM

does not handle indel, local and discordant alignments, the Bowtie2

parameters are set in a way to avoid those alignments. In particular,

we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1

--score-min L,0,-0.1' by default. The last parameter of '--score-min',

'-0.1', is the negative of maximum mismatch rate. This rate can be set

by option '--bowtie2-mismatch-rate'. If reads are paired-end, we

additionally use options '--no-mixed' and '--no-discordant'. (Default:

off)

--star

Use STAR to align reads. Alignment parameters are from ENCODE3's

STAR-RSEM pipeline. To save computational time and memory resources,

STAR's Output BAM file is unsorted. It is stored in RSEM's temporary

directory with name as 'sample_name.bam'. Each STAR job will have its

own private copy of the genome in memory. (Default: off)

--hisat2-hca

Use HISAT2 to align reads to the transcriptome according to Human Cell

Atlas SMART-Seq2 pipeline. In particular, we use HISAT parameters "-k

10 --secondary --rg-id=$sampleToken --rg SM:$sampleToken --rg

LB:$sampleToken --rg PL:ILLUMINA --rg PU:$sampleToken --new-summary

--summary-file $sampleName.log --met-file $sampleName.hisat2.met.txt

--met 5 --mp 1,1 --np 1 --score-min L,0,-0.1 --rdg 99999999,99999999

--rfg 99999999,99999999 --no-spliced-alignment --no-softclip --seed

12345". If inputs are paired-end reads, we additionally use parameters

"--no-mixed --no-discordant". (Default: off)

--append-names

If gene_name/transcript_name is available, append it to the end of

gene_id/transcript_id (separated by '_') in files

'sample_name.isoforms.results' and 'sample_name.genes.results'.

(Default: off)

--seed <uint32>

Set the seed for the random number generators used in calculating

posterior mean estimates and credibility intervals. The seed must be a

non-negative 32 bit integer. (Default: off)

--single-cell-prior

By default, RSEM uses Dirichlet(1) as the prior to calculate posterior

mean estimates and credibility intervals. However, far fewer genes are

expressed in single cell RNA-Seq data. Thus, if you want to compute

posterior mean estimates and/or credibility intervals and you have

single-cell RNA-Seq data, you are recommended to turn on this option.

Then RSEM will use Dirichlet(0.1) as the prior which encourages the

sparsity of the expression levels. (Default: off)

--calc-pme

Run RSEM's collapsed Gibbs sampler to calculate posterior mean

estimates. (Default: off)

--calc-ci

Calculate 95% credibility intervals and posterior mean estimates. The

credibility level can be changed by setting '--ci-credibility-level'.

(Default: off)

-q/--quiet

Suppress the output of logging information. (Default: off)

-h/--help

Show help information.

--version

Show version information.

OUTPUT OPTIONS

--sort-bam-by-read-name

Sort BAM file aligned under transcript coordinate by read name.

Setting this option on will produce deterministic maximum likelihood

estimations from independent runs. Note that sorting will take long

time and lots of memory. (Default: off)

--no-bam-output

Do not output any BAM file. (Default: off)

--sampling-for-bam

When RSEM generates a BAM file, instead of outputting all alignments a

read has with their posterior probabilities, one alignment is sampled

according to the posterior probabilities. The sampling procedure

includes the alignment to the "noise" transcript, which does not

appear in the BAM file. Only the sampled alignment has a weight of 1.

All other alignments have weight 0. If the "noise" transcript is

sampled, all alignments appeared in the BAM file should have weight 0.

(Default: off)

--output-genome-bam

Generate a BAM file, 'sample_name.genome.bam', with alignments mapped

to genomic coordinates and annotated with their posterior

probabilities. In addition, RSEM will call samtools (included in RSEM

package) to sort and index the bam file.

'sample_name.genome.sorted.bam' and

'sample_name.genome.sorted.bam.bai' will be generated. (Default: off)

--sort-bam-by-coordinate

Sort RSEM generated transcript and genome BAM files by coordinates and

build associated indices. (Default: off)

--sort-bam-memory-per-thread <string>

Set the maximum memory per thread that can be used by 'samtools sort'.

<string> represents the memory and accepts suffixes 'K/M/G'. RSEM will

pass <string> to the '-m' option of 'samtools sort'. Note that the

default used here is different from the default used by samtools.

(Default: 1G)

ALIGNER OPTIONS

--seed-length <int>

Seed length used by the read aligner. Providing the correct value is

important for RSEM. If RSEM runs Bowtie, it uses this value for

Bowtie's seed length parameter. Any read with its or at least one of

its mates' (for paired-end reads) length less than this value will be

ignored. If poly(A) tails are not added to the references, the minimum

allowed value is 5, otherwise, the minimum allowed value is 25. Note

that this script will only check if the value >= 5 and give a warning

message if the value < 25 but >= 5. (Default: 25)
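
Because reads shorter than '--seed-length' are ignored, it can help to count how many reads would be affected before choosing a value. A minimal sketch, assuming an uncompressed four-line-per-record FASTQ; `count_short_reads` is a hypothetical helper, and its second argument defaults to RSEM's documented default of 25.

```shell
# Count FASTQ reads shorter than a given seed length (default 25).
count_short_reads() {
    awk -v min="${2:-25}" 'NR % 4 == 2 && length($0) < min { n++ }
                           END { print n + 0 }' "$1"
}

# Made-up example: one 4-base read and one 30-base read.
demo_fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n' > "$demo_fq"
printf '@r2\nACGTACGTACGTACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n' >> "$demo_fq"
count_short_reads "$demo_fq"
rm -f "$demo_fq"
```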

--phred33-quals

Input quality scores are encoded as Phred+33. This option is used by

Bowtie, Bowtie 2 and HISAT2. (Default: on)

--phred64-quals

Input quality scores are encoded as Phred+64 (default for GA Pipeline

ver. >= 1.3). This option is used by Bowtie, Bowtie 2 and HISAT2.

(Default: off)

--solexa-quals

Input quality scores are solexa encoded (from GA Pipeline ver. < 1.3).

This option is used by Bowtie, Bowtie 2 and HISAT2. (Default: off)

--bowtie-path <path>

The path to the Bowtie executables. (Default: the path to the Bowtie

executables is assumed to be in the user's PATH environment variable)

--bowtie-n <int>

(Bowtie parameter) max # of mismatches in the seed. (Range: 0-3,

Default: 2)

--bowtie-e <int>

(Bowtie parameter) max sum of mismatch quality scores across the

alignment. (Default: 99999999)

--bowtie-m <int>

(Bowtie parameter) suppress all alignments for a read if > <int> valid

alignments exist. (Default: 200)

--bowtie-chunkmbs <int>

(Bowtie parameter) memory allocated for best first alignment

calculation (Default: 0 - use Bowtie's default)

--bowtie2-path <path>

(Bowtie 2 parameter) The path to the Bowtie 2 executables. (Default:

the path to the Bowtie 2 executables is assumed to be in the user's

PATH environment variable)

--bowtie2-mismatch-rate <double>

(Bowtie 2 parameter) The maximum mismatch rate allowed. (Default: 0.1)

--bowtie2-k <int>

(Bowtie 2 parameter) Find up to <int> alignments per read. (Default:

200)

--bowtie2-sensitivity-level <string>

(Bowtie 2 parameter) Set Bowtie 2's preset options in --end-to-end

mode. This option controls how hard Bowtie 2 tries to find alignments.

<string> must be one of "very_fast", "fast", "sensitive" and

"very_sensitive". The four candidates correspond to Bowtie 2's

"--very-fast", "--fast", "--sensitive" and "--very-sensitive" options.

(Default: "sensitive" - use Bowtie 2's default)

--star-path <path>

The path to STAR's executable. (Default: the path to STAR executable

is assumed to be in user's PATH environment variable)

--star-gzipped-read-file

(STAR parameter) Input read file(s) is compressed by gzip. (Default:

off)

--star-bzipped-read-file

(STAR parameter) Input read file(s) is compressed by bzip2. (Default:

off)
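
The two compression options above are marked as STAR parameters, so the sketch below only applies when '--star' is used. It picks the flag from the file extension and prints a command line as a dry run, without executing RSEM. `star_compression_flag` is a hypothetical helper, and the read file names, '/ref/mouse_0' and 'sample1' are placeholders.

```shell
# Print the STAR read-compression flag matching a file name, or nothing.
star_compression_flag() {
    case "$1" in
        *.gz)  printf '%s\n' '--star-gzipped-read-file' ;;
        *.bz2) printf '%s\n' '--star-bzipped-read-file' ;;
        *)     : ;;
    esac
}

# Dry run: print (do not execute) a paired-end command line.
# The substitution is left unquoted on purpose so that an empty
# result adds no argument.
echo rsem-calculate-expression --star $(star_compression_flag reads_1.fq.gz) \
     --paired-end reads_1.fq.gz reads_2.fq.gz /ref/mouse_0 sample1
```

Removing the leading `echo` would run the command, provided the reference was built beforehand with rsem-prepare-reference and the '--star' option.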

--star-output-genome-bam

(STAR parameter) Save the BAM file from STAR alignment under genomic

coordinate to 'sample_name.STAR.genome.bam'. This file is NOT sorted

by genomic coordinate. In this file, according to STAR's manual,

'paired ends of an alignment are always adjacent, and multiple

alignments of a read are adjacent as well'. (Default: off)

--hisat2-path <path>

The path to HISAT2's executable. (Default: the path to HISAT2

executable is assumed to be in user's PATH environment variable)

ADVANCED OPTIONS

--tag <string>

The name of the optional field used in the SAM input for identifying a

read with too many valid alignments. The field should have the format

<tagName>:i:<value>, where a <value> bigger than 0 indicates a read

with too many alignments. (Default: "")

--fragment-length-min <int>

Minimum read/insert length allowed. This is also the value for the

Bowtie/Bowtie2 -I option. (Default: 1)

--fragment-length-max <int>

Maximum read/insert length allowed. This is also the value for the

Bowtie/Bowtie 2 -X option. (Default: 1000)

--fragment-length-mean <double>

(single-end data only) The mean of the fragment length distribution,

which is assumed to be a Gaussian. (Default: -1, which disables use of

the fragment length distribution)

--fragment-length-sd <double>

(single-end data only) The standard deviation of the fragment length

distribution, which is assumed to be a Gaussian. (Default: 0, which

assumes that all fragments are of the same length, given by the

rounded value of --fragment-length-mean)

--estimate-rspd

Set this option if you want to estimate the read start position

distribution (RSPD) from data. Otherwise, RSEM will use a uniform

RSPD. (Default: off)

--num-rspd-bins <int>

Number of bins in the RSPD. Only relevant when '--estimate-rspd' is

specified. Use of the default setting is recommended. (Default: 20)

--gibbs-burnin <int>

The number of burn-in rounds for RSEM's Gibbs sampler. Each round

passes over the entire data set once. If RSEM can use multiple

threads, multiple Gibbs samplers will start at the same time and all

samplers share the same burn-in number. (Default: 200)

--gibbs-number-of-samples <int>

The total number of count vectors RSEM will collect from its Gibbs

samplers. (Default: 1000)

--gibbs-sampling-gap <int>

The number of rounds between two successive count vectors RSEM collects.

If the count vector after round N is collected, the count vector after

round N + <int> will also be collected. (Default: 1)

--ci-credibility-level <double>

The credibility level for credibility intervals. (Default: 0.95)

--ci-memory <int>

Maximum size (in memory, MB) of the auxiliary buffer used for

computing credibility intervals (CI). (Default: 1024)

--ci-number-of-samples-per-count-vector <int>

The number of read generating probability vectors sampled per sampled

count vector. The credibility intervals are calculated by first sampling

P(C | D) and then sampling P(Theta | C) for each sampled count vector.

This option controls how many Theta vectors are sampled per sampled

count vector. (Default: 50)
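
As a quick worked example of how the two sampling options combine: if every collected count vector is used, the number of Theta vectors behind the credibility intervals is the product of '--gibbs-number-of-samples' and '--ci-number-of-samples-per-count-vector'. The shell variable names below are made up for illustration; the values are the documented defaults.

```shell
# Defaults taken from the option descriptions above.
gibbs_number_of_samples=1000
ci_samples_per_count_vector=50

# Theta vectors sampled in total = count vectors x Theta vectors per count vector.
total_theta=$((gibbs_number_of_samples * ci_samples_per_count_vector))
echo "$total_theta"
```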

--keep-intermediate-files

Keep temporary files generated by RSEM. RSEM creates a temporary

directory, 'sample_name.temp', into which it puts all intermediate

output files. If this directory already exists, RSEM overwrites all

files generated by previous RSEM runs inside of it. By default, after

RSEM finishes, the temporary directory is deleted. Set this option to

prevent the deletion of this directory and the intermediate files

inside of it. (Default: off)

--temporary-folder <string>

Set where to put the temporary files generated by RSEM. If the folder

specified does not exist, RSEM will try to create it. (Default:

sample_name.temp)

--time

Output time consumed by each step of RSEM to 'sample_name.time'.

(Default: off)

PRIOR-ENHANCED RSEM OPTIONS

--run-pRSEM

Running prior-enhanced RSEM (pRSEM). Prior parameters, i.e. isoform's

initial pseudo-count for RSEM's Gibbs sampling, will be learned from

input RNA-seq data and an external data set. When pRSEM needs and only

needs ChIP-seq peak information to partition isoforms (e.g. in pRSEM's

default partition model), either ChIP-seq peak file (with the

'--chipseq-peak-file' option) or ChIP-seq FASTQ files for target and

input and the path for Bowtie executables are required (with the

'--chipseq-target-read-files <string>', '--chipseq-control-read-files

<string>', and '--bowtie-path <path>' options), otherwise, ChIP-seq

FASTQ files for target and control and the path to Bowtie executables

are required. (Default: off)

--chipseq-peak-file <string>

Full path to a ChIP-seq peak file in ENCODE's narrowPeak, i.e. BED6+4,

format. This file is used when running prior-enhanced RSEM in the

default two-partition model. It partitions isoforms by whether they

have ChIP-seq overlapping with their transcription start site region

or not. Each partition will have its own prior parameter learned from

a training set. This file can be either gzipped or ungzipped.

(Default: "")

--chipseq-target-read-files <string>

Comma-separated full path of FASTQ read file(s) for ChIP-seq target.

This option is used when running prior-enhanced RSEM. It provides

information to calculate ChIP-seq peaks and signals. The file(s) can

be either ungzipped or gzipped with a suffix '.gz' or '.gzip'. The

options '--bowtie-path <path>' and '--chipseq-control-read-files

<string>' must be defined when this option is specified. (Default: "")

--chipseq-control-read-files <string>

Comma-separated full path of FASTQ read file(s) for ChIP-seq control.

This option is used when running prior-enhanced RSEM. It provides

information to call ChIP-seq peaks. The file(s) can be either

ungzipped or gzipped with a suffix '.gz' or '.gzip'. The options

'--bowtie-path <path>' and '--chipseq-target-read-files <string>' must

be defined when this option is specified. (Default: "")

--chipseq-read-files-multi-targets <string>

Comma-separated full path of FASTQ read files for multiple ChIP-seq

targets. This option is used when running prior-enhanced RSEM, where

prior is learned from multiple complementary data sets. It provides

information to calculate ChIP-seq signals. All files can be either

ungzipped or gzipped with a suffix '.gz' or '.gzip'. When this option

is specified, the option '--bowtie-path <path>' must be defined and

the option '--partition-model <string>' will be set to 'cmb_lgt'

automatically. (Default: "")

--chipseq-bed-files-multi-targets <string>

Comma-separated full path of BED files for multiple ChIP-seq targets.

This option is used when running prior-enhanced RSEM, where prior is

learned from multiple complementary data sets. It provides information

of ChIP-seq signals and must have at least the first six BED columns.

All files can be either ungzipped or gzipped with a suffix '.gz' or

'.gzip'. When this option is specified, the option '--partition-model

<string>' will be set to 'cmb_lgt' automatically. (Default: "")

--cap-stacked-chipseq-reads

Keep a maximum number of ChIP-seq reads that aligned to the same

genomic interval. This option is used when running prior-enhanced

RSEM, where prior is learned from multiple complementary data sets.

This option is only in use when either

'--chipseq-read-files-multi-targets <string>' or

'--chipseq-bed-files-multi-targets <string>' is specified. (Default:

off)

--n-max-stacked-chipseq-reads <int>

The maximum number of stacked ChIP-seq reads to keep. This option is

used when running prior-enhanced RSEM, where prior is learned from

multiple complementary data sets. This option is only in use when the

option '--cap-stacked-chipseq-reads' is set. (Default: 5)

--partition-model <string>

A keyword to specify the partition model used by prior-enhanced RSEM.

It must be one of the following keywords:

- pk

Partitioned by whether an isoform has a ChIP-seq peak overlapping

with its transcription start site (TSS) region. The TSS region is

defined as [TSS-500bp, TSS+500bp]. For simplicity, we refer to this

type of peak as 'TSS peak' when explaining other keywords.

- pk_lgtnopk

First partitioned by TSS peak. Then, for isoforms in the 'no TSS

peak' set, a logistic model is employed to further classify them

into two partitions.

- lm3, lm4, lm5, or lm6

Based on their ChIP-seq signals, isoforms are classified into 3, 4,

5, or 6 partitions by a linear regression model.

- nopk_lm2pk, nopk_lm3pk, nopk_lm4pk, or

nopk_lm5pk

First partitioned by TSS peak. Then, for isoforms in the 'with TSS

peak' set, a linear regression model is employed to further classify

them into 2, 3, 4, or 5 partitions.

- pk_lm2nopk, pk_lm3nopk, pk_lm4nopk, or

pk_lm5nopk

First partitioned by TSS peak. Then, for isoforms in the 'no TSS

peak' set, a linear regression model is employed to further classify

them into 2, 3, 4, or 5 partitions.

- cmb_lgt

Using a logistic regression to combine TSS signals from multiple

complementary data sets and partition training set isoforms into

'expressed' and 'not expressed'. This partition model is only in use

when either '--chipseq-read-files-multi-targets <string>' or

'--chipseq-bed-files-multi-targets <string>' is specified.

Parameters for all the above models are learned from a training set.

For detailed explanations, please see prior-enhanced RSEM's paper.

(Default: 'pk')

DEPRECATED OPTIONS

The options in this section are deprecated. They are here only for

compatibility reasons and may be removed in future releases.

--sam

Inputs are alignments in SAM format. (Default: off)

--bam

Inputs are alignments in BAM format. (Default: off)

--strand-specific

Equivalent to '--strandedness forward'. (Default: off)

--forward-prob <double>

Probability of generating a read from the forward strand of a

transcript. Set to 1 for a strand-specific protocol where all

(upstream) reads are derived from the forward strand, 0 for a

strand-specific protocol where all (upstream) reads are derived from

the reverse strand, or 0.5 for a non-strand-specific protocol.

(Default: off)
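When updating an old command line, the three '--forward-prob' values named above correspond to the three '--strandedness' keywords. A minimal sketch of that translation follows; the function name is hypothetical, and any other probability is rejected because the help text gives no keyword for it.

```python
def strandedness_from_forward_prob(forward_prob: float) -> str:
    """Translate a deprecated --forward-prob value into a --strandedness keyword.

    Only the three values described in the help text are handled.
    """
    mapping = {1.0: "forward", 0.0: "reverse", 0.5: "none"}
    if forward_prob not in mapping:
        raise ValueError(
            "no --strandedness keyword corresponds to --forward-prob %r" % forward_prob
        )
    return mapping[forward_prob]
```

For example, a script that used '--forward-prob 0' would now pass '--strandedness reverse'.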

DESCRIPTION

In its default mode, this program aligns input reads against a reference

transcriptome with Bowtie and calculates expression values using the

alignments. RSEM assumes the data are single-end reads with quality

scores, unless the '--paired-end' or '--no-qualities' options are

specified. Alternatively, users can use STAR to align reads using the

'--star' option. RSEM has provided options in 'rsem-prepare-reference' to

prepare STAR's genome indices. Users may use an alternative aligner by

specifying '--alignments', and providing an alignment file in SAM/BAM/CRAM

format. However, users should make sure that they align against the

indices generated by 'rsem-prepare-reference' and the alignment file

satisfies the requirements mentioned in ARGUMENTS section.

One simple way to make the alignment file satisfy RSEM's requirements

is to use the 'convert-sam-for-rsem' script. This script accepts

SAM/BAM/CRAM files as input and outputs a BAM file. For example, type the

following command to convert a SAM file, 'input.sam', to a ready-for-use

BAM file, 'input_for_rsem.bam':

convert-sam-for-rsem input.sam input_for_rsem

For details, please refer to 'convert-sam-for-rsem's documentation page.

NOTES

1. Users must run 'rsem-prepare-reference' with the appropriate reference

before using this program.

2. For single-end data, it is strongly recommended that the user provide

the fragment length distribution parameters (--fragment-length-mean and

--fragment-length-sd). For paired-end data, RSEM will automatically learn

a fragment length distribution from the data.

3. Some aligner parameters have default values different from their

original settings.

4. With the '--calc-pme' option, posterior mean estimates will be

calculated in addition to maximum likelihood estimates.

5. With the '--calc-ci' option, 95% credibility intervals and posterior

mean estimates will be calculated in addition to maximum likelihood

estimates.

6. The temporary directory and all intermediate files will be removed when

RSEM finishes unless '--keep-intermediate-files' is specified.

With the '--run-pRSEM' option and associated options (see section

'PRIOR-ENHANCED RSEM OPTIONS' above for details), prior-enhanced RSEM will

be running. Prior parameters will be learned from supplied external data

set(s) and assigned as initial pseudo-counts for isoforms in the

corresponding partition for Gibbs sampling.

OUTPUT

sample_name.isoforms.results

File containing isoform level expression estimates. The first line

contains column names separated by the tab character. The format of

each line in the rest of this file is:

transcript_id gene_id length effective_length expected_count TPM FPKM

IsoPct [posterior_mean_count posterior_standard_deviation_of_count

pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound

TPM_ci_upper_bound TPM_coefficient_of_quartile_variation

FPKM_ci_lower_bound FPKM_ci_upper_bound

FPKM_coefficient_of_quartile_variation]

Fields are separated by the tab character. Fields within "[]" are

optional. They will not be presented if neither '--calc-pme' nor

'--calc-ci' is set.

'transcript_id' is the transcript name of this transcript. 'gene_id'

is the gene name of the gene which this transcript belongs to (denote

this gene as its parent gene). If no gene information is provided,

'gene_id' and 'transcript_id' are the same.

'length' is this transcript's sequence length (poly(A) tail is not

counted). 'effective_length' counts only the positions that can

generate a valid fragment. If no poly(A) tail is added,

'effective_length' is equal to transcript length - mean fragment

length + 1. If one transcript's effective length is less than 1, this

transcript's effective length and abundance estimates are both set to 0.

'expected_count' is the sum, over all reads, of the posterior probability

that each read comes from this transcript. Because 1) each read

aligning to this transcript has a probability of being generated from

background noise; 2) RSEM may filter some alignable low quality reads,

the sum of expected counts for all transcripts is generally less than

the total number of reads aligned.

'TPM' stands for Transcripts Per Million. It is a relative measure of

transcript abundance. The sum of all transcripts' TPM is 1 million.

'FPKM' stands for Fragments Per Kilobase of transcript per Million

mapped reads. It is another relative measure of transcript abundance.

If we define l_bar to be the mean transcript length in a sample, which

can be calculated as

l_bar = \sum_i TPM_i / 10^6 * effective_length_i (i goes through every

transcript),

the following equation holds:

FPKM_i = 10^3 / l_bar * TPM_i.

We can see that the sum of FPKM is not a constant across samples.
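The TPM-to-FPKM relation above can be checked numerically. The sketch below follows the two formulas as written (l_bar, then FPKM_i = 10^3 / l_bar * TPM_i); the function name and the input layout (two parallel lists) are assumptions made for illustration, not part of RSEM.

```python
from typing import List, Sequence


def fpkm_from_tpm(tpm: Sequence[float], effective_length: Sequence[float]) -> List[float]:
    """Convert TPM values to FPKM using the mean transcript length l_bar."""
    if len(tpm) != len(effective_length):
        raise ValueError("tpm and effective_length must have the same length")
    # l_bar = sum_i TPM_i / 10^6 * effective_length_i
    l_bar = sum(t / 1e6 * l for t, l in zip(tpm, effective_length))
    if l_bar <= 0:
        raise ValueError("mean transcript length must be positive")
    # FPKM_i = 10^3 / l_bar * TPM_i
    return [1e3 / l_bar * t for t in tpm]
```

With two transcripts at 500,000 TPM each and effective lengths of 1,000 and 3,000, l_bar is 2,000 and both FPKM values are 250,000. Since l_bar differs between samples, the FPKM total differs as well.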

'IsoPct' stands for isoform percentage. It is the percentage of this

transcript's abundance over its parent gene's abundance. If its parent

gene has only one isoform or the gene information is not provided,

this field will be set to 100.
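As an illustration of the 'IsoPct' definition, the following sketch computes it from per-transcript TPM values. The input layout (tuples of transcript ID, gene ID, TPM) is an assumption, and returning 0 for a multi-isoform gene whose total TPM is zero is a choice made for this sketch; the help text does not say what RSEM reports in that case.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def iso_pct(rows: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Return transcript_id -> percentage of the parent gene's TPM."""
    rows = list(rows)
    gene_total = defaultdict(float)
    gene_size = defaultdict(int)
    for _, gene_id, tpm in rows:
        gene_total[gene_id] += tpm
        gene_size[gene_id] += 1

    result = {}
    for transcript_id, gene_id, tpm in rows:
        if gene_size[gene_id] == 1:
            # Single-isoform gene (or no gene information): fixed at 100.
            result[transcript_id] = 100.0
        elif gene_total[gene_id] == 0:
            # Assumption of this sketch: unexpressed gene gives 0.
            result[transcript_id] = 0.0
        else:
            result[transcript_id] = 100.0 * tpm / gene_total[gene_id]
    return result
```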

'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean

estimates calculated by RSEM's Gibbs sampler.

'posterior_standard_deviation_of_count' is the posterior standard

deviation of counts. 'IsoPct_from_pme_TPM' is the isoform percentage

calculated from 'pme_TPM' values.

'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and

'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95%

credibility intervals for TPM and FPKM values. The bounds are

inclusive (i.e. [l, u]).

'TPM_coefficient_of_quartile_variation' and

'FPKM_coefficient_of_quartile_variation' are coefficients of quartile

variation (CQV) for TPM and FPKM values. CQV is a robust way of

measuring the ratio between the standard deviation and the mean. It is

defined as

CQV := (Q3 - Q1) / (Q3 + Q1),

where Q1 and Q3 are the first and third quartiles.
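The CQV formula is simple enough to sketch directly. The quartile estimator used below (Python's `statistics.quantiles` with the 'inclusive' method) is an assumption of this example; the help text does not state which estimator RSEM applies to its Gibbs samples.

```python
import statistics
from typing import Sequence


def cqv(q1: float, q3: float) -> float:
    """Coefficient of quartile variation: (Q3 - Q1) / (Q3 + Q1)."""
    if q1 + q3 == 0:
        raise ValueError("CQV is undefined when Q1 + Q3 is zero")
    return (q3 - q1) / (q3 + q1)


def cqv_from_samples(samples: Sequence[float]) -> float:
    """Estimate CQV from sampled values (quartile method is an assumption)."""
    q1, _median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return cqv(q1, q3)
```

For quartiles Q1 = 10 and Q3 = 30 the CQV is 0.5; values close to 0 indicate a tight distribution relative to its location.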

sample_name.genes.results

File containing gene level expression estimates. The first line

contains column names separated by the tab character. The format of

each line in the rest of this file is:

gene_id transcript_id(s) length effective_length expected_count TPM

FPKM [posterior_mean_count posterior_standard_deviation_of_count

pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound

TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound

FPKM_ci_upper_bound FPKM_coefficient_of_quartile_variation]

Fields are separated by the tab character. Fields within "[]" are

optional. They will not be presented if neither '--calc-pme' nor

'--calc-ci' is set.

'transcript_id(s)' is a comma-separated list of transcript_ids

belonging to this gene. If no gene information is provided, 'gene_id'

and 'transcript_id(s)' are identical (the 'transcript_id').

A gene's 'length' and 'effective_length' are defined as the weighted

average of its transcripts' lengths and effective lengths (weighted by

'IsoPct'). A gene's abundance estimates are just the sum of its

transcripts' abundance estimates.
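The gene-level 'length' described above is a weighted average, which the following sketch spells out. The function name is hypothetical, and the fallback to an unweighted mean when every 'IsoPct' is zero is an assumption of this example rather than documented RSEM behaviour.

```python
from typing import Sequence


def gene_length(lengths: Sequence[float], iso_pcts: Sequence[float]) -> float:
    """Average of transcript lengths weighted by IsoPct."""
    if len(lengths) != len(iso_pcts) or not lengths:
        raise ValueError("need one IsoPct per transcript length")
    total_weight = sum(iso_pcts)
    if total_weight == 0:
        # Assumption: with no expressed isoform, use the plain mean.
        return sum(lengths) / len(lengths)
    return sum(l * p for l, p in zip(lengths, iso_pcts)) / total_weight
```

The same function applies to 'effective_length' by passing effective lengths instead of lengths.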

sample_name.alleles.results

Only generated when the RSEM references are built with allele-specific

transcripts.

This file contains allele level expression estimates for

allele-specific expression calculation. The first line contains column

names separated by the tab character. The format of each line in the

rest of this file is:

allele_id transcript_id gene_id length effective_length expected_count

TPM FPKM AlleleIsoPct AlleleGenePct [posterior_mean_count

posterior_standard_deviation_of_count pme_TPM pme_FPKM

AlleleIsoPct_from_pme_TPM AlleleGenePct_from_pme_TPM

TPM_ci_lower_bound TPM_ci_upper_bound

TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound

FPKM_ci_upper_bound FPKM_coefficient_of_quartile_variation]

Fields are separated by the tab character. Fields within "[]" are

optional. They will not be presented if neither '--calc-pme' nor

'--calc-ci' is set.

'allele_id' is the allele-specific name of this allele-specific

transcript.

'AlleleIsoPct' stands for allele-specific percentage on isoform level.

It is the percentage of this allele-specific transcript's abundance

over its parent transcript's abundance. If its parent transcript has

only one allele variant form, this field will be set to 100.

'AlleleGenePct' stands for allele-specific percentage on gene level.

It is the percentage of this allele-specific transcript's abundance

over its parent gene's abundance.

'AlleleIsoPct_from_pme_TPM' and 'AlleleGenePct_from_pme_TPM' have

similar meanings. They are calculated based on posterior mean

estimates.

Please note that if this file is present, the fields 'length' and

'effective_length' in 'sample_name.isoforms.results' should be

interpreted similarly as the corresponding definitions in

'sample_name.genes.results'.

sample_name.transcript.bam

Only generated when --no-bam-output is not specified.

'sample_name.transcript.bam' is a BAM-formatted file of read

alignments in transcript coordinates. The MAPQ field of each alignment

is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the

posterior probability of that alignment being the true mapping of a

read. In addition, RSEM pads a new tag ZW:f:value, where value is a

single precision floating number representing the posterior

probability. Because this file contains all alignment lines produced

by bowtie or user-specified aligners, it can also be used as a

replacement of the aligner generated BAM/SAM file.
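The MAPQ formula above can be reproduced as follows. This is a sketch of the stated expression only; treating w = 1 as the cap value 100 (to avoid taking log10 of zero) is an assumption of this example.

```python
import math


def rsem_mapq(w: float) -> int:
    """MAPQ = min(100, floor(-10 * log10(1 - w) + 0.5)) for posterior probability w."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be a probability in [0, 1]")
    if w == 1.0:
        # log10(0) is undefined; the formula's cap of 100 is used instead.
        return 100
    return min(100, math.floor(-10.0 * math.log10(1.0 - w) + 0.5))
```

A posterior of 0.9 gives MAPQ 10 and a posterior of 0.5 gives MAPQ 3, so filtering the BAM by MAPQ amounts to filtering by posterior probability.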

sample_name.transcript.sorted.bam and

sample_name.transcript.sorted.bam.bai

Only generated when --no-bam-output is not specified and

--sort-bam-by-coordinate is specified.

'sample_name.transcript.sorted.bam' and

'sample_name.transcript.sorted.bam.bai' are the sorted BAM file and

indices generated by samtools (included in RSEM package).

sample_name.genome.bam

Only generated when --no-bam-output is not specified and

--output-genome-bam is specified.

'sample_name.genome.bam' is a BAM-formatted file of read alignments in

genomic coordinates. Alignments of reads that have identical genomic

coordinates (i.e., alignments to different isoforms that share the

same genomic region) are collapsed into one alignment. The MAPQ field

of each alignment is set to min(100, floor(-10 * log10(1.0 - w) +

0.5)), where w is the posterior probability of that alignment being

the true mapping of a read. In addition, RSEM pads a new tag

ZW:f:value, where value is a single precision floating number

representing the posterior probability. If an alignment is spliced, a

XS:A:value tag is also added, where value is either '+' or '-'

indicating the strand of the transcript it aligns to.

sample_name.genome.sorted.bam and

sample_name.genome.sorted.bam.bai

Only generated when --no-bam-output is not specified, and

--sort-bam-by-coordinate and --output-genome-bam are specified.

'sample_name.genome.sorted.bam' and

'sample_name.genome.sorted.bam.bai' are the sorted BAM file and

indices generated by samtools (included in RSEM package).

sample_name.time

Only generated when --time is specified.

It contains time (in seconds) consumed by aligning reads, estimating

expression levels and calculating credibility intervals.

sample_name.log

Only generated when --alignments is not specified.

It captures alignment statistics outputted from the user-specified

aligner.

sample_name.stat

This is a folder instead of a file. All model related statistics are

stored in this folder. 'rsem-plot-model' can generate plots using

this folder.

'sample_name.stat/sample_name.cnt' contains alignment statistics. The

format and meanings of each field are described in

'cnt_file_description.txt' under RSEM directory.

'sample_name.stat/sample_name.model' stores RNA-Seq model parameters

learned from the data. The format and meanings of each field of this

file are described in 'model_file_description.txt' under RSEM

directory.

The following four output files will be generated only by

prior-enhanced RSEM

- 'sample_name.stat/sample_name_prsem.all_tr_features'

It stores isoform features for deriving and assigning pRSEM prior.

The first line is a header and the rest is one isoform per line. The

description for each column is:

* trid: transcript ID from input annotation

* geneid: gene ID from input annotation

* chrom: isoform's chromosome name

* strand: isoform's strand name

* start: isoform's end with the lowest genomic loci

* end: isoform's end with the highest genomic loci

* tss_mpp: average mappability of [TSS-500bp, TSS+500bp], where TSS

is isoform's transcription start site, i.e. 5'-end

* body_mpp: average mappability of (TSS+500bp, TES-500bp), where TES

is isoform's transcription end site, i.e. 3'-end

* tes_mpp: average mappability of [TES-500bp, TES+500bp]

* pme_count: isoform's fragment or read count from RSEM's posterior

mean estimates

* tss: isoform's TSS loci

* tss_pk: equal to 1 if isoform's [TSS-500bp, TSS+500bp] region

overlaps with a RNA Pol II peak; 0 otherwise

* is_training: equal to 1 if isoform is in the training set where

Pol II prior is learned; 0 otherwise

- 'sample_name.stat/sample_name_prsem.all_tr_prior'

It stores prior parameters for every isoform. This file does not

have a header. Each line contains a prior parameter and an isoform's

transcript ID delimited by ` # `.

- 'sample_name.stat/sample_name_uniform_prior_1.isoforms.results'

RSEM's posterior mean estimates on the isoform level with an initial

pseudo-count of one for every isoform. It is in the same format as

the 'sample_name.isoforms.results'.

- 'sample_name.stat/sample_name_uniform_prior_1.genes.results'

RSEM's posterior mean estimates on the gene level with an initial

pseudo-count of one for every isoform. It is in the same format as

the 'sample_name.genes.results'.

When learning prior from multiple external data sets in prior-enhanced

RSEM, two additional output files will be generated.

- 'sample_name.stat/sample_name.pval_LL'

It stores a p-value and a log-likelihood. The p-value indicates

whether the combination of multiple complementary data sets is

informative for RNA-seq quantification. The log-likelihood shows how

well pRSEM's Dirichlet-multinomial model fits the read counts of

partitioned training set isoforms.

- 'sample_name.stat/sample_name.lgt_mdl.RData'

It stores an R object named 'glmmdl', which is a logistic regression

model on the training set isoforms and multiple external data sets.

In addition, extra columns will be added to

'sample_name.stat/all_tr_features'

* is_expr: equal to 1 if isoform has an abundance >= 1 TPM and a

non-zero read count from RSEM's posterior mean estimates; 0

otherwise

* "$external_data_set_basename": log10 of external data's signal at

[TSS-500, TSS+500]. Signal is the number of reads aligned within

that interval and normalized to RPKM by read depth and interval

length. It will be set to -4 if no read aligned to that interval.

There are multiple columns like this one, where each represents an

external data set.

* prd_expr_prob: predicted probability from logistic regression model

on whether this isoform is expressed or not. A probability higher

than 0.5 is considered as expressed

* partition: group index, to which this isoform is partitioned

* prior: prior parameter for this isoform
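The external-data signal column described above (log10 of an RPKM-normalized read count, or -4 when no read aligns) can be sketched as below. The function name and the default interval length of 1,000 bp are assumptions: the help text gives the interval as [TSS-500, TSS+500] but not the exact length used in the normalization.

```python
import math


def log10_tss_signal(reads_in_interval: int, total_reads: int, interval_bp: int = 1000) -> float:
    """log10 RPKM of reads in the TSS interval; -4 if no read aligned there."""
    if reads_in_interval == 0:
        return -4.0
    if total_reads <= 0 or interval_bp <= 0:
        raise ValueError("total_reads and interval_bp must be positive")
    # RPKM: reads per kilobase of interval per million reads sequenced.
    rpkm = reads_in_interval / (interval_bp / 1e3) / (total_reads / 1e6)
    return math.log10(rpkm)
```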

EXAMPLES

Assume the path to the bowtie executables is in the user's PATH

environment variable. Reference files are under '/ref' with name

'mouse_125'.

1) '/data/mmliver.fq', single-end reads with quality scores. Quality

scores are encoded as for 'GA pipeline version >= 1.3'. We want to use 8

threads and generate a genome BAM file. In addition, we want to append

gene/transcript names to the result files:

rsem-calculate-expression --phred64-quals \

-p 8 \

--append-names \

--output-genome-bam \

/data/mmliver.fq \

/ref/mouse_125 \

mmliver_single_quals

2) '/data/mmliver_1.fq' and '/data/mmliver_2.fq', stranded paired-end

reads with quality scores. Suppose the library is prepared using TruSeq

Stranded Kit, which means the first mate should map to the reverse strand.

Quality scores are in SANGER format. We want to use 8 threads and do not

generate a genome BAM file:

rsem-calculate-expression -p 8 \

--paired-end \

--strandedness reverse \

/data/mmliver_1.fq \

/data/mmliver_2.fq \

/ref/mouse_125 \

mmliver_paired_end_quals

3) '/data/mmliver.fa', single-end reads without quality scores. We want to

use 8 threads:

rsem-calculate-expression -p 8 \

--no-qualities \

/data/mmliver.fa \

/ref/mouse_125 \

mmliver_single_without_quals

4) Data are the same as 1). This time we assume the bowtie executables are

under '/sw/bowtie'. We want to take a fragment length distribution into

consideration. We set the fragment length mean to 150 and the standard

deviation to 35. In addition to a BAM file, we also want to generate

credibility intervals. We allow RSEM to use 1GB of memory for CI

calculation:

rsem-calculate-expression --bowtie-path /sw/bowtie \

--phred64-quals \

--fragment-length-mean 150.0 \

--fragment-length-sd 35.0 \

-p 8 \

--output-genome-bam \

--calc-ci \

--ci-memory 1024 \

/data/mmliver.fq \

/ref/mouse_125 \

mmliver_single_quals

5) '/data/mmliver_paired_end_quals.bam', BAM-formatted alignments for

paired-end reads with quality scores. We want to use 8 threads:

rsem-calculate-expression --paired-end \

--alignments \

-p 8 \

/data/mmliver_paired_end_quals.bam \

/ref/mouse_125 \

mmliver_paired_end_quals

6) '/data/mmliver_1.fq.gz' and '/data/mmliver_2.fq.gz', paired-end reads

with quality scores and read files are compressed by gzip. We want to use

STAR to align reads and assume STAR executable is '/sw/STAR'. Suppose we

want to use 8 threads and do not generate a genome BAM file:

rsem-calculate-expression --paired-end \

--star \

--star-path /sw/STAR \

--gzipped-read-file \

-p 8 \

/data/mmliver_1.fq.gz \

/data/mmliver_2.fq.gz \

/ref/mouse_125 \

mmliver_paired_end_quals

7) In the above example, suppose we want to run prior-enhanced RSEM

instead. Assuming we want to learn priors from a ChIP-seq peak file

'/data/mmliver.narrowPeak.gz':

rsem-calculate-expression --star \

--star-path /sw/STAR \

--gzipped-read-file \

--paired-end \

--calc-pme \

--run-pRSEM \

--chipseq-peak-file /data/mmliver.narrowPeak.gz \

-p 8 \

/data/mmliver_1.fq.gz \

/data/mmliver_2.fq.gz \

/ref/mouse_125 \

mmliver_paired_end_quals

8) Similar to the example in 7), suppose we want to use the partition

model 'pk_lm2nopk' (partitioning isoforms by Pol II TSS peak first and

then partitioning 'no TSS peak' isoforms into two bins by a linear

regression model), and we want to partition isoforms by RNA Pol II's

ChIP-seq read files '/data/mmliver_PolIIRep1.fq.gz' and

'/data/mmliver_PolIIRep2.fq.gz', and the control ChIP-seq read files

'/data/mmliver_ChIPseqCtrl.fq.gz'. Also, assuming Bowtie's executables are

under '/sw/bowtie/':

rsem-calculate-expression --star \

--star-path /sw/STAR \

--gzipped-read-file \

--paired-end \

--calc-pme \

--run-pRSEM \

--chipseq-target-read-files /data/mmliver_PolIIRep1.fq.gz,/data/mmliver_PolIIRep2.fq.gz \

--chipseq-control-read-files /data/mmliver_ChIPseqCtrl.fq.gz \

--partition-model pk_lm2nopk \

--bowtie-path /sw/bowtie \

-p 8 \

/data/mmliver_1.fq.gz \

/data/mmliver_2.fq.gz \

/ref/mouse_125 \

mmliver_paired_end_quals

9) Similar to the example in 8), suppose we want to derive prior from four

histone modification ChIP-seq read data sets: '/data/H3K27Ac.fastq.gz',

'/data/H3K4me1.fastq.gz', '/data/H3K4me2.fastq.gz', and

'/data/H3K4me3.fastq.gz'. Also, assuming Bowtie's executables are under

'/sw/bowtie/':

rsem-calculate-expression --star \

--star-path /sw/STAR \

--gzipped-read-file \

--paired-end \

--calc-pme \

--run-pRSEM \

--partition-model cmb_lgt \

--chipseq-read-files-multi-targets /data/H3K27Ac.fastq.gz,/data/H3K4me1.fastq.gz,/data/H3K4me2.fastq.gz,/data/H3K4me3.fastq.gz \

--bowtie-path /sw/bowtie \

-p 8 \

/data/mmliver_1.fq.gz \

/data/mmliver_2.fq.gz \

/ref/mouse_125 \

mmliver_paired_end_quals

> rsem-generate-data-matrix

$ rsem-generate-data-matrix

Usage: rsem-generate-data-matrix sampleA.[alleles/genes/isoforms].results sampleB.[alleles/genes/isoforms].results ... > output_name.matrix

All result files should have the same file type. The 'expected_count' columns of every result file are extracted to form the data matrix.

There are many other commands as well.

> rsem-

$ rsem-

rsem-bam2readdepth

rsem-extract-reference-transcripts

rsem-get-unique

rsem-preref

rsem-scan-for-paired-end-reads

rsem-bam2wig

rsem-for-ebseq-calculate-clustering-info

rsem-gff3-to-gtf

rsem-refseq-extract-primary-assembly

rsem-simulate-reads

rsem-build-read-index

rsem-for-ebseq-find-DE

rsem-parse-alignments

rsem-run-ebseq

rsem-synthesis-reference-transcripts

rsem-calculate-credibility-intervals

rsem-for-ebseq-generate-ngvector-from-clustering-info

rsem-plot-model

rsem-run-em

rsem-tbam2gbam

rsem-calculate-expression

rsem-gen-transcript-plots

rsem-plot-transcript-wiggles

rsem-run-gibbs

rsem-control-fdr

rsem-generate-data-matrix

rsem-prepare-reference

rsem-sam-validator

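Usage

1. Indexing

Specify the GTF file, the genome FASTA, and finally the index name (the same name is reused in the later steps). The bowtie2/STAR/HISAT2 index is also built during the run (written here as xxx.index.~). With --gff3 you can supply a GFF3 annotation instead of a GTF.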

#bowtie2
rsem-prepare-reference --gtf genome.gtf --bowtie2 --bowtie2-path <path>/<to>/<your>/<bowtie2-path> -p 20 genome.fa bowtie2_index

#star
rsem-prepare-reference --gtf genome.gtf --star --star-path <path>/<to>/<your>/<star-path> -p 20 genome.fa star_index

#hisat2
rsem-prepare-reference --gtf genome.gtf --hisat2-hca --hisat2-path <path>/<to>/<your>/<hisat2-path> -p 20 genome.fa hisat2_index

--gtf If this option is on, RSEM assumes that 'reference_fasta_file(s)' contains the sequence of a genome, and will extract transcript reference sequences using the gene annotations specified in <file>, which should be in GTF format.
If this and '--gff3' options are off, RSEM will assume 'reference_fasta_file(s)' contains the reference transcripts. In this case, RSEM assumes that name of each sequence in the Multi-FASTA files is its transcript_id. (Default: off)
--gff3 The annotation file is in GFF3 format instead of GTF format. RSEM will first convert it to GTF format with the file name 'reference_name.gtf'. Please make sure that 'reference_name.gtf' does not exist. (Default: off)
--bowtie2 Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM does not handle indel, local and discordant alignments, the Bowtie2 parameters are set in a way to avoid those alignments. In particular, we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1' by default. The last parameter of '--score-min', '-0.1', is the negative of maximum mismatch rate. This rate can be set by option '--bowtie2-mismatch-rate'. If reads are paired-end, we additionally use options '--no-mixed' and '--no-discordant'. (Default: off)
--star Use STAR to align reads. Alignment parameters are from ENCODE3's STAR-RSEM pipeline. To save computational time and memory resources, STAR's output BAM file is unsorted. It is stored in RSEM's temporary directory with name as 'sample_name.bam'. Each STAR job will have its own private copy of the genome in memory. (Default: off)
--hisat2-hca Use HISAT2 to align reads to the transcriptome according to the Human Cell Atlas SMART-Seq2 pipeline. In particular, we use HISAT parameters "-k 10 --secondary --rg-id=$sampleToken --rg SM:$sampleToken --rg LB:$sampleToken --rg PL:ILLUMINA --rg PU:$sampleToken --new-summary --summary-file $sampleName.log --met-file $sampleName.hisat2.met.txt --met 5 --mp 1,1 --np 1 --score-min L,0,-0.1 --rdg 99999999,99999999 --rfg 99999999,99999999 --no-spliced-alignment --no-softclip --seed 12345". If inputs are paired-end reads, we additionally use parameters "--no-mixed --no-discordant". (Default: off)
--transcript-to-gene-map Use information from <file> to map from transcript (isoform) ids to gene ids. Each line of <file> should be of the form: gene_id transcript_id with the two fields separated by a tab character. If you are using a GTF file for the "UCSC Genes" gene set from the UCSC Genome Browser, then the "knownIsoforms.txt" file (obtained from the "Downloads" section of the UCSC Genome Browser site) is of this format. If this option is off, then the mapping of isoforms to genes depends on whether the '--gtf' option is specified. If '--gtf' is specified, then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF file. Otherwise, RSEM assumes that each sequence in the reference sequence files is a separate gene. (Default: off)
--prep-pRSEM A Boolean indicating whether to prepare reference files for pRSEM, including building Bowtie indices for a genome and selecting training set isoforms. The index files will be used for aligning ChIP-seq reads in prior-enhanced RSEM and the training set isoforms will be used for learning prior. A path to Bowtie executables and a mappability file in bigWig format are required when this option is on. Currently, Bowtie2 is not supported for prior-enhanced RSEM. (Default: off)
--mappability-bigwig-file Full path to a whole-genome mappability file in bigWig format. This file is required for running prior-enhanced RSEM. It is used for selecting a training set of isoforms for prior-learning. This file can be either downloaded from UCSC Genome Browser or generated by GEM (Derrien et al., 2012, PLoS One). (Default: "")
--strandedness <none|forward|reverse> This option defines the strandedness of the RNA-Seq reads. It recognizes three values: 'none', 'forward', and 'reverse'. 'none' refers to non-strand-specific protocols. 'forward' means all (upstream) reads are derived from the forward strand. 'reverse' means all (upstream) reads are derived from the reverse strand. If 'forward'/'reverse' is set, the '--norc'/'--nofw' Bowtie/Bowtie 2 option will also be enabled to avoid aligning reads to the opposite strand. For Illumina TruSeq Stranded protocols, please use 'reverse'. (Default: 'none')

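The index files xxx.index~ are written. Note that STAR should be run on a machine with plenty of memory (around 32 GB is often not enough). If the STAR index is built in a low-memory environment, the genomeParameter file may not be written, and without this file the next step fails with an error.

For "--transcript-to-gene-map", see the following.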

How to get --transcript-to-gene-map <file> in RSEM?
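One way to build such a map is to pull the "gene_id" and "transcript_id" attributes out of the GTF. The sketch below assumes a standard GTF attribute column of the form `key "value";` and emits the `gene_id<TAB>transcript_id` lines that the option description asks for; the function name is hypothetical and this is not necessarily the method described at the link above.

```python
import re
from typing import Iterable, List

# Matches GTF attributes written as: key "value";
_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene_map(gtf_lines: Iterable[str]) -> List[str]:
    """Return unique 'gene_id<TAB>transcript_id' lines found in a GTF."""
    seen = set()
    pairs = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attributes = dict(_ATTRIBUTE.findall(fields[8]))
        gene_id = attributes.get("gene_id")
        transcript_id = attributes.get("transcript_id")
        if not gene_id or not transcript_id:
            continue  # e.g. gene-level records without a transcript
        if transcript_id in seen:
            continue  # one line per transcript
        seen.add(transcript_id)
        pairs.append(gene_id + "\t" + transcript_id)
    return pairs
```

Writing the returned lines to a file, one per line, gives something that can be passed as `--transcript-to-gene-map <file>`. When '--gtf' is used, RSEM already reads these attributes itself, so the map is mainly useful with a transcript FASTA.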

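2. rsem-calculate-expression

After the options, specify the paired-end FASTQ files, the index name (for bowtie2, the command in step 1 named it bowtie2_index, so pass "bowtie2_index"), and the output name. Gzipped FASTQ files cannot be used directly, so either decompress them first or add the "--gzipped-read-file" option (note 2). For single-end data, drop --paired-end.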

#bowtie2
rsem-calculate-expression --paired-end -p 20 --bowtie2 --bowtie2-path <path>/<to>/<your>/<bowtie2-path> \
 sample1_R1.fq sample1_R2.fq bowtie2_index sample1

#star
rsem-calculate-expression --paired-end -p 20 --star --star-path <path>/<to>/<your>/<star-path> \
 sample1_R1.fq sample1_R2.fq star_index sample1

#hisat2
rsem-calculate-expression --paired-end -p 20 --hisat2-hca --hisat2-path <path>/<to>/<your>/<hisat2-path> \
 sample1_R1.fq sample1_R2.fq hisat2_index sample1
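When many samples are processed, the three command lines above can be generated from a script. The following is a minimal sketch that only assembles the argument list from flags already shown in this post; the function names and the default of 20 threads are choices made for this example, and nothing is executed.

```python
import shlex
from typing import List, Sequence

# Aligner switch and its path option, as used in the commands above.
_ALIGNER_FLAGS = {
    "bowtie2": ("--bowtie2", "--bowtie2-path"),
    "star": ("--star", "--star-path"),
    "hisat2": ("--hisat2-hca", "--hisat2-path"),
}


def build_rsem_command(reads: Sequence[str], reference: str, sample: str,
                       aligner: str, aligner_path: str, threads: int = 20) -> List[str]:
    """Assemble an rsem-calculate-expression argument list for one sample."""
    if aligner not in _ALIGNER_FLAGS:
        raise ValueError("unknown aligner: %s" % aligner)
    if len(reads) not in (1, 2):
        raise ValueError("expected one (single-end) or two (paired-end) read files")
    switch, path_option = _ALIGNER_FLAGS[aligner]
    command = ["rsem-calculate-expression"]
    if len(reads) == 2:
        command.append("--paired-end")  # omitted for single-end data
    command += ["-p", str(threads), switch, path_option, aligner_path]
    command += list(reads) + [reference, sample]
    return command


def render(command: Sequence[str]) -> str:
    """Shell-quoted form of the command, for logging or a job script."""
    return shlex.join(command)
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the paths are known to be correct.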

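sample1.genes.results, sample1.isoforms.results, and other files are written.

The BAM file is produced near the end of the run and saved to disk. Its reference is the transcript reference, not the genome. Because reads are mapped against transcripts, take care to load the right reference when viewing the BAM in IGV or a similar viewer; with STAR it is star_index.transcripts.fa. (Seen another way, since reads are aligned to transcripts, you do not need to worry about how bowtie handles split alignments, or about the maximum split-alignment distance in hisat2 and similar tools.)

3. Merging multiple results (merging expected counts *1)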

#gene level
rsem-generate-data-matrix sample*genes.results > output

#transcripts(isoform) level
rsem-generate-data-matrix sample*isoforms.results > output

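A table of expected read counts (RSEM's probabilistic read counts; raw values that are not normalized) is written. After fixing only the names in the header row, you can load the table into iDEP or a similar tool and get results right away. When you do, first run PCA or a similar analysis to check that the samples fall into sensible groups.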
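To make the merging step concrete, the sketch below collects the 'expected_count' column of several result files into one table keyed by a clean sample name, which also takes care of the header row. It is an illustration of the idea only, not the implementation of rsem-generate-data-matrix; the function name and the in-memory table layout are assumptions.

```python
import csv
from typing import Dict, Iterable, List


def expected_count_matrix(results: Dict[str, Iterable[str]]) -> List[list]:
    """Build [header, row, ...] from sample name -> lines of a *.results file."""
    ids = None
    columns = {}
    for sample, lines in results.items():
        reader = csv.DictReader(lines, delimiter="\t")
        id_field = reader.fieldnames[0]  # gene_id, transcript_id or allele_id
        rows = list(reader)
        sample_ids = [row[id_field] for row in rows]
        if ids is None:
            ids = sample_ids
        elif ids != sample_ids:
            raise ValueError("result files list different features: " + sample)
        columns[sample] = [float(row["expected_count"]) for row in rows]

    table = [[""] + list(columns)]  # header row: sample names only
    for i, feature_id in enumerate(ids or []):
        table.append([feature_id] + [columns[sample][i] for sample in columns])
    return table
```

Each value of `results` can be an open file handle, for example `{"sample1": open("sample1.genes.results")}`, and the table can be written out with `csv.writer(..., delimiter="\t")`.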

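Other notes

You can also run the mapping separately and pass the resulting BAM file to RSEM. In that case, pay attention to the alignment options, especially the settings for reads that map to repeats.

For example, with STAR:

1. Indexing and mapping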

#index
STAR --runMode genomeGenerate --genomeDir STAR_index --genomeFastaFiles genome.fa --sjdbGTFfile annotation.gtf

#mapping
STAR --runThreadN 12 --genomeDir STAR_index --readFilesIn pair_R1.fq.gz pair_R2.fq.gz --readFilesCommand zcat --quantMode TranscriptomeSAM --outSAMtype BAM SortedByCoordinate --genomeLoad NoSharedMemory --outFilterMultimapNmax 1 --outFileNamePrefix sample1

2. Read count

When supplying a BAM file, set the --alignments flag.

rsem-calculate-expression --alignments --paired-end -p 20 --bowtie2 input.bam bowtie2_index sample1

--alignments Input file contains alignments in SAM/BAM/CRAM format. The exact file format will be determined automatically. (Default: off)

Citation

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Bo Li, Colin N Dewey

BMC Bioinformatics. 2011 Aug 4
