macでインフォマティクス

A collection of notes on informatics for HTS (NGS).

RSEM: accurate read counting at the transcript level

2021 1/9: Title corrected

2021 1/15: Added commands and explanation

2021 4/27: Added benchmark paper

2021 10/8: Added option for gzipped FASTQ

 

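RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of a sequenced genome, where it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments.
This study presents RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files, and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM shows superior performance compared with quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to use ambiguously mapping reads effectively, the authors show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.
RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. Because it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. It also provides valuable guidance for the cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.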

 

 

Installation

GitHub

#bioconda
mamba create -n rsem -y python=3.8
conda activate rsem
mamba install -c bioconda rsem -y
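
As a quick sanity check after installation, the sketch below reports whether the two RSEM programs covered in this post are visible on PATH. It assumes the conda environment created above is active. It only looks the commands up and does not run them.

```shell
# Look up the two RSEM entry points used in this post without running them.
checked=0
for tool in rsem-prepare-reference rsem-calculate-expression; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: not found (is the 'rsem' environment active?)"
  fi
  checked=$((checked + 1))
done
```

If either program is reported as missing, re-running `conda activate rsem` is the first thing to try.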

> rsem-prepare-reference -h 

NAME

    rsem-prepare-reference - Prepare transcript references for RSEM and

    optionally build BOWTIE/BOWTIE2/STAR/HISAT2(transcriptome) indices.

 

SYNOPSIS

     rsem-prepare-reference [options] reference_fasta_file(s) reference_name

 

ARGUMENTS

    reference_fasta_file(s)

        Either a comma-separated list of Multi-FASTA formatted files OR a

        directory name. If a directory name is specified, RSEM will read all

        files with suffix ".fa" or ".fasta" in this directory. The files

        should contain either the sequences of transcripts or an entire

        genome, depending on whether the '--gtf' option is used.

 

    reference_name

        The name of the reference used. RSEM will generate several

        reference-related files that are prefixed by this name. This name can

        contain path information (e.g. '/ref/mm9').

 

OPTIONS

    --gtf <file>

        If this option is on, RSEM assumes that 'reference_fasta_file(s)'

        contains the sequence of a genome, and will extract transcript

        reference sequences using the gene annotations specified in <file>,

        which should be in GTF format.

 

        If this and '--gff3' options are off, RSEM will assume

        'reference_fasta_file(s)' contains the reference transcripts. In this

        case, RSEM assumes that the name of each sequence in the Multi-FASTA files

        is its transcript_id.

 

        (Default: off)

 

    --gff3 <file>

        The annotation file is in GFF3 format instead of GTF format. RSEM will

        first convert it to GTF format with the file name

        'reference_name.gtf'. Please make sure that 'reference_name.gtf' does

        not exist. (Default: off)

 

    --gff3-RNA-patterns <pattern>

        <pattern> is a comma-separated list of transcript categories, e.g.

        "mRNA,rRNA". Only transcripts that match the <pattern> will be

        extracted. (Default: "mRNA")

 

    --gff3-genes-as-transcripts

        This option is designed for atypical organisms, such as viruses,

        whose GFF3 files only contain genes. RSEM will treat each gene as a

        unique transcript when it converts the GFF3 file into GTF format.

 

    --trusted-sources <sources>

        <sources> is a comma-separated list of trusted sources, e.g.

        "ENSEMBL,HAVANA". Only transcripts coming from these sources will be

        extracted. If this option is off, all sources are accepted. (Default:

        off)

 

    --transcript-to-gene-map <file>

        Use information from <file> to map from transcript (isoform) ids to

        gene ids. Each line of <file> should be of the form:

 

        gene_id transcript_id

 

        with the two fields separated by a tab character.

 

        If you are using a GTF file for the "UCSC Genes" gene set from the

        UCSC Genome Browser, then the "knownIsoforms.txt" file (obtained from

        the "Downloads" section of the UCSC Genome Browser site) is of this

        format.

 

        If this option is off, then the mapping of isoforms to genes depends

        on whether the '--gtf' option is specified. If '--gtf' is specified,

        then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF

        file. Otherwise, RSEM assumes that each sequence in the reference

        sequence files is a separate gene.

 

        (Default: off)

 

    --allele-to-gene-map <file>

        Use information from <file> to provide gene_id and transcript_id

        information for each allele-specific transcript. Each line of <file>

        should be of the form:

 

        gene_id transcript_id allele_id

 

        with the fields separated by a tab character.

 

        This option is designed for quantifying allele-specific expression. It

        is only valid if '--gtf' option is not specified. allele_id should be

        the sequence names presented in the Multi-FASTA-formatted files.

 

        (Default: off)

 

    --polyA

        Add poly(A) tails to the end of all reference isoforms. The length of

        poly(A) tail added is specified by '--polyA-length' option. STAR

        aligner users may not want to use this option. (Default: do not add

        poly(A) tail to any of the isoforms)

 

    --polyA-length <int>

        The length of the poly(A) tails to be added. (Default: 125)

 

    --no-polyA-subset <file>

        Only meaningful if '--polyA' is specified. Do not add poly(A) tails to

        those transcripts listed in <file>. <file> is a file containing a list

        of transcript_ids. (Default: off)

 

    --bowtie

        Build Bowtie indices. (Default: off)

 

    --bowtie-path <path>

        The path to the Bowtie executables. (Default: the path to Bowtie

        executables is assumed to be in the user's PATH environment variable)

 

    --bowtie2

        Build Bowtie 2 indices. (Default: off)

 

    --bowtie2-path <path>

        The path to the Bowtie 2 executables. (Default: the path to Bowtie 2

        executables is assumed to be in the user's PATH environment variable)

 

    --star

        Build STAR indices. (Default: off)

 

    --star-path <path>

        The path to STAR's executable. (Default: the path to STAR executable

        is assumed to be in user's PATH environment variable)

 

    --star-sjdboverhang <int>

        Length of the genomic sequence around annotated junction. It is only

        used for STAR to build splice junctions database and not needed for

        Bowtie or Bowtie2. It will be passed as the --sjdbOverhang option to

        STAR. According to STAR's manual, its ideal value is

        max(ReadLength)-1, e.g. for 2x101 paired-end reads, the ideal value is

        101-1=100. In most cases, the default value of 100 will work as well

        as the ideal value. (Default: 100)

 

    --hisat2-hca

        Build HISAT2 indices on the transcriptome according to Human Cell

        Atlas (HCA) SMART-Seq2 pipeline. (Default: off)

 

    --hisat2-path <path>

        The path to the HISAT2 executables. (Default: the path to HISAT2

        executables is assumed to be in the user's PATH environment variable)

 

    -p/--num-threads <int>

        Number of threads to use for building STAR's genome indices. (Default:

        1)

 

    -q/--quiet

        Suppress the output of logging information. (Default: off)

 

    -h/--help

        Show help information.
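
The '--transcript-to-gene-map' option above expects one "gene_id<TAB>transcript_id" pair per line. When a GTF already carries both attributes, such a file can probably be derived with awk, as in the following sketch. The toy GTF content, the file names, and the temporary directory are all hypothetical, and the pattern assumes the attributes are written as `gene_id "..."` and `transcript_id "..."` in column 9.

```shell
# Toy GTF (hypothetical content) with two genes and three transcripts.
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"
map="$workdir/mapping.txt"
{
  printf 'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "G1.t1";\n'
  printf 'chr1\tsrc\texon\t201\t300\t.\t+\t.\tgene_id "G1"; transcript_id "G1.t1";\n'
  printf 'chr1\tsrc\texon\t1\t150\t.\t+\t.\tgene_id "G1"; transcript_id "G1.t2";\n'
  printf 'chr2\tsrc\texon\t1\t100\t.\t-\t.\tgene_id "G2"; transcript_id "G2.t1";\n'
} > "$gtf"

# Pull gene_id and transcript_id out of column 9 of every exon line and
# write them as "gene_id<TAB>transcript_id", one unique pair per line.
awk -F '\t' '$3 == "exon" {
  g = ""; t = ""
  if (match($9, /gene_id "[^"]+"/))       g = substr($9, RSTART + 9,  RLENGTH - 10)
  if (match($9, /transcript_id "[^"]+"/)) t = substr($9, RSTART + 15, RLENGTH - 16)
  if (g != "" && t != "") print g "\t" t
}' "$gtf" | sort -u > "$map"

cat "$map"
```

The resulting file could then be passed as '--transcript-to-gene-map mapping.txt'. For GTF files without usable gene_id values (the UCSC Table Browser case mentioned in the help text), this approach does not apply and knownIsoforms.txt is the better source.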

 

PRIOR-ENHANCED RSEM OPTIONS

    --prep-pRSEM

        A Boolean indicating whether to prepare reference files for pRSEM,

        including building Bowtie indices for a genome and selecting training

        set isoforms. The index files will be used for aligning ChIP-seq reads

        in prior-enhanced RSEM and the training set isoforms will be used for

        learning prior. A path to Bowtie executables and a mappability file in

        bigWig format are required when this option is on. Currently, Bowtie2

        is not supported for prior-enhanced RSEM. (Default: off)

 

    --mappability-bigwig-file <string>

        Full path to a whole-genome mappability file in bigWig format. This

        file is required for running prior-enhanced RSEM. It is used for

        selecting a training set of isoforms for prior-learning. This file can

        be either downloaded from UCSC Genome Browser or generated by GEM

        (Derrien et al., 2012, PLoS One). (Default: "")

 

DESCRIPTION

    This program extracts/preprocesses the reference sequences for RSEM and

    prior-enhanced RSEM. It can optionally build Bowtie indices (with

    '--bowtie' option) and/or Bowtie 2 indices (with '--bowtie2' option) using

    their default parameters. It can also optionally build STAR indices (with

    '--star' option) using parameters from ENCODE3's STAR-RSEM pipeline. For

    prior-enhanced RSEM, it can build Bowtie genomic indices and select

    training set isoforms (with options '--prep-pRSEM' and

    '--mappability-bigwig-file <string>'). If an alternative aligner is to be

    used, indices for that particular aligner can be built from either

    'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (see OUTPUT for

    details). This program is used in conjunction with the

    'rsem-calculate-expression' program.
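
When building STAR indices, the '--star-sjdboverhang' help text above gives max(ReadLength)-1 as the ideal value. The sketch below derives that number from a FASTQ file. The toy reads and file names are hypothetical, and the script assumes a plain four-line-per-record, uncompressed FASTQ.

```shell
# Toy FASTQ (hypothetical reads) with read lengths 8 and 10.
workdir=$(mktemp -d)
fastq="$workdir/toy.fastq"
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$fastq"

# Sequence lines are every 4th line starting at line 2; track the longest.
max_len=$(awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
               END { print max + 0 }' "$fastq")

# STAR's suggested value is max(ReadLength) - 1.
overhang=$((max_len - 1))
echo "--star-sjdboverhang $overhang"
```

As the help text notes, the default of 100 is expected to work about as well in most cases, so this is mainly useful for unusually short or long reads.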

 

OUTPUT

    This program will generate 'reference_name.grp', 'reference_name.ti',

    'reference_name.transcripts.fa', 'reference_name.seq',

    'reference_name.chrlist' (if '--gtf' is on), 'reference_name.idx.fa',

    'reference_name.n2g.idx.fa', optional Bowtie/Bowtie 2 index files, and

    optional STAR index files.

 

    'reference_name.grp', 'reference_name.ti', 'reference_name.seq', and

    'reference_name.chrlist' are used by RSEM internally.

 

    'reference_name.transcripts.fa' contains the extracted reference

    transcripts in Multi-FASTA format. Poly(A) tails are not added and it may

    contain lower case bases in its sequences if the corresponding genomic

    regions are soft-masked.

 

    'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' are used by

    aligners to build their own indices. In these two files, all sequence

    bases are converted into upper case. In addition, poly(A) tails are added

    if '--polyA' option is set. The only difference between

    'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' is that

    'reference_name.n2g.idx.fa' in addition converts all 'N' characters to 'G'

    characters. This conversion is in particular desired for aligners (e.g.

    Bowtie) that do not allow reads to overlap with 'N' characters in the

    reference sequences. Otherwise, 'reference_name.idx.fa' should be used to

    build the aligner's index files. RSEM uses 'reference_name.idx.fa' to

    build Bowtie 2 indices and 'reference_name.n2g.idx.fa' to build Bowtie

    indices. For visualizing the transcript-coordinate-based BAM files

    generated by RSEM in IGV, 'reference_name.idx.fa' should be imported as a

    "genome" (see Visualization section in README.md for details).

 

    If the whole genome is indexed for prior-enhanced RSEM, all the index

    files will be generated with prefix as 'reference_name_prsem'. Selected

    isoforms for training set are listed in the file

    'reference_name_prsem.training_tr_crd'
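
To make the difference between 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' concrete, the following sketch applies the two transformations described above (upper-casing, and additionally N to G) to a toy FASTA. It is only an illustration of the described behaviour, not RSEM's own code, and all file names and sequences are hypothetical.

```shell
workdir=$(mktemp -d)
src="$workdir/toy.transcripts.fa"
printf '>tx1\nacgtNNac\n>tx2\nGGnnTTaa\n' > "$src"

# idx.fa-style: upper-case sequence lines, keep header lines untouched.
awk '/^>/ { print; next } { print toupper($0) }' "$src" > "$workdir/toy.idx.fa"

# n2g.idx.fa-style: upper-case, then turn every N into G.
awk '/^>/ { print; next } { s = toupper($0); gsub(/N/, "G", s); print s }' \
  "$src" > "$workdir/toy.n2g.idx.fa"

idx_seq=$(sed -n '2p' "$workdir/toy.idx.fa")
n2g_seq=$(sed -n '2p' "$workdir/toy.n2g.idx.fa")
echo "idx.fa: $idx_seq  n2g.idx.fa: $n2g_seq"
```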

 

EXAMPLES

    1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version

    of the mouse genome. We have downloaded the UCSC Genes transcript

    annotations in GTF format (as mm9.gtf) using the Table Browser and the

    knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all

    chromosome files for mm9 in the directory '/data/mm9'. We want to put the

    generated reference files under '/ref' with name 'mouse_0'. We do not add

    any poly(A) tails. Please note that GTF files generated from UCSC's Table

    Browser do not contain isoform-gene relationship information. For the UCSC

    Genes annotation, this information can be obtained from the

    knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie

    executables are found in '/sw/bowtie'.

 

    There are two ways to write the command:

 

     rsem-prepare-reference --gtf mm9.gtf \

                            --transcript-to-gene-map knownIsoforms.txt \

                            --bowtie \

                            --bowtie-path /sw/bowtie \

                            /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \

                            /ref/mouse_0

 

    OR

 

     rsem-prepare-reference --gtf mm9.gtf \

                            --transcript-to-gene-map knownIsoforms.txt \

                            --bowtie \

                            --bowtie-path /sw/bowtie \

                            /data/mm9 \

                            /ref/mouse_0

 

    2) Suppose we also want to build Bowtie 2 indices in the above example and

    Bowtie 2 executables are found in '/sw/bowtie2', the command will be:

 

     rsem-prepare-reference --gtf mm9.gtf \

                            --transcript-to-gene-map knownIsoforms.txt \

                            --bowtie \

                            --bowtie-path /sw/bowtie \

                            --bowtie2 \

                            --bowtie2-path /sw/bowtie2 \

                            /data/mm9 \

                            /ref/mouse_0

 

    3) Suppose we want to build STAR indices in the above example and save

    index files under '/ref' with name 'mouse_0'. Assuming STAR executable is

    '/sw/STAR', the command will be:

 

     rsem-prepare-reference --gtf mm9.gtf \

                            --transcript-to-gene-map knownIsoforms.txt \

                            --star \

                            --star-path /sw/STAR \

                            -p 8 \

                            /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \

                            /ref/mouse_0

 

    OR

 

     rsem-prepare-reference --gtf mm9.gtf \

                            --transcript-to-gene-map knownIsoforms.txt \

                            --star \

                            --star-path /sw/STAR \

                            -p 8 \

                            /data/mm9 \

                            /ref/mouse_0

 

    STAR genome index files will be saved under '/ref/'.

 

    4) Suppose we want to prepare references for prior-enhanced RSEM in the

    above example. In this scenario, both STAR and Bowtie are required to

    build genomic indices - STAR for RNA-seq reads and Bowtie for ChIP-seq

    reads. Assume their executables are under '/sw/STAR' and '/sw/Bowtie',

    respectively, and that the mappability file for the mouse genome is

    '/data/mm9.bigWig'. The command will be:

 

     rsem-prepare-reference --gtf mm9.gtf \

                            --transcript-to-gene-map knownIsoforms.txt \

                            --star \

                            --star-path /sw/STAR \

                            -p 8 \

                            --prep-pRSEM \

                            --bowtie-path /sw/Bowtie \

                            --mappability-bigwig-file /data/mm9.bigWig \

                            /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \

                            /ref/mouse_0

 

    OR

 

     rsem-prepare-reference --gtf mm9.gtf \

                            --transcript-to-gene-map knownIsoforms.txt \

                            --star \

                            --star-path /sw/STAR \

                            -p 8 \

                            --prep-pRSEM \

                            --bowtie-path /sw/Bowtie \

                            --mappability-bigwig-file /data/mm9.bigWig \

                            /data/mm9 \

                            /ref/mouse_0

 

    Both STAR and Bowtie's index files will be saved under '/ref/'. Bowtie

    files will have name prefix 'mouse_0_prsem'

 

    5) Suppose we only have transcripts from EST tags stored in 'mm9.fasta'

    and isoform-gene information stored in 'mapping.txt'. We want to add 125bp

    long poly(A) tails to all transcripts. The reference_name is set as

    'mouse_125'. In addition, we do not want to build Bowtie/Bowtie 2 indices,

    and will use an alternative aligner to align reads against either

    'mouse_125.idx.fa' or 'mouse_125.n2g.idx.fa':

 

     rsem-prepare-reference --transcript-to-gene-map mapping.txt \

                            --polyA \

                            mm9.fasta \

                            mouse_125
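
Since RSEM does not need a reference genome, a de novo assembly can be prepared in much the same way as example 5. The sketch below builds such a command from options listed in the help text above and prints it. The file names 'assembly.fasta' and 'assembly.gene_map.txt', the reference name, and the RUN_RSEM switch are hypothetical. The command is only executed when RUN_RSEM=1 is set and RSEM is installed.

```shell
# Hypothetical inputs: assembled transcripts and a gene-to-transcript map.
transcripts="assembly.fasta"
gene_map="assembly.gene_map.txt"
ref_name="ref/assembly"

# No --gtf here: the FASTA is treated as transcripts, not as a genome.
set -- rsem-prepare-reference \
  --transcript-to-gene-map "$gene_map" \
  --bowtie2 \
  "$transcripts" "$ref_name"
prep_cmd="$*"
echo "$prep_cmd"

# Only execute when explicitly requested and RSEM is installed.
if [ "${RUN_RSEM:-0}" = "1" ] && command -v rsem-prepare-reference >/dev/null 2>&1; then
  mkdir -p "$(dirname "$ref_name")"
  "$@"
fi
```

Without '--transcript-to-gene-map', each sequence would be treated as a separate gene, as the option description states, so gene-level results would simply mirror the isoform-level ones.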

 

> rsem-calculate-expression -h

NAME

    rsem-calculate-expression - Estimate gene and isoform expression from

    RNA-Seq data.

 

SYNOPSIS

     rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name 

     rsem-calculate-expression [options] --paired-end upstream_read_file(s) downstream_read_file(s) reference_name sample_name 

     rsem-calculate-expression [options] --alignments [--paired-end] input reference_name sample_name

 

ARGUMENTS

    upstream_read_file(s)

        Comma-separated list of files containing single-end reads or upstream

        reads for paired-end data. By default, these files are assumed to be

        in FASTQ format. If the --no-qualities option is specified, then FASTA

        format is expected.

 

    downstream_read_file(s)

        Comma-separated list of files containing downstream reads which are

        paired with the upstream reads. By default, these files are assumed to

        be in FASTQ format. If the --no-qualities option is specified, then

        FASTA format is expected.

 

    input

        SAM/BAM/CRAM formatted input file. If "-" is specified for the

        filename, the input is instead assumed to come from standard input.

        RSEM requires all alignments of the same read to be grouped together. For

        paired-end reads, RSEM also requires the two mates of any alignment be

        adjacent. In addition, RSEM does not allow the SEQ and QUAL fields to

        be empty. See Description section for how to make input file obey

        RSEM's requirements.

 

    reference_name

        The name of the reference used. The user must have run

        'rsem-prepare-reference' with this reference_name before running this

        program.

 

    sample_name

        The name of the sample analyzed. All output files are prefixed by this

        name (e.g., sample_name.genes.results)

 

BASIC OPTIONS

    --paired-end

        Input reads are paired-end reads. (Default: off)

 

    --no-qualities

        Input reads do not contain quality scores. (Default: off)

 

    --strandedness <none|forward|reverse>

        This option defines the strandedness of the RNA-Seq reads. It

        recognizes three values: 'none', 'forward', and 'reverse'. 'none'

        refers to non-strand-specific protocols. 'forward' means all

        (upstream) reads are derived from the forward strand. 'reverse' means

        all (upstream) reads are derived from the reverse strand. If

        'forward'/'reverse' is set, the '--norc'/'--nofw' Bowtie/Bowtie 2

        option will also be enabled to avoid aligning reads to the opposite

        strand. For Illumina TruSeq Stranded protocols, please use 'reverse'.

        (Default: 'none')

 

    -p/--num-threads <int>

        Number of threads to use. Bowtie/Bowtie2, expression estimation

        and 'samtools sort' will all use this many threads. (Default: 1)

 

    --alignments

        Input file contains alignments in SAM/BAM/CRAM format. The exact file

        format will be determined automatically. (Default: off)

 

    --fai <file>

        If the header section of input alignment file does not contain

        reference sequence information, this option should be turned on.

        <file> is a FAI format file containing each reference sequence's name

        and length. Please refer to the SAM official website for the details

        of FAI format. (Default: off)

 

    --bowtie2

        Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM

        does not handle indel, local and discordant alignments, the Bowtie2

        parameters are set in a way to avoid those alignments. In particular,

        we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1

        --score-min L,0,-0.1' by default. The last parameter of '--score-min',

        '-0.1', is the negative of maximum mismatch rate. This rate can be set

        by option '--bowtie2-mismatch-rate'. If reads are paired-end, we

        additionally use options '--no-mixed' and '--no-discordant'. (Default:

        off)

 

    --star

        Use STAR to align reads. Alignment parameters are from ENCODE3's

        STAR-RSEM pipeline. To save computational time and memory resources,

        STAR's output BAM file is unsorted. It is stored in RSEM's temporary

        directory with name as 'sample_name.bam'. Each STAR job will have its

        own private copy of the genome in memory. (Default: off)

 

    --hisat2-hca

        Use HISAT2 to align reads to the transcriptome according to Human Cell

        Atlas SMART-Seq2 pipeline. In particular, we use HISAT2 parameters "-k

        10 --secondary --rg-id=$sampleToken --rg SM:$sampleToken --rg

        LB:$sampleToken --rg PL:ILLUMINA --rg PU:$sampleToken --new-summary

        --summary-file $sampleName.log --met-file $sampleName.hisat2.met.txt

        --met 5 --mp 1,1 --np 1 --score-min L,0,-0.1 --rdg 99999999,99999999

        --rfg 99999999,99999999 --no-spliced-alignment --no-softclip --seed

        12345". If inputs are paired-end reads, we additionally use parameters

        "--no-mixed --no-discordant". (Default: off)

 

    --append-names

        If gene_name/transcript_name is available, append it to the end of

        gene_id/transcript_id (separated by '_') in files

        'sample_name.isoforms.results' and 'sample_name.genes.results'.

        (Default: off)

 

    --seed <uint32>

        Set the seed for the random number generators used in calculating

        posterior mean estimates and credibility intervals. The seed must be a

        non-negative 32 bit integer. (Default: off)

 

    --single-cell-prior

        By default, RSEM uses Dirichlet(1) as the prior to calculate posterior

        mean estimates and credibility intervals. However, far fewer genes are

        expressed in single cell RNA-Seq data. Thus, if you want to compute

        posterior mean estimates and/or credibility intervals and you have

        single-cell RNA-Seq data, you are recommended to turn on this option.

        Then RSEM will use Dirichlet(0.1) as the prior which encourages the

        sparsity of the expression levels. (Default: off)

 

    --calc-pme

        Run RSEM's collapsed Gibbs sampler to calculate posterior mean

        estimates. (Default: off)

 

    --calc-ci

        Calculate 95% credibility intervals and posterior mean estimates. The

        credibility level can be changed by setting '--ci-credibility-level'.

        (Default: off)

 

    -q/--quiet

        Suppress the output of logging information. (Default: off)

 

    -h/--help

        Show help information.

 

    --version

        Show version information.
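
Putting several of the basic options together, a typical paired-end run might look like the sketch below. It only assembles and prints the command. The read file names, sample name, thread count, and seed value are hypothetical, and the reference 'ref/mouse_0' is assumed to have been prepared with '--bowtie2' as in example 2 of rsem-prepare-reference. The 'reverse' setting follows the help text's advice for Illumina TruSeq Stranded libraries.

```shell
# Hypothetical inputs; the reference must have been prepared beforehand.
reads_1="sample1_R1.fastq"
reads_2="sample1_R2.fastq"
ref_name="ref/mouse_0"
sample="sample1"

set -- rsem-calculate-expression \
  --paired-end \
  --bowtie2 \
  --strandedness reverse \
  --calc-ci \
  --seed 12345 \
  -p 8 \
  "$reads_1" "$reads_2" "$ref_name" "$sample"
quant_cmd="$*"
echo "$quant_cmd"

# Output files are prefixed by the sample name.
echo "expected results: ${sample}.genes.results, ${sample}.isoforms.results"
```

Fixing '--seed' is optional. It is included here because '--calc-ci' relies on random sampling, and a fixed seed should make repeated runs easier to compare.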

 

OUTPUT OPTIONS

    --sort-bam-by-read-name

        Sort BAM file aligned under transcript coordinate by read name.

        Setting this option on will produce deterministic maximum likelihood

        estimations from independent runs. Note that sorting will take long

        time and lots of memory. (Default: off)

 

    --no-bam-output

        Do not output any BAM file. (Default: off)

 

    --sampling-for-bam

        When RSEM generates a BAM file, instead of outputting all alignments a

        read has with their posterior probabilities, one alignment is sampled

        according to the posterior probabilities. The sampling procedure

        includes the alignment to the "noise" transcript, which does not

        appear in the BAM file. Only the sampled alignment has a weight of 1.

        All other alignments have weight 0. If the "noise" transcript is

        sampled, all alignments appearing in the BAM file should have weight 0.

        (Default: off)

 

    --output-genome-bam

        Generate a BAM file, 'sample_name.genome.bam', with alignments mapped

        to genomic coordinates and annotated with their posterior

        probabilities. In addition, RSEM will call samtools (included in RSEM

        package) to sort and index the bam file.

        'sample_name.genome.sorted.bam' and

        'sample_name.genome.sorted.bam.bai' will be generated. (Default: off)

 

    --sort-bam-by-coordinate

        Sort RSEM generated transcript and genome BAM files by coordinates and

        build associated indices. (Default: off)

 

    --sort-bam-memory-per-thread <string>

        Set the maximum memory per thread that can be used by 'samtools sort'.

        <string> represents the memory and accepts suffixes 'K/M/G'. RSEM will

        pass <string> to the '-m' option of 'samtools sort'. Note that the

        default used here is different from the default used by samtools.

        (Default: 1G)

 

ALIGNER OPTIONS

    --seed-length <int>

        Seed length used by the read aligner. Providing the correct value is

        important for RSEM. If RSEM runs Bowtie, it uses this value for

        Bowtie's seed length parameter. Any read with its or at least one of

        its mates' (for paired-end reads) length less than this value will be

        ignored. If poly(A) tails are not added to the references, the minimum

        allowed value is 5, otherwise, the minimum allowed value is 25. Note

        that this script will only check if the value >= 5 and give a warning

        message if the value < 25 but >= 5. (Default: 25)

 

    --phred33-quals

        Input quality scores are encoded as Phred+33. This option is used by

        Bowtie, Bowtie 2 and HISAT2. (Default: on)

 

    --phred64-quals

        Input quality scores are encoded as Phred+64 (default for GA Pipeline

        ver. >= 1.3). This option is used by Bowtie, Bowtie 2 and HISAT2.

        (Default: off)

 

    --solexa-quals

        Input quality scores are solexa encoded (from GA Pipeline ver. < 1.3).

        This option is used by Bowtie, Bowtie 2 and HISAT2. (Default: off)

 

    --bowtie-path <path>

        The path to the Bowtie executables. (Default: the path to the Bowtie

        executables is assumed to be in the user's PATH environment variable)

 

    --bowtie-n <int>

        (Bowtie parameter) max # of mismatches in the seed. (Range: 0-3,

        Default: 2)

 

    --bowtie-e <int>

        (Bowtie parameter) max sum of mismatch quality scores across the

        alignment. (Default: 99999999)

 

    --bowtie-m <int>

        (Bowtie parameter) suppress all alignments for a read if > <int> valid

        alignments exist. (Default: 200)

 

    --bowtie-chunkmbs <int>

        (Bowtie parameter) memory allocated for best first alignment

        calculation (Default: 0 - use Bowtie's default)

 

    --bowtie2-path <path>

        (Bowtie 2 parameter) The path to the Bowtie 2 executables. (Default:

        the path to the Bowtie 2 executables is assumed to be in the user's

        PATH environment variable)

 

    --bowtie2-mismatch-rate <double>

        (Bowtie 2 parameter) The maximum mismatch rate allowed. (Default: 0.1)

 

    --bowtie2-k <int>

        (Bowtie 2 parameter) Find up to <int> alignments per read. (Default:

        200)

 

    --bowtie2-sensitivity-level <string>

        (Bowtie 2 parameter) Set Bowtie 2's preset options in --end-to-end

        mode. This option controls how hard Bowtie 2 tries to find alignments.

        <string> must be one of "very_fast", "fast", "sensitive" and

        "very_sensitive". The four candidates correspond to Bowtie 2's

        "--very-fast", "--fast", "--sensitive" and "--very-sensitive" options.

        (Default: "sensitive" - use Bowtie 2's default)

 

    --star-path <path>

        The path to STAR's executable. (Default: the path to STAR executable

        is assumed to be in user's PATH environment variable)

 

    --star-gzipped-read-file

        (STAR parameter) Input read file(s) is compressed by gzip. (Default:

        off)

 

    --star-bzipped-read-file

        (STAR parameter) Input read file(s) is compressed by bzip2. (Default:

        off)

 

    --star-output-genome-bam

        (STAR parameter) Save the BAM file from STAR alignment under genomic

        coordinate to 'sample_name.STAR.genome.bam'. This file is NOT sorted

        by genomic coordinate. In this file, according to STAR's manual,

        'paired ends of an alignment are always adjacent, and multiple

        alignments of a read are adjacent as well'. (Default: off)

 

    --hisat2-path <path>

        The path to HISAT2's executable. (Default: the path to HISAT2

        executable is assumed to be in user's PATH environment variable)
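
For compressed FASTQ input with STAR, the matching flag from the list above has to be added by hand. The sketch below picks it from the file extension and prints the resulting command. The helper function name and the file names are hypothetical, and it assumes both mates are compressed the same way and that 'ref/mouse_0' was prepared with '--star'.

```shell
# Return the STAR read-compression flag for a (hypothetical) read file name.
star_compression_flag() {
  case "$1" in
    *.gz)  echo "--star-gzipped-read-file" ;;
    *.bz2) echo "--star-bzipped-read-file" ;;
    *)     echo "" ;;
  esac
}

reads_1="sample1_R1.fastq.gz"
reads_2="sample1_R2.fastq.gz"
flag=$(star_compression_flag "$reads_1")

# $flag is left unquoted on purpose so an empty value adds no argument.
set -- rsem-calculate-expression --star $flag --paired-end -p 8 \
  "$reads_1" "$reads_2" ref/mouse_0 sample1
star_cmd="$*"
echo "$star_cmd"
```

Note that these two flags are labelled as STAR parameters in the help text, so this sketch does not cover compressed input for the Bowtie-based modes.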

 

ADVANCED OPTIONS

    --tag <string>

        The name of the optional field used in the SAM input for identifying a

        read with too many valid alignments. The field should have the format

        <tagName>:i:<value>, where a <value> bigger than 0 indicates a read

        with too many alignments. (Default: "")

 

    --fragment-length-min <int>

        Minimum read/insert length allowed. This is also the value for the

        Bowtie/Bowtie2 -I option. (Default: 1)

 

    --fragment-length-max <int>

        Maximum read/insert length allowed. This is also the value for the

        Bowtie/Bowtie 2 -X option. (Default: 1000)

 

    --fragment-length-mean <double>

        (single-end data only) The mean of the fragment length distribution,

        which is assumed to be a Gaussian. (Default: -1, which disables use of

        the fragment length distribution)

 

    --fragment-length-sd <double>

        (single-end data only) The standard deviation of the fragment length

        distribution, which is assumed to be a Gaussian. (Default: 0, which

        assumes that all fragments are of the same length, given by the

        rounded value of --fragment-length-mean)

 

    --estimate-rspd

        Set this option if you want to estimate the read start position

        distribution (RSPD) from data. Otherwise, RSEM will use a uniform

        RSPD. (Default: off)

 

    --num-rspd-bins <int>

        Number of bins in the RSPD. Only relevant when '--estimate-rspd' is

        specified. Use of the default setting is recommended. (Default: 20)

 

    --gibbs-burnin <int>

        The number of burn-in rounds for RSEM's Gibbs sampler. Each round

        passes over the entire data set once. If RSEM can use multiple

        threads, multiple Gibbs samplers will start at the same time and all

        samplers share the same burn-in number. (Default: 200)

 

    --gibbs-number-of-samples <int>

        The total number of count vectors RSEM will collect from its Gibbs

        samplers. (Default: 1000)

 

    --gibbs-sampling-gap <int>

        The number of rounds between two successive count vectors RSEM collects.

        If the count vector after round N is collected, the count vector after

        round N + <int> will also be collected. (Default: 1)

 

    --ci-credibility-level <double>

        The credibility level for credibility intervals. (Default: 0.95)

 

    --ci-memory <int>

        Maximum size (in memory, MB) of the auxiliary buffer used for

        computing credibility intervals (CI). (Default: 1024)

 

    --ci-number-of-samples-per-count-vector <int>

        The number of read generating probability vectors sampled per sampled

        count vector. The credibility intervals are calculated by first sampling

        P(C | D) and then sampling P(Theta | C) for each sampled count vector.

        This option controls how many Theta vectors are sampled per sampled

        count vector. (Default: 50)
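
To get a feel for how the Gibbs and CI options combine, the following sketch multiplies out the documented defaults. It assumes a single Gibbs sampler that collects one count vector every `gap` rounds after burn-in; how RSEM divides the work between several samplers is not stated in the help text, so that part is left out. The function name is hypothetical.

```python
def ci_sampling_budget(burnin=200, n_samples=1000, gap=1, thetas_per_count=50):
    """Return (post-burn-in rounds, total rounds, total Theta vectors).

    Assumption: one sampler, one count vector collected every `gap` rounds.
    """
    sampling_rounds = n_samples * gap        # rounds spent collecting count vectors
    total_rounds = burnin + sampling_rounds  # burn-in rounds are discarded
    total_thetas = n_samples * thetas_per_count
    return sampling_rounds, total_rounds, total_thetas

print(ci_sampling_budget())  # (1000, 1200, 50000)
```

With the defaults, the credibility intervals are therefore based on 1000 x 50 = 50000 sampled Theta vectors.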

 

    --keep-intermediate-files

        Keep temporary files generated by RSEM. RSEM creates a temporary

        directory, 'sample_name.temp', into which it puts all intermediate

        output files. If this directory already exists, RSEM overwrites all

        files generated by previous RSEM runs inside of it. By default, after

        RSEM finishes, the temporary directory is deleted. Set this option to

        prevent the deletion of this directory and the intermediate files

        inside of it. (Default: off)

 

    --temporary-folder <string>

        Set where to put the temporary files generated by RSEM. If the folder

        specified does not exist, RSEM will try to create it. (Default:

        sample_name.temp)

 

    --time

        Output time consumed by each step of RSEM to 'sample_name.time'.

        (Default: off)

 

PRIOR-ENHANCED RSEM OPTIONS

    --run-pRSEM

        Running prior-enhanced RSEM (pRSEM). Prior parameters, i.e. isoform's

        initial pseudo-count for RSEM's Gibbs sampling, will be learned from

        input RNA-seq data and an external data set. When pRSEM needs and only

        needs ChIP-seq peak information to partition isoforms (e.g. in pRSEM's

        default partition model), either ChIP-seq peak file (with the

        '--chipseq-peak-file' option) or ChIP-seq FASTQ files for target and

        input and the path for Bowtie executables are required (with the

        '--chipseq-target-read-files <string>', '--chipseq-control-read-files

        <string>', and '--bowtie-path <path>' options). Otherwise, ChIP-seq

        FASTQ files for target and control and the path to Bowtie executables

        are required. (Default: off)

 

    --chipseq-peak-file <string>

        Full path to a ChIP-seq peak file in ENCODE's narrowPeak, i.e. BED6+4,

        format. This file is used when running prior-enhanced RSEM in the

        default two-partition model. It partitions isoforms by whether they

        have ChIP-seq overlapping with their transcription start site region

        or not. Each partition will have its own prior parameter learned from

        a training set. This file can be either gzipped or ungzipped.

        (Default: "")

 

    --chipseq-target-read-files <string>

        Comma-separated full path of FASTQ read file(s) for ChIP-seq target.

        This option is used when running prior-enhanced RSEM. It provides

        information to calculate ChIP-seq peaks and signals. The file(s) can

        be either ungzipped or gzipped with a suffix '.gz' or '.gzip'. The

        options '--bowtie-path <path>' and '--chipseq-control-read-files

        <string>' must be defined when this option is specified. (Default: "")

 

    --chipseq-control-read-files <string>

        Comma-separated full path of FASTQ read file(s) for ChIP-seq control.

        This option is used when running prior-enhanced RSEM. It provides

        information to call ChIP-seq peaks. The file(s) can be either

        ungzipped or gzipped with a suffix '.gz' or '.gzip'. The options

        '--bowtie-path <path>' and '--chipseq-target-read-files <string>' must

        be defined when this option is specified. (Default: "")

 

    --chipseq-read-files-multi-targets <string>

        Comma-separated full path of FASTQ read files for multiple ChIP-seq

        targets. This option is used when running prior-enhanced RSEM, where

        prior is learned from multiple complementary data sets. It provides

        information to calculate ChIP-seq signals. All files can be either

        ungzipped or gzipped with a suffix '.gz' or '.gzip'. When this option

        is specified, the option '--bowtie-path <path>' must be defined and

        the option '--partition-model <string>' will be set to 'cmb_lgt'

        automatically. (Default: "")

 

    --chipseq-bed-files-multi-targets <string>

        Comma-separated full path of BED files for multiple ChIP-seq targets.

        This option is used when running prior-enhanced RSEM, where prior is

        learned from multiple complementary data sets. It provides information

        of ChIP-seq signals and must have at least the first six BED columns.

        All files can be either ungzipped or gzipped with a suffix '.gz' or

        '.gzip'. When this option is specified, the option '--partition-model

        <string>' will be set to 'cmb_lgt' automatically. (Default: "")

 

    --cap-stacked-chipseq-reads

        Keep a maximum number of ChIP-seq reads that aligned to the same

        genomic interval. This option is used when running prior-enhanced

        RSEM, where prior is learned from multiple complementary data sets.

        This option is only in use when either

        '--chipseq-read-files-multi-targets <string>' or

        '--chipseq-bed-files-multi-targets <string>' is specified. (Default:

        off)

 

    --n-max-stacked-chipseq-reads <int>

        The maximum number of stacked ChIP-seq reads to keep. This option is

        used when running prior-enhanced RSEM, where prior is learned from

        multiple complementary data sets. This option is only in use when the

        option '--cap-stacked-chipseq-reads' is set. (Default: 5)

 

    --partition-model <string>

        A keyword to specify the partition model used by prior-enhanced RSEM.

        It must be one of the following keywords:

 

        - pk

          Partitioned by whether an isoform has a ChIP-seq peak overlapping

          with its transcription start site (TSS) region. The TSS region is

          defined as [TSS-500bp, TSS+500bp]. For simplicity, we refer to this

          type of peak as 'TSS peak' when explaining other keywords.

 

        - pk_lgtnopk

          First partitioned by TSS peak. Then, for isoforms in the 'no TSS

          peak' set, a logistic model is employed to further classify them

          into two partitions.

 

        - lm3, lm4, lm5, or lm6

          Based on their ChIP-seq signals, isoforms are classified into 3, 4,

          5, or 6 partitions by a linear regression model.

 

        - nopk_lm2pk, nopk_lm3pk, nopk_lm4pk, or

        nopk_lm5pk

          First partitioned by TSS peak. Then, for isoforms in the 'with TSS

          peak' set, a linear regression model is employed to further classify

          them into 2, 3, 4, or 5 partitions.

 

        - pk_lm2nopk, pk_lm3nopk, pk_lm4nopk, or

        pk_lm5nopk

          First partitioned by TSS peak. Then, for isoforms in the 'no TSS

          peak' set, a linear regression model is employed to further classify

          them into 2, 3, 4, or 5 partitions.

 

        - cmb_lgt

          Using a logistic regression to combine TSS signals from multiple

          complementary data sets and partition training set isoform into

          'expressed' and 'not expressed'. This partition model is only in use

          when either '--chipseq-read-files-multi-targets <string>' or

          '--chipseq-bed-files-multi-targets <string>' is specified.

 

        Parameters for all the above models are learned from a training set.

        For detailed explanations, please see prior-enhanced RSEM's paper.

        (Default: 'pk')

 

DEPRECATED OPTIONS

        The options in this section are deprecated. They are here only for

        compatibility reasons and may be removed in future releases.

 

    --sam

        Inputs are alignments in SAM format. (Default: off)

 

    --bam

        Inputs are alignments in BAM format. (Default: off)

 

    --strand-specific

        Equivalent to '--strandedness forward'. (Default: off)

 

    --forward-prob <double>

        Probability of generating a read from the forward strand of a

        transcript. Set to 1 for a strand-specific protocol where all

        (upstream) reads are derived from the forward strand, 0 for a

        strand-specific protocol where all (upstream) reads are derived from

        the reverse strand, or 0.5 for a non-strand-specific protocol.

        (Default: off)

 

DESCRIPTION

    In its default mode, this program aligns input reads against a reference

    transcriptome with Bowtie and calculates expression values using the

    alignments. RSEM assumes the data are single-end reads with quality

    scores, unless the '--paired-end' or '--no-qualities' options are

    specified. Alternatively, users can use STAR to align reads using the

    '--star' option. RSEM has provided options in 'rsem-prepare-reference' to

    prepare STAR's genome indices. Users may use an alternative aligner by

    specifying '--alignments', and providing an alignment file in SAM/BAM/CRAM

    format. However, users should make sure that they align against the

    indices generated by 'rsem-prepare-reference' and the alignment file

    satisfies the requirements mentioned in ARGUMENTS section.

 

    One simple way to make the alignment file satisfy RSEM's requirements

    is to use the 'convert-sam-for-rsem' script. This script accepts

    SAM/BAM/CRAM files as input and outputs a BAM file. For example, type the

    following command to convert a SAM file, 'input.sam', to a ready-for-use

    BAM file, 'input_for_rsem.bam':

 

      convert-sam-for-rsem input.sam input_for_rsem

 

    For details, please refer to 'convert-sam-for-rsem's documentation page.

 

NOTES

    1. Users must run 'rsem-prepare-reference' with the appropriate reference

    before using this program.

 

    2. For single-end data, it is strongly recommended that the user provide

    the fragment length distribution parameters (--fragment-length-mean and

    --fragment-length-sd). For paired-end data, RSEM will automatically learn

    a fragment length distribution from the data.

 

    3. Some aligner parameters have default values different from their

    original settings.

 

    4. With the '--calc-pme' option, posterior mean estimates will be

    calculated in addition to maximum likelihood estimates.

 

    5. With the '--calc-ci' option, 95% credibility intervals and posterior

    mean estimates will be calculated in addition to maximum likelihood

    estimates.

 

    6. The temporary directory and all intermediate files will be removed when

    RSEM finishes unless '--keep-intermediate-files' is specified.

 

    With the '--run-pRSEM' option and associated options (see section

    'PRIOR-ENHANCED RSEM OPTIONS' above for details), prior-enhanced RSEM will

    be running. Prior parameters will be learned from supplied external data

    set(s) and assigned as initial pseudo-counts for isoforms in the

    corresponding partition for Gibbs sampling.

 

OUTPUT

    sample_name.isoforms.results

        File containing isoform level expression estimates. The first line

        contains column names separated by the tab character. The format of

        each line in the rest of this file is:

 

        transcript_id gene_id length effective_length expected_count TPM FPKM

        IsoPct [posterior_mean_count posterior_standard_deviation_of_count

        pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound

        TPM_ci_upper_bound TPM_coefficient_of_quartile_variation

        FPKM_ci_lower_bound FPKM_ci_upper_bound

        FPKM_coefficient_of_quartile_variation]

 

        Fields are separated by the tab character. Fields within "[]" are

        optional. They will not be presented if neither '--calc-pme' nor

        '--calc-ci' is set.

 

        'transcript_id' is the transcript name of this transcript. 'gene_id'

        is the gene name of the gene which this transcript belongs to (denote

        this gene as its parent gene). If no gene information is provided,

        'gene_id' and 'transcript_id' are the same.

 

        'length' is this transcript's sequence length (poly(A) tail is not

        counted). 'effective_length' counts only the positions that can

        generate a valid fragment. If no poly(A) tail is added,

        'effective_length' is equal to transcript length - mean fragment

        length + 1. If one transcript's effective length is less than 1, this

        transcript's both effective length and abundance estimates are set to

        0.
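
The effective length rule just described can be written out as a short sketch. It only covers the case without a poly(A) tail, and the function name is made up for this illustration.

```python
def effective_length(length, mean_fragment_length):
    """Effective length when no poly(A) tail is added (per the description above)."""
    eff = length - mean_fragment_length + 1
    # Below 1, the effective length (and the abundance estimate) is reported as 0.
    return eff if eff >= 1 else 0.0

print(effective_length(1000, 200.0))  # 801.0
print(effective_length(150, 200.0))   # 0.0
```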

 

        'expected_count' is the sum, over all reads, of the posterior probability

        that each read comes from this transcript. Because 1) each read

        aligning to this transcript has a probability of being generated from

        background noise; 2) RSEM may filter some alignable low quality reads,

        the sum of expected counts for all transcripts is generally less than

        the total number of reads aligned.
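
A toy example may make the definition of 'expected_count' more concrete. The per-read posterior probabilities below are invented numbers, and `expected_counts` is a hypothetical helper, not part of RSEM.

```python
from collections import defaultdict

def expected_counts(read_posteriors):
    """Sum, per transcript, the posterior probability of each read originating from it."""
    counts = defaultdict(float)
    for posteriors in read_posteriors:
        for transcript_id, prob in posteriors.items():
            counts[transcript_id] += prob
    return dict(counts)

# Three reads with made-up posteriors.
reads = [
    {"tx1": 0.7, "tx2": 0.3},  # multi-mapping read shared by two isoforms
    {"tx1": 1.0},              # uniquely assigned read
    {"tx2": 0.5},              # the other 0.5 is attributed to background noise
]
counts = expected_counts(reads)
print(counts)
```

The counts add up to 2.5 for 3 aligned reads, which shows why the total of the expected counts is usually below the number of aligned reads.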

 

        'TPM' stands for Transcripts Per Million. It is a relative measure of

        transcript abundance. The sum of all transcripts' TPM is 1 million.

        'FPKM' stands for Fragments Per Kilobase of transcript per Million

        mapped reads. It is another relative measure of transcript abundance.

        If we define l_bar to be the mean transcript length in a sample, which

        can be calculated as

 

        l_bar = \sum_i TPM_i / 10^6 * effective_length_i (i goes through every

        transcript),

 

        the following equation holds:

 

        FPKM_i = 10^3 / l_bar * TPM_i.

 

        We can see that the sum of FPKM is not a constant across samples.
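
The relation between TPM and FPKM given above can be checked with a few lines of Python. The two "samples" are invented; only the formulas for l_bar and FPKM_i come from the help text.

```python
def tpm_to_fpkm(tpm, effective_lengths):
    """Apply FPKM_i = 10^3 / l_bar * TPM_i, with l_bar = sum_i TPM_i / 10^6 * effective_length_i."""
    l_bar = sum(t / 1e6 * l for t, l in zip(tpm, effective_lengths))
    return [1e3 / l_bar * t for t in tpm]

# Same TPM values, different effective lengths (made-up numbers).
sample_a = tpm_to_fpkm([500000.0, 500000.0], [1000.0, 3000.0])
sample_b = tpm_to_fpkm([500000.0, 500000.0], [500.0, 500.0])
print(sum(sample_a), sum(sample_b))
```

Both samples have TPM values that sum to one million, yet their FPKM sums differ because l_bar differs. The sketch does not guard against the degenerate case where every TPM is 0.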

 

        'IsoPct' stands for isoform percentage. It is the percentage of this

        transcript's abundance over its parent gene's abundance. If its parent

        gene has only one isoform or the gene information is not provided,

        this field will be set to 100.
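
A minimal sketch of the 'IsoPct' calculation follows. The handling of a multi-isoform gene whose isoforms all have a TPM of 0 is an assumption of this example; the help text does not describe that case.

```python
def iso_pct(tpms):
    """Isoform percentages for the isoforms of one gene, given their TPM values."""
    if len(tpms) == 1:
        return [100.0]            # single-isoform gene: always 100
    total = sum(tpms)
    if total == 0:
        return [0.0] * len(tpms)  # assumption: unexpressed gene gets 0 for each isoform
    return [100.0 * t / total for t in tpms]

print(iso_pct([30.0, 10.0]))  # [75.0, 25.0]
print(iso_pct([5.0]))         # [100.0]
```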

 

        'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean

        estimates calculated by RSEM's Gibbs sampler.

        'posterior_standard_deviation_of_count' is the posterior standard

        deviation of counts. 'IsoPct_from_pme_TPM' is the isoform percentage

        calculated from 'pme_TPM' values.

 

        'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and

        'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95%

        credibility intervals for TPM and FPKM values. The bounds are

        inclusive (i.e. [l, u]).

 

        'TPM_coefficient_of_quartile_variation' and

        'FPKM_coefficient_of_quartile_variation' are coefficients of quartile

        variation (CQV) for TPM and FPKM values. CQV is a robust way of

        measuring the ratio between the standard deviation and the mean. It is

        defined as

 

        CQV := (Q3 - Q1) / (Q3 + Q1),

 

        where Q1 and Q3 are the first and third quartiles.
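
The CQV definition translates directly into code. In the sketch below the quartiles are estimated with Python's `statistics.quantiles` (default "exclusive" method). That choice is an assumption for illustration, since the help text does not say how RSEM estimates Q1 and Q3 from its samples.

```python
import statistics

def cqv(q1, q3):
    """Coefficient of quartile variation: (Q3 - Q1) / (Q3 + Q1)."""
    return (q3 - q1) / (q3 + q1)

def cqv_from_samples(samples):
    """Estimate Q1 and Q3 from samples, then compute the CQV."""
    q1, _median, q3 = statistics.quantiles(samples, n=4)
    return cqv(q1, q3)

print(cqv(2.0, 6.0))  # 0.5
print(cqv_from_samples([1, 2, 3, 4, 5, 6, 7]))
```

The formula is undefined when Q1 + Q3 is 0, which this sketch does not handle.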

 

    sample_name.genes.results

        File containing gene level expression estimates. The first line

        contains column names separated by the tab character. The format of

        each line in the rest of this file is:

 

        gene_id transcript_id(s) length effective_length expected_count TPM

        FPKM [posterior_mean_count posterior_standard_deviation_of_count

        pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound

        TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound

        FPKM_ci_upper_bound FPKM_coefficient_of_quartile_variation]

 

        Fields are separated by the tab character. Fields within "[]" are

        optional. They will not be presented if neither '--calc-pme' nor

        '--calc-ci' is set.

 

        'transcript_id(s)' is a comma-separated list of transcript_ids

        belonging to this gene. If no gene information is provided, 'gene_id'

        and 'transcript_id(s)' are identical (the 'transcript_id').

 

        A gene's 'length' and 'effective_length' are defined as the weighted

        average of its transcripts' lengths and effective lengths (weighted by

        'IsoPct'). A gene's abundance estimates are just the sum of its

        transcripts' abundance estimates.
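
The gene-level definitions above can be sketched as follows. The dictionary keys reuse the column names of 'sample_name.isoforms.results'; the function name and the uniform fallback for genes whose 'IsoPct' values are all 0 are assumptions of this example.

```python
def gene_summary(isoforms):
    """Aggregate isoform records (dicts keyed by isoforms.results column names) into one gene."""
    weights = [iso["IsoPct"] / 100.0 for iso in isoforms]
    if sum(weights) == 0:
        # Assumption: weight isoforms equally when none of them is expressed.
        weights = [1.0] * len(isoforms)
    total_w = sum(weights)
    return {
        # Lengths are IsoPct-weighted averages.
        "length": sum(w * iso["length"] for w, iso in zip(weights, isoforms)) / total_w,
        "effective_length": sum(
            w * iso["effective_length"] for w, iso in zip(weights, isoforms)
        ) / total_w,
        # Abundance estimates are plain sums.
        "expected_count": sum(iso["expected_count"] for iso in isoforms),
        "TPM": sum(iso["TPM"] for iso in isoforms),
    }

gene = gene_summary([
    {"length": 1000, "effective_length": 801.0, "IsoPct": 25.0, "TPM": 10.0, "expected_count": 5.0},
    {"length": 2000, "effective_length": 1801.0, "IsoPct": 75.0, "TPM": 30.0, "expected_count": 20.0},
])
print(gene)
```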

 

    sample_name.alleles.results

        Only generated when the RSEM references are built with allele-specific

        transcripts.

 

        This file contains allele level expression estimates for

        allele-specific expression calculation. The first line contains column

        names separated by the tab character. The format of each line in the

        rest of this file is:

 

        allele_id transcript_id gene_id length effective_length expected_count

        TPM FPKM AlleleIsoPct AlleleGenePct [posterior_mean_count

        posterior_standard_deviation_of_count pme_TPM pme_FPKM

        AlleleIsoPct_from_pme_TPM AlleleGenePct_from_pme_TPM

        TPM_ci_lower_bound TPM_ci_upper_bound

        TPM_coefficient_of_quartile_variation FPKM_ci_lower_bound

        FPKM_ci_upper_bound FPKM_coefficient_of_quartile_variation]

 

        Fields are separated by the tab character. Fields within "[]" are

        optional. They will not be presented if neither '--calc-pme' nor

        '--calc-ci' is set.

 

        'allele_id' is the allele-specific name of this allele-specific

        transcript.

 

        'AlleleIsoPct' stands for allele-specific percentage on isoform level.

        It is the percentage of this allele-specific transcript's abundance

        over its parent transcript's abundance. If its parent transcript has

        only one allele variant form, this field will be set to 100.

 

        'AlleleGenePct' stands for allele-specific percentage on gene level.

        It is the percentage of this allele-specific transcript's abundance

        over its parent gene's abundance.

 

        'AlleleIsoPct_from_pme_TPM' and 'AlleleGenePct_from_pme_TPM' have

        similar meanings. They are calculated based on posterior mean

        estimates.

 

        Please note that if this file is present, the fields 'length' and

        'effective_length' in 'sample_name.isoforms.results' should be

        interpreted similarly as the corresponding definitions in

        'sample_name.genes.results'.

 

    sample_name.transcript.bam

        Only generated when --no-bam-output is not specified.

 

        'sample_name.transcript.bam' is a BAM-formatted file of read

        alignments in transcript coordinates. The MAPQ field of each alignment

        is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the

        posterior probability of that alignment being the true mapping of a

        read. In addition, RSEM pads a new tag ZW:f:value, where value is a

        single precision floating number representing the posterior

        probability. Because this file contains all alignment lines produced

        by bowtie or user-specified aligners, it can also be used as a

        replacement of the aligner generated BAM/SAM file.
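
The MAPQ formula above is easy to reproduce. The sketch below also reads the posterior probability back from a ZW tag string; `rsem_mapq` and `parse_zw` are hypothetical helper names, and the special case for w = 1 is added here only to avoid taking log10(0).

```python
import math

def rsem_mapq(w):
    """MAPQ = min(100, floor(-10 * log10(1 - w) + 0.5)) for posterior probability w."""
    if w >= 1.0:
        return 100  # log10(0) is undefined; the cap of 100 applies
    return min(100, math.floor(-10 * math.log10(1.0 - w) + 0.5))

def parse_zw(tag):
    """Extract the posterior probability from a 'ZW:f:<value>' tag string."""
    name, type_code, value = tag.split(":", 2)
    if (name, type_code) != ("ZW", "f"):
        raise ValueError(f"not a ZW:f tag: {tag!r}")
    return float(value)

for w in (0.0, 0.5, 0.9, 1.0):
    print(w, rsem_mapq(w))
print(rsem_mapq(parse_zw("ZW:f:0.9")))
```

An alignment with posterior probability 0.9 gets MAPQ 10, and one with 0.5 gets MAPQ 3.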

 

    sample_name.transcript.sorted.bam and

    sample_name.transcript.sorted.bam.bai

        Only generated when --no-bam-output is not specified and

        --sort-bam-by-coordinate is specified.

 

        'sample_name.transcript.sorted.bam' and

        'sample_name.transcript.sorted.bam.bai' are the sorted BAM file and

        indices generated by samtools (included in RSEM package).

 

    sample_name.genome.bam

        Only generated when --no-bam-output is not specified and

        --output-genome-bam is specified.

 

        'sample_name.genome.bam' is a BAM-formatted file of read alignments in

        genomic coordinates. Alignments of reads that have identical genomic

        coordinates (i.e., alignments to different isoforms that share the

        same genomic region) are collapsed into one alignment. The MAPQ field

        of each alignment is set to min(100, floor(-10 * log10(1.0 - w) +

        0.5)), where w is the posterior probability of that alignment being

        the true mapping of a read. In addition, RSEM pads a new tag

        ZW:f:value, where value is a single precision floating number

        representing the posterior probability. If an alignment is spliced, a

        XS:A:value tag is also added, where value is either '+' or '-'

        indicating the strand of the transcript it aligns to.

 

    sample_name.genome.sorted.bam and

    sample_name.genome.sorted.bam.bai

        Only generated when --no-bam-output is not specified, and

        --sort-bam-by-coordinate and --output-genome-bam are specified.

 

        'sample_name.genome.sorted.bam' and

        'sample_name.genome.sorted.bam.bai' are the sorted BAM file and

        indices generated by samtools (included in RSEM package).

 

    sample_name.time

        Only generated when --time is specified.

 

        It contains time (in seconds) consumed by aligning reads, estimating

        expression levels and calculating credibility intervals.

 

    sample_name.log

        Only generated when --alignments is not specified.

 

        It captures alignment statistics outputted from the user-specified

        aligner.

 

    sample_name.stat

        This is a folder instead of a file. All model related statistics are

        stored in this folder. 'rsem-plot-model' can generate plots using

        this folder.

 

        'sample_name.stat/sample_name.cnt' contains alignment statistics. The

        format and meanings of each field are described in

        'cnt_file_description.txt' under RSEM directory.

 

        'sample_name.stat/sample_name.model' stores RNA-Seq model parameters

        learned from the data. The format and meanings of each field of this

        file are described in 'model_file_description.txt' under RSEM

        directory.

 

        The following four output files will be generated only by

        prior-enhanced RSEM

 

        - 'sample_name.stat/sample_name_prsem.all_tr_features'

          It stores isoform features for deriving and assigning pRSEM prior.

          The first line is a header and the rest is one isoform per line. The

          description for each column is:

 

          * trid: transcript ID from input annotation

 

          * geneid: gene ID from input annotation

 

          * chrom: isoform's chromosome name

 

          * strand: isoform's strand name

 

          * start: isoform's end with the lowest genomic loci

 

          * end: isoform's end with the highest genomic loci

 

          * tss_mpp: average mappability of [TSS-500bp, TSS+500bp], where TSS

            is isoform's transcription start site, i.e. 5'-end

 

          * body_mpp: average mappability of (TSS+500bp, TES-500bp), where TES

            is isoform's transcription end site, i.e. 3'-end

 

          * tes_mpp: average mappability of [TES-500bp, TES+500bp]

 

          * pme_count: isoform's fragment or read count from RSEM's posterior

            mean estimates

 

          * tss: isoform's TSS loci

 

          * tss_pk: equal to 1 if isoform's [TSS-500bp, TSS+500bp] region

            overlaps with a RNA Pol II peak; 0 otherwise

 

          * is_training: equal to 1 if isoform is in the training set where

            Pol II prior is learned; 0 otherwise

 

        - 'sample_name.stat/sample_name_prsem.all_tr_prior'

          It stores prior parameters for every isoform. This file does not

          have a header. Each line contains a prior parameter and an isoform's

          transcript ID delimited by ` # `.

 

        - 'sample_name.stat/sample_name_uniform_prior_1.isoforms.results'

          RSEM's posterior mean estimates on the isoform level with an initial

          pseudo-count of one for every isoform. It is in the same format as

          the 'sample_name.isoforms.results'.

 

        - 'sample_name.stat/sample_name_uniform_prior_1.genes.results'

          RSEM's posterior mean estimates on the gene level with an initial

          pseudo-count of one for every isoform. It is in the same format as

          the 'sample_name.genes.results'.

 

        When learning prior from multiple external data sets in prior-enhanced

        RSEM, two additional output files will be generated.

 

        - 'sample_name.stat/sample_name.pval_LL'

          It stores a p-value and a log-likelihood. The p-value indicates

          whether the combination of multiple complementary data sets is

          informative for RNA-seq quantification. The log-likelihood shows how

          well pRSEM's Dirichlet-multinomial model fits the read counts of

          partitioned training set isoforms.

 

        - 'sample_name.stat/sample_name.lgt_mdl.RData'

          It stores an R object named 'glmmdl', which is a logistic regression

          model on the training set isoforms and multiple external data sets.

 

        In addition, extra columns will be added to

        'sample_name.stat/all_tr_features'

 

        * is_expr: equal to 1 if isoform has an abundance >= 1 TPM and a

          non-zero read count from RSEM's posterior mean estimates; 0

          otherwise

 

        * "$external_data_set_basename": log10 of external data's signal at

          [TSS-500, TSS+500]. Signal is the number of reads aligned within

          that interval and normalized to RPKM by read depth and interval

          length. It will be set to -4 if no read aligned to that interval.

 

          There are multiple columns like this one, where each represents an

          external data set.

 

        * prd_expr_prob: predicted probability from logistic regression model

          on whether this isoform is expressed or not. A probability higher

          than 0.5 is considered as expressed

 

        * partition: group index, to which this isoform is partitioned

 

        * prior: prior parameter for this isoform
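
The external-data signal column and the 'prd_expr_prob' threshold described in the list above could be computed roughly as below. The function names are hypothetical, and the default interval length of 1001 bp (both ends of [TSS-500, TSS+500] included) is an assumption; pRSEM may count the interval differently.

```python
import math

def tss_signal(read_count, total_reads, interval_bp=1001):
    """log10 RPKM of reads aligned within the TSS interval, or -4 if there are none."""
    if read_count == 0:
        return -4.0
    # RPKM: reads per kilobase of interval per million reads.
    rpkm = read_count / (interval_bp / 1e3) / (total_reads / 1e6)
    return math.log10(rpkm)

def is_predicted_expressed(prd_expr_prob):
    """An isoform is considered expressed when the predicted probability exceeds 0.5."""
    return prd_expr_prob > 0.5

print(tss_signal(0, 1_000_000))
print(tss_signal(1000, 1_000_000, interval_bp=1000))
print(is_predicted_expressed(0.73))
```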

 

EXAMPLES

    Assume the path to the bowtie executables is in the user's PATH

    environment variable. Reference files are under '/ref' with name

    'mouse_125'.

 

    1) '/data/mmliver.fq', single-end reads with quality scores. Quality

    scores are encoded as for 'GA pipeline version >= 1.3'. We want to use 8

    threads and generate a genome BAM file. In addition, we want to append

    gene/transcript names to the result files:

 

     rsem-calculate-expression --phred64-quals \

                               -p 8 \

                               --append-names \

                               --output-genome-bam \

                               /data/mmliver.fq \

                               /ref/mouse_125 \

                               mmliver_single_quals

 

    2) '/data/mmliver_1.fq' and '/data/mmliver_2.fq', stranded paired-end

    reads with quality scores. Suppose the library is prepared using TruSeq

    Stranded Kit, which means the first mate should map to the reverse strand.

    Quality scores are in SANGER format. We want to use 8 threads and do not

    generate a genome BAM file:

 

     rsem-calculate-expression -p 8 \

                               --paired-end \

                               --strandedness reverse \

                               /data/mmliver_1.fq \

                               /data/mmliver_2.fq \

                               /ref/mouse_125 \

                               mmliver_paired_end_quals

 

    3) '/data/mmliver.fa', single-end reads without quality scores. We want to

    use 8 threads:

 

     rsem-calculate-expression -p 8 \

                               --no-qualities \

                               /data/mmliver.fa \

                               /ref/mouse_125 \

                               mmliver_single_without_quals

 

    4) Data are the same as 1). This time we assume the bowtie executables are

    under '/sw/bowtie'. We want to take a fragment length distribution into

    consideration. We set the fragment length mean to 150 and the standard

    deviation to 35. In addition to a BAM file, we also want to generate

    credibility intervals. We allow RSEM to use 1GB of memory for CI

    calculation:

 

     rsem-calculate-expression --bowtie-path /sw/bowtie \

                               --phred64-quals \

                               --fragment-length-mean 150.0 \

                               --fragment-length-sd 35.0 \

                               -p 8 \

                               --output-genome-bam \

                               --calc-ci \

                               --ci-memory 1024 \

                               /data/mmliver.fq \

                               /ref/mouse_125 \

                               mmliver_single_quals

 

    5) '/data/mmliver_paired_end_quals.bam', BAM-formatted alignments for

    paired-end reads with quality scores. We want to use 8 threads:

 

     rsem-calculate-expression --paired-end \

                               --alignments \

                               -p 8 \

                               /data/mmliver_paired_end_quals.bam \

                               /ref/mouse_125 \

                               mmliver_paired_end_quals

 

    6) '/data/mmliver_1.fq.gz' and '/data/mmliver_2.fq.gz', paired-end reads

    with quality scores and read files are compressed by gzip. We want to use

    STAR to align reads and assume STAR executable is '/sw/STAR'. Suppose we

    want to use 8 threads and do not generate a genome BAM file:

 

     rsem-calculate-expression --paired-end \

                               --star \

                               --star-path /sw/STAR \

                               --gzipped-read-file \

                               -p 8 \

                               /data/mmliver_1.fq.gz \

                               /data/mmliver_2.fq.gz \

                               /ref/mouse_125 \

                               mmliver_paired_end_quals

 

    7) In the above example, suppose we want to run prior-enhanced RSEM

    instead. Assuming we want to learn priors from a ChIP-seq peak file

    '/data/mmliver.narrowPeak.gz':

 

     rsem-calculate-expression --star \

                               --star-path /sw/STAR \

                               --gzipped-read-file \

                               --paired-end \

                               --calc-pme \

                               --run-pRSEM \

                               --chipseq-peak-file /data/mmliver.narrowPeak.gz \

                               -p 8 \

                               /data/mmliver_1.fq.gz \

                               /data/mmliver_2.fq.gz \

                               /ref/mouse_125 \

                               mmliver_paired_end_quals

 

    8) Similar to the example in 7), suppose we want to use the partition

    model 'pk_lm2nopk' (partitioning isoforms by Pol II TSS peak first and

    then partitioning 'no TSS peak' isoforms into two bins by a linear

    regression model), and we want to partition isoforms by RNA Pol II's

    ChIP-seq read files '/data/mmliver_PolIIRep1.fq.gz' and

    '/data/mmliver_PolIIRep2.fq.gz', and the control ChIP-seq read files

    '/data/mmliver_ChIPseqCtrl.fq.gz'. Also, assuming Bowtie's executables are

    under '/sw/bowtie/':

 

     rsem-calculate-expression --star \

                               --star-path /sw/STAR \

                               --gzipped-read-file \

                               --paired-end \

                               --calc-pme \

                               --run-pRSEM \

                               --chipseq-target-read-files /data/mmliver_PolIIRep1.fq.gz,/data/mmliver_PolIIRep2.fq.gz \

                               --chipseq-control-read-files /data/mmliver_ChIPseqCtrl.fq.gz \

                               --partition-model pk_lm2nopk \

                               --bowtie-path /sw/bowtie \

                               -p 8 \

                               /data/mmliver_1.fq.gz \

                               /data/mmliver_2.fq.gz \

                               /ref/mouse_125 \

                               mmliver_paired_end_quals

 

    9) Similar to the example in 8), suppose we want to derive prior from four

    histone modification ChIP-seq read data sets: '/data/H3K27Ac.fastq.gz',

    '/data/H3K4me1.fastq.gz', '/data/H3K4me2.fastq.gz', and

    '/data/H3K4me3.fastq.gz'. Also, assuming Bowtie's executables are under

    '/sw/bowtie/':

 

     rsem-calculate-expression --star \

                               --star-path /sw/STAR \

                               --gzipped-read-file \

                               --paired-end \

                               --calc-pme \

                               --run-pRSEM \

                               --partition-model cmb_lgt \

                               --chipseq-read-files-multi-targets /data/H3K27Ac.fastq.gz,/data/H3K4me1.fastq.gz,/data/H3K4me2.fastq.gz,/data/H3K4me3.fastq.gz \

                               --bowtie-path /sw/bowtie \

                               -p 8 \

                               /data/mmliver_1.fq.gz \

                               /data/mmliver_2.fq.gz \

                               /ref/mouse_125 \

                               mmliver_paired_end_quals

 

rsem-generate-data-matrix

$ rsem-generate-data-matrix

Usage: rsem-generate-data-matrix sampleA.[alleles/genes/isoforms].results sampleB.[alleles/genes/isoforms].results ... > output_name.matrix

All result files should have the same file type. The 'expected_count' columns of every result file are extracted to form the data matrix.

There are many other commands as well.

> rsem-

$ rsem-

rsem-bam2readdepth

rsem-extract-reference-transcripts

rsem-get-unique

rsem-preref

rsem-scan-for-paired-end-reads

rsem-bam2wig

rsem-for-ebseq-calculate-clustering-info

rsem-gff3-to-gtf                    

rsem-refseq-extract-primary-assembly

rsem-simulate-reads

rsem-build-read-index

rsem-for-ebseq-find-DE

rsem-parse-alignments

rsem-run-ebseq

rsem-synthesis-reference-transcripts

rsem-calculate-credibility-intervals

rsem-for-ebseq-generate-ngvector-from-clustering-info
rsem-plot-model

rsem-run-em

rsem-tbam2gbam

rsem-calculate-expression

rsem-gen-transcript-plots

rsem-plot-transcript-wiggles

rsem-run-gibbs                                         

rsem-control-fdr

rsem-generate-data-matrix

rsem-prepare-reference

rsem-sam-validator                                     

 

 

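How to run

1. Indexing

Specify the GTF file, the genome FASTA, and finally the index name (the same name is used again in step 2). The bowtie2/STAR/HISAT2 index is also built during the run (files named after the index name, e.g. bowtie2_index.~ for the first command below). With --gff3, a GFF3 annotation can be given instead of a GTF.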

#bowtie2
rsem-prepare-reference --gtf genome.gtf --bowtie2 --bowtie2-path <path>/<to>/<your>/<bowtie2-path> -p 20 genome.fa bowtie2_index

#star
rsem-prepare-reference --gtf genome.gtf --star --star-path <path>/<to>/<your>/<star-path> -p 20 genome.fa star_index

#hisat2
rsem-prepare-reference --gtf genome.gtf --hisat2-hca --hisat2-path <path>/<to>/<your>/<hisat2-path> -p 20 genome.fa hisat2_index
  • --gtf   If this option is on, RSEM assumes that 'reference_fasta_file(s)'  contains the sequence of a genome, and will extract transcript reference sequences using the gene annotations specified in <file>,  which should be in GTF format.
    If this and '--gff3' options are off, RSEM will assume  'reference_fasta_file(s)' contains the reference transcripts. In this   case, RSEM assumes that name of each sequence in the Multi-FASTA files  is its transcript_id. (Default: off)
  • --gff3   The annotation file is in GFF3 format instead of GTF format. RSEM will first convert it to GTF format with the file name 'reference_name.gtf'. Please make sure that 'reference_name.gtf' does not exist. (Default: off)
  • --bowtie2   Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM
    does not handle indel, local and discordant alignments, the Bowtie2 parameters are set in a way to avoid those alignments. In particular, we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1' by default. The last parameter of '--score-min', '-0.1', is the negative of maximum mismatch rate. This rate can be set by option '--bowtie2-mismatch-rate'. If reads are paired-end, we additionally use options '--no-mixed' and '--no-discordant'. (Default: off)
  • --star   Use STAR to align reads. Alignment parameters are from ENCODE3's
    STAR-RSEM pipeline. To save computational time and memory resources,
    STAR's Output BAM file is unsorted. It is stored in RSEM's temporary
    directory with name as 'sample_name.bam'. Each STAR job will have
    its own private copy of the genome in memory. (Default: off)
  • --hisat2-hca   Use HISAT2 to align reads to the transcriptome according to the Human Cell Atlas SMART-Seq2 pipeline. In particular, we use HISAT
    parameters "-k 10 --secondary --rg-id=$sampleToken --rg
    SM:$sampleToken --rg LB:$sampleToken --rg PL:ILLUMINA --rg
    PU:$sampleToken --new-summary --summary-file $sampleName.log
    --met-file $sampleName.hisat2.met.txt --met 5 --mp 1,1 --np 1
    --score-min L,0,-0.1 --rdg 99999999,99999999 --rfg 99999999,99999999
    --no-spliced-alignment --no-softclip --seed 12345". If inputs are
    paired-end reads, we additionally use parameters "--no-mixed
    --no-discordant". (Default: off)
  • --transcript-to-gene-map    Use information from <file> to map from transcript (isoform) ids to  gene ids. Each line of <file> should be of the form:  gene_id transcript_id with the two fields separated by a tab character. If you are using a GTF file for the "UCSC Genes" gene set from the UCSC Genome Browser, then the "knownIsoforms.txt" file (obtained from the "Downloads" section of the UCSC Genome Browser site) is of this format.  If this option is off, then the mapping of isoforms to genes depends on whether the '--gtf' option is specified. If '--gtf' is specified,  then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF  file. Otherwise, RSEM assumes that each sequence in the reference sequence files is a separate gene.  (Default: off)
  • --prep-pRSEM    A Boolean indicating whether to prepare reference files for pRSEM,   including building Bowtie indices for a genome and selecting training set isoforms. The index files will be used for aligning ChIP-seq reads   in prior-enhanced RSEM and the training set isoforms will be used for   learning prior. A path to Bowtie executables and a mappability file in  bigWig format are required when this option is on. Currently, Bowtie2  is not supported for prior-enhanced RSEM. (Default: off)
  • --mappability-bigwig-file   Full path to a whole-genome mappability file in bigWig format. This   file is required for running prior-enhanced RSEM. It is used for  selecting a training set of isoforms for prior-learning. This file can be either downloaded from UCSC Genome Browser or generated by GEM (Derrien et al., 2012, PLoS One). (Default: "")
  • --strandedness <none|forward|reverse>  This option defines the strandedness of the RNA-Seq reads. It recognizes three values: 'none', 'forward', and 'reverse'. 'none'  refers to non-strand-specific protocols. 'forward' means all  (upstream) reads are derived from the forward strand. 'reverse' means    all (upstream) reads are derived from the reverse strand. If  'forward'/'reverse' is set, the '--norc'/'--nofw' Bowtie/Bowtie 2 option will also be enabled to avoid aligning reads to the opposite   strand. For Illumina TruSeq Stranded protocols, please use 'reverse'.    (Default: 'none')

     

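The index files (bowtie2_index.~ and so on) are written. Note that when using STAR you should run this on a machine with plenty of memory (around 32 GB is often not enough). If the STAR index is built on a low-memory machine, the genomeParameters.txt file is sometimes not written, and without this file the next step fails with an error.

For "--transcript-to-gene-map", see the following.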

How to get --transcript-to-gene-map <file> in RSEM?
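
As a rough sketch of one way to produce such a file: the option description above says each line must be `gene_id<TAB>transcript_id`, so the two attributes can be pulled out of a GTF with awk. The demo GTF, the file names, and the assumption that attributes are written as `gene_id "X"; transcript_id "Y";` (with no semicolons inside the values) are illustrative only and are not taken from RSEM itself.

```shell
# Hypothetical demo input: a tiny GTF with two transcripts of one gene.
workdir=$(mktemp -d)
printf 'chr1\tdemo\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'   >  "$workdir/demo.gtf"
printf 'chr1\tdemo\texon\t50\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'  >> "$workdir/demo.gtf"
printf 'chr1\tdemo\texon\t150\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n' >> "$workdir/demo.gtf"

# Pull gene_id and transcript_id out of the attribute column (field 9),
# print them as "gene_id<TAB>transcript_id", and drop duplicate pairs.
awk -F '\t' '
  $3 == "exon" {
    gene = ""; tx = ""
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
      a = attrs[i]
      gsub(/^ +| +$/, "", a)
      if (a ~ /^gene_id /)       { sub(/^gene_id /, "", a);       gsub(/"/, "", a); gene = a }
      if (a ~ /^transcript_id /) { sub(/^transcript_id /, "", a); gsub(/"/, "", a); tx = a }
    }
    if (gene != "" && tx != "") print gene "\t" tx
  }
' "$workdir/demo.gtf" | sort -u > "$workdir/transcript_to_gene.map"

cat "$workdir/transcript_to_gene.map"
```

The resulting file would be passed to rsem-prepare-reference with --transcript-to-gene-map when the reference is a transcript FASTA. When --gtf is given, the option description states that RSEM reads the gene_id and transcript_id attributes itself, so the map is not needed.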

 

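2. rsem-calculate-expression

After the options, specify the paired-end FASTQ files, the index name (for Bowtie 2, the index was named bowtie2_index in step 1, so specify "bowtie2_index"), and the output name. Gzipped FASTQ cannot be used as is, so either decompress the files first or add the "--gzipped-read-file" option (*2). For single-end reads, drop --paired-end.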

#bowtie2
rsem-calculate-expression --paired-end -p 20 --bowtie2 --bowtie2-path <path>/<to>/<your>/<bowtie2-path> \
sample1_R1.fq sample1_R2.fq bowtie2_index sample1

#star
rsem-calculate-expression --paired-end -p 20 --star --star-path <path>/<to>/<your>/<star-path> \
sample1_R1.fq sample1_R2.fq star_index sample1

#hisat2
rsem-calculate-expression --paired-end -p 20 --hisat2-hca --hisat2-path <path>/<to>/<your>/<hisat2-path> \
sample1_R1.fq sample1_R2.fq hisat2_index sample1
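
With several samples, the same command is usually repeated once per sample. A minimal sketch of such a loop, assuming a hypothetical naming scheme `<sample>_R1.fq` / `<sample>_R2.fq`, the `bowtie2_index` name from step 1, and Bowtie 2 available on PATH (so --bowtie2-path is left out). It is a dry run: the commands are only printed, not executed.

```shell
# Hypothetical sample names; files are assumed to be <sample>_R1.fq / <sample>_R2.fq.
samples="sample1 sample2 sample3"
threads=20
reference=bowtie2_index   # index name chosen in step 1

# Dry run: collect the commands as text instead of running them.
cmds=$(
  for s in $samples; do
    echo "rsem-calculate-expression --paired-end -p $threads --bowtie2" \
         "${s}_R1.fq ${s}_R2.fq $reference $s"
  done
)
printf '%s\n' "$cmds"
```

Once the printed commands look right, they could be run with `printf '%s\n' "$cmds" | sh`.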

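sample1.genes.results, sample1.isoforms.results and other files are written.

A BAM file is produced toward the end of the run and saved to disk. Its reference is not the genome but the transcript reference. Because the reads are mapped against transcripts, take care not to mix this up when viewing the BAM in IGV or a similar viewer. With STAR the reference is star_index.transcripts.fa. (Seen another way, because reads are aligned to transcripts, there is no need to worry about how Bowtie handles split alignments or about settings such as the maximum split-alignment distance in HISAT2.)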

 

3. Merging multiple results (merging the expected counts, *1)

#gene level
rsem-generate-data-matrix sample*genes.results > output

#transcripts(isoform) level
rsem-generate-data-matrix sample*isoforms.results > output

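 A table of expected read counts (RSEM's probabilistic read counts, raw values that are not normalized) is written. After fixing only the names in the header row, the table can be loaded into a tool such as iDEP to get results right away. When doing so, first run a PCA or similar to check that the samples fall into sensible groups.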
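
The header fix can be done with sed. A minimal sketch, assuming the first line of the matrix holds the quoted result-file names (e.g. `"sample1.genes.results"`) and that IDs are quoted as well. The demo matrix is made up for illustration, so check the layout against your actual output before relying on this.

```shell
workdir=$(mktemp -d)
# Hypothetical demo matrix standing in for the rsem-generate-data-matrix output.
printf '\t"sample1.genes.results"\t"sample2.genes.results"\n' >  "$workdir/output"
printf '"geneA"\t10.00\t12.50\n'                               >> "$workdir/output"
printf '"geneB"\t0.00\t3.00\n'                                 >> "$workdir/output"

# Line 1 only: drop the ".genes.results" suffix. All lines: strip double quotes.
sed -e '1s/\.genes\.results//g' -e 's/"//g' "$workdir/output" > "$workdir/output.clean.tsv"

cat "$workdir/output.clean.tsv"
```

For isoform-level tables the suffix in the first sed expression would be `.isoforms.results` instead.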

 

 

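Other notes

Mapping can also be run separately, with the resulting BAM file handed to RSEM. In that case, pay attention to the alignment options, especially the settings for mapping to repeats.

For example, with STAR

1. Indexing and mapping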

#index
STAR --runMode genomeGenerate --genomeDir STAR_index --genomeFastaFiles genome.fa --sjdbGTFfile annotation.gtf

#mapping
STAR --runThreadN 12 --genomeDir STAR_index --readFilesIn pair_R1.fq.gz pair_R2.fq.gz --readFilesCommand zcat --quantMode TranscriptomeSAM --outSAMtype BAM SortedByCoordinate --genomeLoad NoSharedMemory --outFilterMultimapNmax 1 --outFileNamePrefix sample1

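2. Read counting

When passing a BAM file, set the --alignments flag. RSEM expects reads aligned to the transcript sequences, so with the STAR command above the file to pass (input.bam in the command below) is the transcriptome BAM produced by --quantMode TranscriptomeSAM, sample1Aligned.toTranscriptome.out.bam, not the genome-coordinate BAM. The reference name (bowtie2_index below) must be an RSEM reference built with rsem-prepare-reference from the same annotation. No aligner option such as --bowtie2 is needed, since RSEM does not run an aligner in this mode.

rsem-calculate-expression --alignments --paired-end -p 20 input.bam bowtie2_index sample1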
  • --alignments   Input file contains alignments in SAM/BAM/CRAM format. The exact file format will be determined automatically. (Default: off)

 

Citation

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Bo Li, Colin N Dewey

BMC Bioinformatics. 2011 Aug 4 

 



*1 

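genes.results and isoforms.results report three kinds of values: the probabilistic (expected) count, and the normalized TPM and FPKM values. rsem-generate-data-matrix merges the expected counts. There is no option to switch this, so to get TPM or FPKM, cut out just that column from each file and paste the columns side by side.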

#TPM
cut -f 6 sample1.genes.results > sample1
cut -f 6 sample2.genes.results > sample2
cut -f 6 sample3.genes.results > sample3
paste sample1 sample2 sample3 > TPM

#FPKM
cut -f 7 sample1.genes.results > sample1
cut -f 7 sample2.genes.results > sample2
cut -f 7 sample3.genes.results > sample3
paste sample1 sample2 sample3 > FPKM
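
The cut/paste commands above drop the gene IDs and leave every column headed "TPM" (or "FPKM"). A slightly extended sketch that keeps the gene_id column and labels each column with its sample name, assuming gene_id is column 1, TPM is column 6 and FPKM is column 7 (as in the commands above), and that every result file lists the genes in the same order. The demo files and values are made up.

```shell
workdir=$(mktemp -d)
# Hypothetical demo result files with the 7-column genes.results layout.
for s in sample1 sample2; do
  printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n' > "$workdir/$s.genes.results"
done
printf 'geneA\ttxA1\t1000\t800\t10.00\t5.5\t4.0\n' >> "$workdir/sample1.genes.results"
printf 'geneB\ttxB1\t2000\t1800\t0.00\t0.0\t0.0\n' >> "$workdir/sample1.genes.results"
printf 'geneA\ttxA1\t1000\t800\t20.00\t9.1\t7.2\n' >> "$workdir/sample2.genes.results"
printf 'geneB\ttxB1\t2000\t1800\t4.00\t1.3\t1.1\n' >> "$workdir/sample2.genes.results"

col=6   # 6 = TPM, 7 = FPKM
# Start with the gene_id column, then append one column per sample,
# replacing the column header ("TPM") with the sample name.
cut -f 1 "$workdir/sample1.genes.results" > "$workdir/merged.tsv"
for s in sample1 sample2; do
  cut -f "$col" "$workdir/$s.genes.results" | sed "1s/.*/$s/" > "$workdir/$s.col"
  paste "$workdir/merged.tsv" "$workdir/$s.col" > "$workdir/merged.tmp"
  mv "$workdir/merged.tmp" "$workdir/merged.tsv"
done

cat "$workdir/merged.tsv"
```

Setting `col=7` gives the FPKM table instead. If the files might list genes in different orders, a join on gene_id would be safer than paste.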

 

*2

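However, some versions of rsem do not have this option. Also, gzipped FASTQ is accepted implicitly without it, but in my case the results sometimes changed considerably (I do not know whether this happens only for me). If this gets confusing, either pass decompressed FASTQ files, or run the alignment yourself and start rsem from the BAM.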
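
For the decompression route, a minimal sketch of a helper that returns plain FASTQ paths unchanged and decompresses `.gz` inputs next to the original. The function name `plain_fastq` and the demo file are hypothetical.

```shell
# plain_fastq FILE: print the path of an uncompressed version of FILE.
# Paths not ending in .gz are printed unchanged.
plain_fastq() {
  case "$1" in
    *.gz)
      out="${1%.gz}"
      gzip -dc "$1" > "$out" && printf '%s\n' "$out"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}

# Demo with a made-up single-read FASTQ.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/demo_R1.fq"
gzip -c "$workdir/demo_R1.fq" > "$workdir/demo_copy_R1.fq.gz"

r1=$(plain_fastq "$workdir/demo_copy_R1.fq.gz")
echo "$r1"
```

The returned path can then be given to rsem-calculate-expression in place of the gzipped file.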

 

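Addendum

Benchmark paper

Seven competing pipelines were evaluated on two independent datasets. Performance was generally poor: two methods were clearly worse than the others, and RSEM slightly outperformed the rest.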

