2019 12/2 インストール追記
2020 12/9 誤字修正, help追加
2022 インストール手順追記
RSeQCはクオリティ、GCバイアス、PCRバイアス、ヌクレオチド組成バイアス、シーケンスのデプス、strandの特異性、カバレッジ均一性およびゲノムのfeature上のカバレッジ分布など、RNA-seq実験を総合的に評価するパッケージ。SAMとBAMを入力として利用できる。可視化にはRを利用している。
マニュアル
インストール
依存
- gcc
- python2.7
- numpy
- R
#bioconda(link)
mamba create -n RSeQC -y
conda activate RSeQC
mamba install -c bioconda -y rseqc -y
#pip
pip install RSeQC
> bam2fq.py -h
$ bam2fq.py -h
Usage: bam2fq.py [options]
Description: Convert alignments in BAM or SAM format into fastq format.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output fastq files(s).
-s, --single-end Specificy '-s' or '--single-end' for single-end
sequencing.
-c, --compress Specificy '-c' or '--compress' to compress output
fastq file(s) using 'gzip' command.
> bam2wig.py -h
$ bam2wig.py -h
Usage: bam2wig.py [options]
Convert BAM file into wig file. BAM file must be sorted and indexed using SAMtools.
Note: SAM format file is not supported.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM format. BAM file must be sorted
and indexed using samTools. .bam and .bai files should
be placed in the same directory.
-s CHROMSIZE, --chromSize=CHROMSIZE
Chromosome size file. Tab or space separated text file
with 2 columns: first column is chromosome name/ID,
second column is chromosome size. Chromosome name
(such as "chr1") should be consistent between this
file and the BAM file.
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output wiggle files(s). One wiggle file will
be generated for non strand-specific data, two wiggle
files ("Prefix_Forward.wig" and "Prefix_Reverse.wig")
will be generated for strand-specific RNA-seq data.
-t TOTAL_WIGSUM, --wigsum=TOTAL_WIGSUM
Specified wigsum. Eg: 1,000,000,000 equals to coverage
of 10 million 100nt reads. Ignore this option to
disable normalization
-u, --skip-multi-hits
Skip non-unique hit reads.
-d STRAND_RULE, --strand=STRAND_RULE
How read(s) were stranded during sequencing. For
example: --strand='1++,1--,2+-,2-+' means that this is
a pair-end, strand-specific RNA-seq data, and the
strand rule is: read1 mapped to '+' => parental gene
on '+'; read1 mapped to '-' => parental gene on '-';
read2 mapped to '+' => parental gene on '-'; read2
mapped to '-' => parental gene on '+'. If you are not
sure about the strand rule, run 'infer_experiment.py'
default=none (Not a strand specific RNA-seq data).
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality to determine "uniquely
mapped". default=30
> bam_stat.py -h
$ bam_stat.py -h
Usage: bam_stat.py [options]
Summarizing mapping statistics of a BAM or SAM file.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) to determine
"uniquely mapped" reads. default=30
> mismatch_profile.py -h
$ mismatch_profile.py -h
Usage: mismatch_profile.py [options]
Calculate the distribution of mismatches across reads. Note that the "MD" tag must exist in BAM file.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_BAM, --input=INPUT_BAM
Input BAM file. [required]
-l READ_ALIGNMENT_LENGTH, --read-align-length=READ_ALIGNMENT_LENGTH
Alignment length of read. It is usually set to the
orignial read length. For example, all these cigar
strings ("101M", "68M140N33M", "53M1D48M") suggest the
read alignment length is 101. [required]
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s). [required]
-n READ_NUMBER, --read-num=READ_NUMBER
Number of aligned reads with mismatches used to
calculate the mismatch profile. default=1000000
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality. default=30
> clipping_profile.py -h
$ clipping_profile.py -h
Usage: clipping_profile.py [options]
This program is used estimate clipping profile of RNA-seq reads from BAM or SAM file
Note that to use this funciton, CIGAR strings within SAM/BAM file should have 'S' operation
(This means your reads mapper should support clipped mapping).
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s).
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be considered as "uniquely mapped".
default=30
-s LAYOUT, --sequencing=LAYOUT
Sequencing layout. "SE"(single-end) or "PE"(pair-end).
> deletion_profile.py -h
$ deletion_profile.py -h
Usage: deletion_profile.py [options]
Calculate the distributions of deleted nucleotides across reads.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_BAM, --input=INPUT_BAM
Input BAM file. [required]
-l READ_ALIGNMENT_LENGTH, --read-align-length=READ_ALIGNMENT_LENGTH
Alignment length of read. It is usually set to the
orignial read length. For example, all these cigar
strings ("101M", "68M140N33M", "53M1D48M") suggest the
read alignment length is 101. [required]
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s). [required]
-n READ_NUMBER, --read-num=READ_NUMBER
Number of aligned reads with deletions used to
calculate the deletion profile. default=1000000
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality. default=30
> insertion_profile.py -h
$ insertion_profile.py -h
Usage: insertion_profile.py [options]
Calculate the distributions of inserted nucleotides across reads
Note CIGAR strings within SAM/BAM file should have 'I' operation
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s).
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be considered as "uniquely mapped".
default=30
-s LAYOUT, --sequencing=LAYOUT
Sequencing layout. "SE"(single-end) or "PE"(pair-end).
(p
> divide_bam.py -h
$ divide_bam.py -h
Usage: divide_bam.py [options]
Equally divide BAM file (m alignments) into n parts. Each part contains roughly m/n alignments
that are randomly sampled from total alignments.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM format. BAM file should be
sorted and indexed.
-n SUBSET_NUM, --subset-num=SUBSET_NUM
Number of small BAM files
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output BAM files. Output "Prefix_num.bam".
-s, --skip-unmap Skip unmapped reads.
> geneBody_coverage.py -h
$ geneBody_coverage.py -h
Usage: geneBody_coverage.py [options]
Calculate the RNA-seq reads coverage over gene body.
Note:
1) Only input sorted and indexed BAM file(s). SAM format is not supported.
2) Genes/transcripts with mRNA length < 100 will be skipped (Number specified to "-l" cannot be < 100).
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILES, --input=INPUT_FILES
Input file(s) in BAM format. "-i" takes these input:
1) a single BAM file. 2) "," separated BAM files. 3)
directory containing one or more bam files. 4) plain
text file containing the path of one or more bam file
(Each row is a BAM file path). All BAM files should be
sorted and indexed using samtools.
-r REF_GENE_MODEL, --refgene=REF_GENE_MODEL
Reference gene model in bed format. [required]
-l MIN_MRNA_LENGTH, --minimum_length=MIN_MRNA_LENGTH
Minimum mRNA length (bp). mRNA smaller than
"min_mRNA_length" will be skipped. default=100
-f OUTPUT_FORMAT, --format=OUTPUT_FORMAT
Output file format, 'pdf', 'png' or 'jpeg'.
default=pdf
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s). [required]
> inner_distance.py -h
$ inner_distance.py -h
Usage: inner_distance.py [options]
Calculate the inner distance (insert size) of RNA-seq fragments.
RNA fragment
_________________||_________________
| |
| |
||||||||||------------------||||||||||
read_1 insert_size read_2
fragment size = read_1 + insert_size + read_2
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s)
-r REF_GENE, --refgene=REF_GENE
Reference gene model in BED format.
-k SAMPLESIZE, --sample-size=SAMPLESIZE
Number of read-pairs used to estimate inner distance.
default=1000000
-l LOWER_BOUND_SIZE, --lower-bound=LOWER_BOUND_SIZE
Lower bound of inner distance (bp). This option is
used for ploting histograme. default=-250
-u UPPER_BOUND_SIZE, --upper-bound=UPPER_BOUND_SIZE
Upper bound of inner distance (bp). This option is
used for plotting histogram. default=250
-s STEP_SIZE, --step=STEP_SIZE
Step size (bp) of histograme. This option is used for
plotting histogram. default=5
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be called "uniquely mapped". default=30
> junction_annotation.py -h
$ junction_annotation.py -h
Usage: junction_annotation.py [options]
Annotate splicing reads against gene model in two levels: reads level and juncion level.
Note:
1) A read, especially long read, can be spliced 2 or more times
2) Multiple splicing reads spanning the same intron can be consolidated into one splicing junction.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-r REF_GENE_MODEL, --refgene=REF_GENE_MODEL
Reference gene model in bed format. This file is
better to be a pooled gene model as it will be used to
annotate splicing junctions [required]
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s). [required]
-m MIN_INTRON, --min-intron=MIN_INTRON
Minimum intron length (bp). default=50 [optional]
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be considered as "uniquely mapped".
default=30
> junction_saturation.py -h
$ junction_saturation.py -h
Usage: junction_saturation.py [options]
Check if sequencing depth is saturated or not, based on the idea that when sequencing depth is
approaching saturation, less NEW junctions will be detected.
See http://rseqc.sourceforge.net/ for details.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.[required]
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s). [required]
-r REFGENE_BED, --refgene=REFGENE_BED
Reference gene model in bed fomat. This gene model is
used to determine known splicing junctions. [required]
-l PERCENTILE_LOW_BOUND, --percentile-floor=PERCENTILE_LOW_BOUND
Sampling starts from this percentile. A integer
between 0 and 100. default=5
-u PERCENTILE_UP_BOUND, --percentile-ceiling=PERCENTILE_UP_BOUND
Sampling ends at this percentile. A integer between 0
and 100. default=100
-s PERCENTILE_STEP, --percentile-step=PERCENTILE_STEP
Sampling frequency. Smaller value means more sampling
times. A integer between 0 and 100. default=5
-m MINIMUM_INTRON_SIZE, --min-intron=MINIMUM_INTRON_SIZE
Minimum intron size (bp). default=50
-v MINIMUM_SPLICE_READ, --min-coverage=MINIMUM_SPLICE_READ
Minimum number of supportting reads to call a
junction. default=1
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be called "uniquely mapped". default=30
> read_distribution.py -h
$ read_distribution.py -h
Usage: read_distribution.py [options]
Check reads distribution over exon, intron, UTR, intergenic ... etc
The following reads will be skipped:
qc_failed
PCR duplicate
Unmapped
Non-primary (or secondary)
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-r REF_GENE_MODEL, --refgene=REF_GENE_MODEL
Reference gene model in bed format.
> read_duplication.py -h
$ read_duplication.py -h
Usage: read_duplication.py [options]
Calculte reads' duplication rate.
Sequence-based: Reads with identical sequence are considered as "duplicate reads".
Mapping-based: Reads mapped to the exact same location are considered as "duplicate reads".
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s).
-u UPPER_LIMIT, --up-limit=UPPER_LIMIT
Upper limit of reads' occurrence. Only used for
plotting, default=500 (times)
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be considered as "uniquely mapped".
default=30
> read_GC.py -h
$ read_GC.py -h
Usage: read_GC.py [options]
-------------------------------------------------------------------------------------------------
Calculate distribution of reads' GC content
-------------------------------------------------------------------------------------------------
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format.
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s).
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be called "uniquely mapped". default=30
> read_NVC.py -h
$ read_NVC.py -h
Usage: read_NVC.py [options]
-------------------------------------------------------------------------------------------------
For each nucleotide position of read (5'->3'), check the nucleotide frequency. The generated R script will
gives NVC (Nucleotide Versus Cycle) plot.
-------------------------------------------------------------------------------------------------
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Input file in BAM or SAM format.[required]
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s). [required]
-x, --nx Flag option. Presense of this flag tells program to
include N,X in output NVC plot [required]
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be called "uniquely mapped". default=30
> read_quality.py -h
$ read_quality.py -h
Usage: read_quality.py [options]
-------------------------------------------------------------------------------------------------
Calculating Phred Quality Score for each position on read. Note that each read should have
the fixed (same) length
-------------------------------------------------------------------------------------------------
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format. [required]
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output files(s). [required]
-r REDUCE_FOLD, --reduce=REDUCE_FOLD
To avoid making huge vector in R, nucleotide with
particular phred score less frequent than this number
will be ignored. Increase this number save more memory
while reduce precision. Set to 1 achieves maximum
precision (i.e. every nucleotide will be considered).
This option only applies to the 'boxplot'. default=1
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be called "uniquely mapped". default=30
> RNA_fragment_size.py -h
$ RNA_fragment_size.py -h
Usage: RNA_fragment_size.py [options]
calculate fragment size for each gene/transcript. For each transcript/gene, it Will report:
1) # of fragment that was used.
2) mean of fragment size
3) median of fragment size
4) stdev of fragment size
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input=INPUT_FILE
Input BAM file
-r REFGENE_BED, --refgene=REFGENE_BED
Reference gene model in BED format. Must be strandard
12-column BED file. [required]
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be called "uniquely mapped". default=30
-n FRAGMENT_NUM, --frag-num=FRAGMENT_NUM
Minimum number of fragment. default=3
>split_paired_bam.py -h
$ split_paired_bam.py -h
Usage: split_paired_bam.py [options]
-------------------------------------------------------------------------------------------------
Split bam file (pair-end) into 2 single-end bam file
-------------------------------------------------------------------------------------------------
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Alignment file in BAM or SAM format. BAM file should
be sorted and indexed
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX
Prefix of output BAM files. "prefix.R1.bam" file
contains the 1st read, "prefix.R2.bam" file contains
the 2nd read
>tin.py
$ tin.py
Usage: tin.py [options]
-------------------------------------------------------------------------------------------------
This program calculates transcript integrity number (TIN) for each transcript (or gene) in
BED file. TIN is conceptually similar to RIN (RNA integrity number) but provides transcript
level measurement of RNA quality and is more sensitive to measure low quality RNA samples:
1) TIN score of a transcript is used to measure the RNA integrity of the transcript.
2) Median TIN score across all transcripts can be used to measure RNA integrity of that
"RNA sample".
3) TIN ranges from 0 (the worst) to 100 (the best). TIN = 60 means: 60% of the transcript
has been covered if the reads coverage were uniform.
4) TIN will be assigned to 0 if the transcript has no coverage or covered reads is fewer than
cutoff.
-------------------------------------------------------------------------------------------------
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILES, --input=INPUT_FILES
Input BAM file(s). "-i" takes these input: 1) a single
BAM file. 2) "," separated BAM files (no spaces
allowed). 3) directory containing one or more bam
files. 4) plain text file containing the path of one
or more bam files (Each row is a BAM file path). All
BAM files should be sorted and indexed using samtools.
[required]
-r REF_GENE_MODEL, --refgene=REF_GENE_MODEL
Reference gene model in BED format. Must be strandard
12-column BED file. [required]
-c MINIMUM_COVERAGE, --minCov=MINIMUM_COVERAGE
Minimum number of read mapped to a transcript.
default=10
-n SAMPLE_SIZE, --sample-size=SAMPLE_SIZE
Number of equal-spaced nucleotide positions picked
from mRNA. Note: if this number is larger than the
length of mRNA (L), it will be halved until it's
smaller than L. default=100
-s, --subtract-background
Subtract background noise (estimated from intronic
reads). Only use this option if there are substantial
intronic reads.
ラン
マニュアルからテストのbamやbed,gtfなどをダウンロードできる。ここではこのデータを使って動作を確認する。
ペアエンドのbamからfastqに変換。
bam2fq.py -i test_PairedEnd_StrandSpecific_hg19.sam -o bam2fq_out1
- -i INPUT_FILE
- -o OUTPUT_PREFIX
- -s single-end for single-end sequencing.
シングルエンドは-sをつける。
UCSCのwigToBigWigコマンドが見つかれば、出力は自動でbigwigフォーマット(UCSC解説)に変換される。オーサーが準備したscript (fetchChromSizes) をマニュアルに従い準備しておく。
#bamはsamtools sortでcoordinate sortされている必要がある。
samtools sort -m 1000000000 input.bam input.sorted.bam
samtools index input.sorted.bam
#fetchChromSizesでスクリプトでサイズを計算
fetchChromSizes hg19 >hg19.chrom.sizes
bam2wig.py -s hg19.chrom.sizes -i sample.bam -o out -u
- -s Chromosome size file. Tab or space separated text file with 2 columns: first column is chromosome name/ID, second column is chromosome size.
- -u Skip non-unique hit reads.
- -t Specified wigsum. Eg: 1,000,000,000 equals to coverage of 10 million 100nt reads. Ignore this option to disable normalization
- -q Minimum mapping quality for an alignment to be called “uniquely mapped”. default=30
- -d How read(s) were stranded during sequencing. For example: –strand=‘1++,1–,2+-,2-+’ means that this is a pair-end, strand-specific RNA-seq data, and the strand rule is: read1 mapped to ‘+’ => parental gene on ‘+’; read1 mapped to ‘-‘ => parental gene on ‘-‘; read2 mapped to ‘+’ => parental gene on ‘-‘; read2 mapped to ‘-‘ => parental gene on ‘+’. If you are not sure about the strand rule, run ‘infer_experiment.py’ default=none (Not a strand specific RNA-seq data).
bam/samのstatistics出力。
bam_stat.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam
- -q Minimum mapping quality to determine uniquely mapped read.
出力。
$ bam_stat.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam
Load BAM file ... Done
#==================================================
#All numbers are READ count
#==================================================
Total records: 44826454
QC failed: 0
Optical/PCR duplicate: 0
Non primary hits 0
Unmapped reads: 0
mapq < mapq_cut (non-unique): 0
mapq >= mapq_cut (unique): 44826454
Read-1: 28674562
Read-2: 16151892
Reads map to '+': 23884437
Reads map to '-': 20942017
Non-splice reads: 39959172
Splice reads: 4867282
Reads mapped in proper pairs: 31803590
Proper-paired reads map to different chrom:0
リードのクリップアライメントの分布。
(clipped mappingが可能なアライナーを使っている必要がある)。
clipping_profile.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -s "PE" -o out
- -s Sequencing layout. “SE”(single-end) or “PE”(pair-end).
- -q Minimum mapping quality (phred scaled) for an alignment to be considered as “uniquely mapped”. default=30
リード1の出力。
リード2も出力される。
リードのmismatch分布。
mismatch_profile.py -l 101 -i sample.bam -o out
- -l Alignment length of read. It is usually set to the orignial read length. For example, all these cigar strings (“101M”, “68M140N33M”, “53M1D48M”) suggest the read alignment length is 101. [required]
リードのdeletion分布。
deletion_profile.py -i sample.bam -l 101 -o out
- -l Alignment length of read. It is usually set to the orignial read length.
リードのinsertion分布。
insertion_profile.py -s "PE" -i test.bam -o out
- -s Sequencing layout. “SE”(single-end) or “PE”(pair-end).
ランダムにリードをサンプリングしてbamをn個に分割
(ランダムだが均一なリード数になる)。
divide_bam.py -n 3 -i SingleEnd_StrandSpecific_50mer_Human_hg19.bam -o output
- -n Number of small BAM files
遺伝子上のカバレッジ分布
geneBody_coverage.py -r hg19.housekeeping.bed -i test.bam -o output
- -r Reference gene model in bed format. [required]
複数bamなら-i test1.bam,test2.bam,test3.bam 、またはbamのディレクトリを-i /data/ という風に指定する。bigwigから計算するgeneBody_coverage2.pyもある。
ペアエンドの内側のインサートサイズを推測。
inner_distance.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output -r hg19.refseq.bed12
スプライシングジャンクションをリファレンスと比較し、比率を返す。
junction_annotation.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output -r hg19.refseq.bed12
レアなalternative splicingを探すためにデプスが十分か推測。
junction_saturation.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -r hg19.refseq.bed12 -o output
- -r Reference gene model in BED format.
feature上のリードの分布
read_distribution.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -r hg19.refseq.bed12
出力の定義はマニュアルで確認してください。
duplicateの分布。
read_duplication.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
GC含量の分布
read_GC.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
リードの各部位の塩基の分布
read_NVC.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
クオリティの分布
read_quality.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output_quality
各transcriptsごとのリードサイズ
RNA_fragment_size.py -r hg19.RefSeq.union.bed -i SRR873822_RIN10.bam > SRR873822_RIN10.fragSize
bamの分割(rRNA除去など)
split_bam.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -r hg19.rRNA.bed -o output
以下の3種類に分割される。rRNAをbedで与えれば、rRNAにマッピングされなかったリードをbamのまま抽出できる。
- XXX.in.bam: reads that are mapped to exon regions of the gene list (or reads consumed by gene list).
- XXX.ex.bam: reads that cannot be mapped the exon regions of the original gene list.
- XXX.junk.bam: qcfailed reads or unmapped reads.
ペアエンドbamからシングルエンドのbam2つに分割。
split_paired_bam.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output
TIN値からサンプルRNAの完全さを評価。
tin.py -r hg19.RefSeq.union.bed -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output
RIN値(1≦RIN≦10)は18Sおよび28SリボソームRNAの量に大きく依存したリボソームRNA完全性の尺度で、mRNA品質の直接測定ではない。実験ではリボゼロなどでrRNAを除いているため、下流のフローでRNAの完全さを分析できない。実際の状況は、「AU-リッチ配列」、「転写長」、「GC含量」、「二次構造」および「RNA-タンパク質複合体」などの因子に応じて、分解速度は転写物間で著しく異なることがある(ref:http://www.illumina.com/documents/products/technotes/technote-truseq-rna-access.pdf)。 これらの限界を克服するために、転写レベルでRNAの完全性を測定できるアルゴリズムTIN(リンク)が考案された。 TINは、発現された各転写物のスコア(0≦TIN≦100)を計算する。また全transcriptsの平均のmedTINはサンプル間のRNA完全性を比較するために使用できる。
ほかにも定量してFPKMを出すFPKM_count.pyや、RNA seqの種類を推測するinfer_experiment.py (strand specificかなどをマッピングから推測)、bigwigをノーマライズするnormalize_bigwig.py、2つのbigwigファイルを操作するoverlay_bigwig.py、ヘキサマーの頻度をリファレンスとリードで調べるread_hexamer.pyがある。
引用
RSeQC: quality control of RNA-seq experiments.
Wang L, Wang S, Li W.
Bioinformatics. 2012 Aug 15;28(16):2184-5.