macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

RNA seqのクオリティコントロールツール RSeQC

2019 12/2 インストール追記

2020 12/9 誤字修正, help追加

2022 インストール手順追記

 

RSeQCはクオリティ、GCバイアス、PCRバイアス、ヌクレオチド組成バイアス、シーケンスのデプス、strandの特異性、カバレッジ均一性およびゲノムのfeature上のカバレッジ分布など、RNA-seq実験を総合的に評価するパッケージ。SAMとBAMを入力として利用できる。可視化にはRを利用している。

 

マニュアル

http://rseqc.sourceforge.net

 

インストール

依存

  • gcc
  • python2.7
  • numpy
  • R
#bioconda(link)
mamba create -n RSeQC -y
conda activate RSeQC
mamba install -c bioconda -y rseqc -y

#pip
pip install RSeQC

bam2fq.py -h 

$ bam2fq.py -h

Usage: bam2fq.py [options]

 

Description: Convert alignments in BAM or SAM format into fastq format.

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output fastq files(s).

  -s, --single-end      Specificy '-s' or '--single-end' for single-end

                        sequencing.

  -c, --compress        Specificy '-c' or '--compress' to compress output

                        fastq file(s) using 'gzip' command.

bam2wig.py -h

$ bam2wig.py -h

Usage: bam2wig.py [options]

 

Convert BAM file into wig file. BAM file must be sorted and indexed using SAMtools.

Note: SAM format file is not supported.

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM format. BAM file must be sorted

                        and indexed using samTools. .bam and .bai files should

                        be placed in the same directory.

  -s CHROMSIZE, --chromSize=CHROMSIZE

                        Chromosome size file. Tab or space separated text file

                        with 2 columns: first column is chromosome name/ID,

                        second column is chromosome size. Chromosome name

                        (such as "chr1") should be consistent between this

                        file and the BAM file.

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output wiggle files(s). One wiggle file will

                        be generated for non strand-specific data, two wiggle

                        files ("Prefix_Forward.wig" and "Prefix_Reverse.wig")

                        will be generated for strand-specific RNA-seq data.

  -t TOTAL_WIGSUM, --wigsum=TOTAL_WIGSUM

                        Specified wigsum. Eg: 1,000,000,000 equals to coverage

                        of 10 million 100nt reads. Ignore this option to

                        disable normalization

  -u, --skip-multi-hits

                        Skip non-unique hit reads.

  -d STRAND_RULE, --strand=STRAND_RULE

                        How read(s) were stranded during sequencing. For

                        example: --strand='1++,1--,2+-,2-+' means that this is

                        a pair-end, strand-specific RNA-seq data, and the

                        strand rule is: read1 mapped to '+' => parental gene

                        on '+'; read1 mapped to '-' => parental gene on '-';

                        read2 mapped to '+' => parental gene on '-'; read2

                        mapped to '-' => parental gene on '+'.  If you are not

                        sure about the strand rule, run 'infer_experiment.py'

                        default=none (Not a strand specific RNA-seq data).

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality to determine "uniquely

                        mapped". default=30

bam_stat.py -h

$ bam_stat.py -h

Usage: bam_stat.py [options]

 

Summarizing mapping statistics of a BAM or SAM file. 

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) to determine

                        "uniquely mapped" reads. default=30

mismatch_profile.py -h

$ mismatch_profile.py -h

Usage: mismatch_profile.py [options]

 

Calculate the distribution of mismatches across reads. Note that the "MD" tag must exist in BAM file.

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_BAM, --input=INPUT_BAM

                        Input BAM file. [required]

  -l READ_ALIGNMENT_LENGTH, --read-align-length=READ_ALIGNMENT_LENGTH

                        Alignment length of read. It is usually set to the

                        orignial read length. For example, all these cigar

                        strings ("101M", "68M140N33M", "53M1D48M") suggest the

                        read alignment length is 101. [required]

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s). [required]

  -n READ_NUMBER, --read-num=READ_NUMBER

                        Number of aligned reads with mismatches used to

                        calculate the mismatch profile. default=1000000

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality. default=30

clipping_profile.py -h

$ clipping_profile.py -h

Usage: clipping_profile.py [options]

 

This program is used estimate clipping profile of RNA-seq reads from BAM or SAM file

Note that to use this funciton, CIGAR strings within SAM/BAM file should have 'S' operation 

(This means your reads mapper should support clipped mapping).

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s).

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be considered as "uniquely mapped".

                        default=30

  -s LAYOUT, --sequencing=LAYOUT

                        Sequencing layout. "SE"(single-end) or "PE"(pair-end).

deletion_profile.py -h

$ deletion_profile.py -h

Usage: deletion_profile.py [options]

 

Calculate the distributions of deleted nucleotides across reads. 

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_BAM, --input=INPUT_BAM

                        Input BAM file. [required]

  -l READ_ALIGNMENT_LENGTH, --read-align-length=READ_ALIGNMENT_LENGTH

                        Alignment length of read. It is usually set to the

                        orignial read length. For example, all these cigar

                        strings ("101M", "68M140N33M", "53M1D48M") suggest the

                        read alignment length is 101. [required]

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s). [required]

  -n READ_NUMBER, --read-num=READ_NUMBER

                        Number of aligned reads with deletions used to

                        calculate the deletion profile. default=1000000

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality. default=30

insertion_profile.py -h

$ insertion_profile.py -h

Usage: insertion_profile.py [options]

 

Calculate the distributions of inserted nucleotides across reads

Note CIGAR strings within SAM/BAM file should have 'I' operation 

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s).

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be considered as "uniquely mapped".

                        default=30

  -s LAYOUT, --sequencing=LAYOUT

                        Sequencing layout. "SE"(single-end) or "PE"(pair-end).

(p

divide_bam.py -h

$ divide_bam.py -h

Usage: divide_bam.py [options]

 

Equally divide BAM file (m alignments) into n parts. Each part contains roughly m/n alignments

that are randomly sampled from total alignments. 

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM format. BAM file should be

                        sorted and indexed.

  -n SUBSET_NUM, --subset-num=SUBSET_NUM

                        Number of small BAM files

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output BAM files. Output "Prefix_num.bam".

  -s, --skip-unmap      Skip unmapped reads.

geneBody_coverage.py -h

$ geneBody_coverage.py -h

Usage: geneBody_coverage.py [options]

 

Calculate the RNA-seq reads coverage over gene body. 

 

Note:

1) Only input sorted and indexed BAM file(s). SAM format is not supported.

2) Genes/transcripts with mRNA length < 100 will be skipped (Number specified to "-l" cannot be < 100). 

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILES, --input=INPUT_FILES

                        Input file(s) in BAM format. "-i" takes these input:

                        1) a single BAM file. 2) "," separated BAM files. 3)

                        directory containing one or more bam files. 4) plain

                        text file containing the path of one or more bam file

                        (Each row is a BAM file path). All BAM files should be

                        sorted and indexed using samtools.

  -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL

                        Reference gene model in bed format. [required]

  -l MIN_MRNA_LENGTH, --minimum_length=MIN_MRNA_LENGTH

                        Minimum mRNA length (bp). mRNA smaller than

                        "min_mRNA_length" will be skipped. default=100

  -f OUTPUT_FORMAT, --format=OUTPUT_FORMAT

                        Output file format, 'pdf', 'png' or 'jpeg'.

                        default=pdf

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s). [required]

inner_distance.py -h

$ inner_distance.py -h

Usage: inner_distance.py [options]

 

Calculate the inner distance (insert size)  of RNA-seq fragments. 

 

               RNA fragment

 _________________||_________________

|                                    |

|                                    |

||||||||||------------------||||||||||

  read_1      insert_size     read_2

 

fragment size = read_1 + insert_size + read_2

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s)

  -r REF_GENE, --refgene=REF_GENE

                        Reference gene model in BED format.

  -k SAMPLESIZE, --sample-size=SAMPLESIZE

                        Number of read-pairs used to estimate inner distance.

                        default=1000000

  -l LOWER_BOUND_SIZE, --lower-bound=LOWER_BOUND_SIZE

                        Lower bound of inner distance (bp). This option is

                        used for ploting histograme. default=-250

  -u UPPER_BOUND_SIZE, --upper-bound=UPPER_BOUND_SIZE

                        Upper bound of inner distance (bp). This option is

                        used for plotting histogram. default=250

  -s STEP_SIZE, --step=STEP_SIZE

                        Step size (bp) of histograme. This option is used for

                        plotting histogram. default=5

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be called "uniquely mapped". default=30

junction_annotation.py -h

$ junction_annotation.py -h

Usage: junction_annotation.py [options]

 

Annotate splicing reads against gene model in two levels: reads level and  juncion level.

Note:

1) A read, especially long read, can be spliced 2 or more times

2) Multiple splicing reads spanning the same intron can be consolidated into one splicing junction.

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL

                        Reference gene model in bed format. This file is

                        better to be a pooled gene model as it will be used to

                        annotate splicing junctions [required]

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s). [required]

  -m MIN_INTRON, --min-intron=MIN_INTRON

                        Minimum intron length (bp). default=50 [optional]

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be considered as "uniquely mapped".

                        default=30

junction_saturation.py -h

$ junction_saturation.py -h

Usage: junction_saturation.py [options]

 

Check if sequencing depth is saturated or not, based on the idea that when sequencing depth is 

approaching saturation, less NEW junctions will be detected. 

See http://rseqc.sourceforge.net/ for details.

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.[required]

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s). [required]

  -r REFGENE_BED, --refgene=REFGENE_BED

                        Reference gene model in bed fomat. This gene model is

                        used to determine known splicing junctions. [required]

  -l PERCENTILE_LOW_BOUND, --percentile-floor=PERCENTILE_LOW_BOUND

                        Sampling starts from this percentile. A integer

                        between 0 and 100. default=5

  -u PERCENTILE_UP_BOUND, --percentile-ceiling=PERCENTILE_UP_BOUND

                        Sampling ends at this percentile. A integer between 0

                        and 100. default=100

  -s PERCENTILE_STEP, --percentile-step=PERCENTILE_STEP

                        Sampling frequency. Smaller value means more sampling

                        times. A integer between 0 and 100. default=5

  -m MINIMUM_INTRON_SIZE, --min-intron=MINIMUM_INTRON_SIZE

                        Minimum intron size (bp). default=50

  -v MINIMUM_SPLICE_READ, --min-coverage=MINIMUM_SPLICE_READ

                        Minimum number of supportting reads to call a

                        junction. default=1

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be called "uniquely mapped". default=30

> read_distribution.py -h

$ read_distribution.py -h

Usage: read_distribution.py [options]

 

Check reads distribution over exon, intron, UTR, intergenic ... etc

The following reads will be skipped:

qc_failed

PCR duplicate

Unmapped

Non-primary (or secondary)

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL

                        Reference gene model in bed format.

> read_duplication.py -h

$ read_duplication.py -h

Usage: read_duplication.py [options]

 

Calculte reads' duplication rate. 

Sequence-based: Reads with identical sequence are considered as "duplicate reads".

Mapping-based: Reads mapped to the exact same location are considered as "duplicate reads".

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s).

  -u UPPER_LIMIT, --up-limit=UPPER_LIMIT

                        Upper limit of reads' occurrence. Only used for

                        plotting, default=500 (times)

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be considered as "uniquely mapped".

                        default=30

read_GC.py -h

$ read_GC.py -h

Usage: read_GC.py [options]

-------------------------------------------------------------------------------------------------

Calculate distribution of reads' GC content

-------------------------------------------------------------------------------------------------

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format.

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s).

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be called "uniquely mapped". default=30

read_NVC.py -h

$ read_NVC.py -h

Usage: read_NVC.py [options]

-------------------------------------------------------------------------------------------------

For each nucleotide  position of read (5'->3'), check the nucleotide frequency. The generated R script will

gives NVC (Nucleotide Versus Cycle) plot.

-------------------------------------------------------------------------------------------------

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Input file in BAM or SAM format.[required]

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s). [required]

  -x, --nx              Flag option. Presense of this flag tells program to

                        include N,X in output NVC plot [required]

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be called "uniquely mapped". default=30

read_quality.py -h

$ read_quality.py -h

Usage: read_quality.py [options]

-------------------------------------------------------------------------------------------------

Calculating Phred Quality Score for each position on read. Note that each read should have 

the fixed (same) length

-------------------------------------------------------------------------------------------------

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format. [required]

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output files(s). [required]

  -r REDUCE_FOLD, --reduce=REDUCE_FOLD

                        To avoid making huge vector in R, nucleotide with

                        particular phred score less frequent than this number

                        will be ignored. Increase this number save more memory

                        while reduce precision. Set to 1 achieves maximum

                        precision (i.e. every nucleotide will be considered).

                        This option only applies to the 'boxplot'. default=1

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be called "uniquely mapped". default=30

RNA_fragment_size.py -h

$ RNA_fragment_size.py -h

Usage: RNA_fragment_size.py [options]

 

calculate fragment size for each gene/transcript. For each transcript/gene, it Will report:

1) # of fragment that was used.

2) mean of fragment size

3) median of fragment size

4) stdev of fragment size

 

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input=INPUT_FILE

                        Input BAM file

  -r REFGENE_BED, --refgene=REFGENE_BED

                        Reference gene model in BED format. Must be strandard

                        12-column BED file. [required]

  -q MAP_QUAL, --mapq=MAP_QUAL

                        Minimum mapping quality (phred scaled) for an

                        alignment to be called "uniquely mapped". default=30

  -n FRAGMENT_NUM, --frag-num=FRAGMENT_NUM

                        Minimum number of fragment. default=3

>split_paired_bam.py -h

$ split_paired_bam.py -h

Usage: split_paired_bam.py [options]

-------------------------------------------------------------------------------------------------

Split bam file (pair-end) into 2 single-end bam file

-------------------------------------------------------------------------------------------------

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILE, --input-file=INPUT_FILE

                        Alignment file in BAM or SAM format. BAM file should

                        be sorted and indexed

  -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX

                        Prefix of output BAM files. "prefix.R1.bam" file

                        contains the 1st read, "prefix.R2.bam" file contains

                        the 2nd read

>tin.py

$ tin.py

Usage: tin.py [options]

-------------------------------------------------------------------------------------------------

This program calculates transcript integrity number (TIN) for each transcript (or gene) in

BED file. TIN is conceptually similar to RIN (RNA integrity number) but provides transcript

level measurement of RNA quality and is more sensitive to measure low quality RNA samples:

 

1) TIN score of a transcript is used to measure the RNA integrity of the transcript.

2) Median TIN score across all transcripts can be used to measure RNA integrity  of that

   "RNA sample".

3) TIN ranges from 0 (the worst) to 100 (the best). TIN = 60 means: 60% of the transcript

   has been covered if the reads coverage were uniform.

4) TIN will be assigned to 0 if the transcript has no coverage or covered reads is fewer than

   cutoff.

-------------------------------------------------------------------------------------------------

 

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -i INPUT_FILES, --input=INPUT_FILES

                        Input BAM file(s). "-i" takes these input: 1) a single

                        BAM file. 2) "," separated BAM files (no spaces

                        allowed). 3) directory containing one or more bam

                        files. 4) plain text file containing the path of one

                        or more bam files (Each row is a BAM file path). All

                        BAM files should be sorted and indexed using samtools.

                        [required]

  -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL

                        Reference gene model in BED format. Must be strandard

                        12-column BED file. [required]

  -c MINIMUM_COVERAGE, --minCov=MINIMUM_COVERAGE

                        Minimum number of read mapped to a transcript.

                        default=10

  -n SAMPLE_SIZE, --sample-size=SAMPLE_SIZE

                        Number of equal-spaced nucleotide positions picked

                        from mRNA. Note: if this number is larger than the

                        length of mRNA (L), it will be halved until it's

                        smaller than L. default=100

  -s, --subtract-background

                        Subtract background noise (estimated from intronic

                        reads). Only use this option if there are substantial

                        intronic reads.

 

ラン

マニュアルからテストのbamやbed,gtfなどをダウンロードできる。ここではこのデータを使って動作を確認する。

 

ペアエンドのbamからfastqに変換。

bam2fq.py -i test_PairedEnd_StrandSpecific_hg19.sam -o bam2fq_out1
  • -i    INPUT_FILE
  • -o   OUTPUT_PREFIX
  • -s   single-end for single-end sequencing. 

シングルエンドは-sをつける。

 

bamからwiggleフォーマット(UCSC解説)に変換。

UCSCのwigToBigWigコマンドが見つかれば、出力は自動でbigwigフォーマット(UCSC解説)に変換される。オーサーが準備したscript (fetchChromSizes) をマニュアルに従い準備しておく。

#bamはsamtools sortでcoordinate sortされている必要がある。
samtools sort -m 1000000000 input.bam input.sorted.bam
samtools index input.sorted.bam

#fetchChromSizesでスクリプトでサイズを計算
fetchChromSizes hg19 >hg19.chrom.sizes
bam2wig.py -s hg19.chrom.sizes -i sample.bam -o out -u
  • -s    Chromosome size file. Tab or space separated text file with 2 columns: first column is chromosome name/ID, second column is chromosome size. 
  • -u    Skip non-unique hit reads.
  • -t    Specified wigsum. Eg: 1,000,000,000 equals to coverage of 10 million 100nt reads. Ignore this option to disable normalization
  • -q    Minimum mapping quality for an alignment to be called “uniquely mapped”. default=30
  • -d   How read(s) were stranded during sequencing. For example: –strand=‘1++,1–,2+-,2-+’ means that this is a pair-end, strand-specific RNA-seq data, and the strand rule is: read1 mapped to ‘+’ => parental gene on ‘+’; read1 mapped to ‘-‘ => parental gene on ‘-‘; read2 mapped to ‘+’ => parental gene on ‘-‘; read2 mapped to ‘-‘ => parental gene on ‘+’. If you are not sure about the strand rule, run ‘infer_experiment.py’ default=none (Not a strand specific RNA-seq data).

 

bam/samのstatistics出力。

bam_stat.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam 
  • -q    Minimum mapping quality to determine uniquely mapped read.

出力。

$ bam_stat.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam

Load BAM file ...  Done

 

#==================================================

#All numbers are READ count

#==================================================

 

Total records:                          44826454

 

QC failed:                              0

Optical/PCR duplicate:                  0

Non primary hits                        0

Unmapped reads:                         0

mapq < mapq_cut (non-unique):           0

 

mapq >= mapq_cut (unique):              44826454

Read-1:                                 28674562

Read-2:                                 16151892

Reads map to '+':                       23884437

Reads map to '-':                       20942017

Non-splice reads:                       39959172

Splice reads:                           4867282

Reads mapped in proper pairs:           31803590

Proper-paired reads map to different chrom:0

 

 

リードのクリップアライメントの分布。

(clipped mappingが可能なアライナーを使っている必要がある)。

clipping_profile.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -s "PE" -o out
  • -s    Sequencing layout. “SE”(single-end) or “PE”(pair-end).
  • -q    Minimum mapping quality (phred scaled) for an alignment to be considered as “uniquely mapped”. default=30

リード1の出力。

f:id:kazumaxneo:20180227225932j:plain

リード2も出力される。

 

 リードのmismatch分布。

mismatch_profile.py -l 101 -i sample.bam -o out
  • -l    Alignment length of read. It is usually set to the orignial read length. For example, all these cigar strings (“101M”, “68M140N33M”, “53M1D48M”) suggest the read alignment length is 101. [required]

 

リードのdeletion分布。

deletion_profile.py -i sample.bam -l 101 -o out
  • -l     Alignment length of read. It is usually set to the orignial read length.

 

 リードのinsertion分布。

insertion_profile.py -s "PE" -i test.bam -o out 
  • -s    Sequencing layout. “SE”(single-end) or “PE”(pair-end).

f:id:kazumaxneo:20180227234729j:plain

 

ランダムにリードをサンプリングしてbamをn個に分割

(ランダムだが均一なリード数になる)。

divide_bam.py -n 3 -i SingleEnd_StrandSpecific_50mer_Human_hg19.bam -o output
  • -n    Number of small BAM files

 

遺伝子上のカバレッジ分布

geneBody_coverage.py -r hg19.housekeeping.bed -i test.bam -o output
  • -r    Reference gene model in bed format. [required]

複数bamなら-i test1.bam,test2.bam,test3.bam 、またはbamのディレクトリを-i /data/ という風に指定する。bigwigから計算するgeneBody_coverage2.pyもある。

f:id:kazumaxneo:20180227233520j:plain

 

 

ペアエンドの内側のインサートサイズを推測。

f:id:kazumaxneo:20180227234231j:plain

inner_distance.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output -r hg19.refseq.bed12

f:id:kazumaxneo:20180301124426j:plain

 

スプライシングジャンクションをリファレンスと比較し、比率を返す。

junction_annotation.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output -r hg19.refseq.bed12
  • f:id:kazumaxneo:20180228200527j:plain

f:id:kazumaxneo:20180227235016j:plain

 

 

レアなalternative splicingを探すためにデプスが十分か推測。

junction_saturation.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -r hg19.refseq.bed12 -o output
  • -r    Reference gene model in BED format.

f:id:kazumaxneo:20180227235353j:plain

 

feature上のリードの分布

read_distribution.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -r hg19.refseq.bed12

出力の定義はマニュアルで確認してください。

f:id:kazumaxneo:20180301124530j:plain

 

duplicateの分布。

read_duplication.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output

 

 GC含量の分布

read_GC.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output

  

f:id:kazumaxneo:20180228202855j:plain

リードの各部位の塩基の分布

read_NVC.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output

 

クオリティの分布

read_quality.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output_quality

f:id:kazumaxneo:20180301001942j:plain

f:id:kazumaxneo:20180301001944j:plain

 

各transcriptsごとのリードサイズ

RNA_fragment_size.py -r hg19.RefSeq.union.bed -i SRR873822_RIN10.bam > SRR873822_RIN10.fragSize

 

bamの分割(rRNA除去など)

split_bam.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -r hg19.rRNA.bed -o output

 以下の3種類に分割される。rRNAをbedで与えれば、rRNAにマッピングされなかったリードをbamのまま抽出できる。

  1. XXX.in.bam: reads that are mapped to exon regions of the gene list (or reads consumed by gene list).
  2. XXX.ex.bam: reads that cannot be mapped the exon regions of the original gene list.
  3. XXX.junk.bam: qcfailed reads or unmapped reads.

 

ペアエンドbamからシングルエンドのbam2つに分割。

split_paired_bam.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output

 

TIN値からサンプルRNAの完全さを評価。

tin.py -r hg19.RefSeq.union.bed -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output

RIN値(1≦RIN≦10)は18Sおよび28SリボソームRNAの量に大きく依存したリボソームRNA完全性の尺度で、mRNA品質の直接測定ではない。実験ではリボゼロなどでrRNAを除いているため、下流のフローでRNAの完全さを分析できない。実際の状況は、「AU-リッチ配列」、「転写長」、「GC含量」、「二次構造」および「RNA-タンパク質複合体」などの因子に応じて、分解速度は転写物間で著しく異なることがある(ref:http://www.illumina.com/documents/products/technotes/technote-truseq-rna-access.pdf)。 これらの限界を克服するために、転写レベルでRNAの完全性を測定できるアルゴリズムTIN(リンク)が考案された。 TINは、発現された各転写物のスコア(0≦TIN≦100)を計算する。また全transcriptsの平均のmedTINはサンプル間のRNA完全性を比較するために使用できる。

 

 

ほかにも定量してFPKMを出すFPKM_count.pyや、RNA seqの種類を推測するinfer_experiment.py (strand specificかなどをマッピングから推測)、bigwigをノーマライズするnormalize_bigwig.py、2つのbigwigファイルを操作するoverlay_bigwig.py、ヘキサマーの頻度をリファレンスとリードで調べるread_hexamer.pyがある。

 

引用

RSeQC: quality control of RNA-seq experiments.

Wang L, Wang S, Li W.

Bioinformatics. 2012 Aug 15;28(16):2184-5.