RASER: a high-precision RNA-seq aligner suited to SNV and RNA editing analysis

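When RNA sequencing data are aligned to a genome, reads have to be aligned across introns, so aligners that can split a read are used (eukaryotic RNA-seq). Because an intron can be tens of kb long, the maximum gap allowed for a split alignment is usually set to a very large default. That limit is often chosen arbitrarily, however, and unexpected alignments can produce discordant read pairs. Spanning introns also makes aligners error-prone near splice junctions, for example when the exon-intron boundary sequence is ambiguous and a sequence happens to match by chance. The fact that false calls tend to appear near splice junctions when RNA editing sites are detected is consistent with this. One countermeasure is to align against a predefined database of splice sites, but that approach copes poorly with splice variants, so RNA alignment still has open problems today.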
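To make the intron-length point concrete, here is a minimal Python sketch that reads the skipped regions (the `N` operations) out of a SAM CIGAR string and flags gaps above a chosen limit. The function names and the example CIGAR string are made up for illustration; the 200,000 bp limit is simply the default of RASER's `-l` option shown in the help output further down.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def intron_gaps(cigar):
    """Return the lengths of all N (skipped region) operations in a CIGAR string."""
    return [int(length) for length, op in CIGAR_OP.findall(cigar) if op == "N"]


def exceeds_max_intron(cigar, max_intron=200_000):
    """True if any skipped region is longer than max_intron bases."""
    return any(gap > max_intron for gap in intron_gaps(cigar))


if __name__ == "__main__":
    example = "40M15000N35M1S"  # hypothetical spliced read with one 15 kb gap
    print(intron_gaps(example))
    print(exceeds_max_intron(example))
```

A check like this can help when inspecting alignments from any spliced aligner: unusually long `N` gaps are a common place to start looking for spurious split alignments.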

RASER is an RNA aligner with higher precision than competitors such as STAR and GSNAP. It is also reported to detect RNA editing sites with higher sensitivity than those competitors.

Official page (manual included)

https://www.ibp.ucla.edu/research/xiao/RASER.html

Installation

Download the binary from the official site, make it executable, and add it to your PATH (macOS is not supported).

> raser  # help

Usage:
raser [-x index] {-i read | -i1 read_end1 -i2 read_end2} [-o result] [options]*

index       index file. If not specified, RASER finds 'a.ridx' in the current directory
read        read file in fastq or fasta format (single end)
read_end1   read end1 file in fastq or fasta format (paired end)
read_end2   read end2 file in fastq or fasta format (paired end)
result      result file in standard sam format. Default = a.sam

Options:
-m <float>  maximum ratio of mismatches (including indels). Default = 0.05 (should be <= 0.3)
-g          doesn't map spliced reads. Defaultly not set.
-p <int>    reports <int> number of good alignments.
            Set 0 to get all good alignments.
            Set 1 to get best alignment. Default = 0
-l <int>    maximum length of insertion (usually equal to maximum intron length).
            Default = 200,000
-v          prints version information and exit.
-h          prints this usage message.

Advanced options:
-d <float>  applies double filtering. Double filtering is mapping scheme that a read (pair)
            should be uniquely mapped with less than -m <float> threshold, and not mapped to
            anywhere with more than -d <float> threshold (refer to Bahn et al., 2012).
            It aims to predict SNP or RNA editing sites accurately by minimizing false
            positive mappings. Recommended value is 0.09, and should be in (m, 0.3).
            If -d is set, RASER reports best(unique) result (ignoring -p <int>).
-b <float>  finds the obiviously best mapping of a read (pair), whose mismatch score is less
            than -m <float> and less than mismatch scores of all other mappings by -b <float>.
            It aims to maximize the mapping rate and minimize the false positive mappings.
            Recommanded value is 0.03, and should be in (0, 0.2) (refer to Ahn et al., 2015).
            If -b is set, RASER reports best(unique) result (ignoring -p <int>).
--sanger    if read quality score is based on Illumina v1.7 or less, this option will convert
            quality them to Sanger or Illumina v1.8
-s <float>  allows <float> * read_length of soft-clipping. Default = 0.25
--idop <int>  indel open penalty. Default = 3
--idep <int>  indel extend penalty. Default = 1
-t <int>    number of thread. Default = 8 (use 1 for no multi-processing)
--nxm       don't add XM/XJ fields to sam files. XM/XJ fields include mismatch information
            and mapping range information. Defaultly not set.
--nxt       don't add XT fields to sam files. TI includes the names of transcripts to which
            a read (pair) was mapped, if a transcriptome was indexed. Defaultly not set.

> raserIdx  # help (the command that builds the index)

Usage:
raserIdx -r {reference_files} [-l mRNA_table_file] [-p gene_name_prefix] [-x index_file] [options]

reference_files   reference sequence (fastq format) files
mRNA_table_file   mRNA annotation table from ucsc genome browser. Refer to
                  the manual for the detailed format of this file
gene_name_prefix  if mRNA names in mRNA_table_file and reference_file are different,
                  should set gene_name_prefix. For example, mRNA names in
                  mRNA_table_file and reference_file are ENSMUST00000086465 and
                  mm10_ensGene_ENSMUST00000086465 respectively, gene_name_prefix
                  should be set as 'ENSMUST'.
index_file        index file. Default name is a.ridx

options:
-s <int>  window size. range = 3~10. Default = 8
-k <int>  window overlap size. range = 1~6. Default = 4
-v        print version information and quit
-h        print help message and quit

Running

Building a genome index

raserIdx -r input.fa -x INDEX_NAME

Building an index from RNA sequences only

raserIdx -r mRNA.fa -l ANNOTATION_FILE -x INDEX_NAME

The annotation file can be downloaded from UCSC. See the manual on the official site for details.

Mapping

raser -x index -i1 pair1.fq -i2 pair2.fq -o result
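When many samples have to be mapped, it can be convenient to assemble the command line in a script. The following is a minimal sketch, not part of RASER itself: `build_raser_command` and its parameter names are hypothetical, and it only uses flags that appear in the help output above (`-x`, `-i`, `-i1`, `-i2`, `-o`, `-m`, `-d`, `-b`, `-t`). It does not check whether a given combination of options is sensible for RASER.

```python
import shlex


def build_raser_command(index, read1, read2=None, output="a.sam",
                        max_mismatch=None, double_filter=None,
                        best_margin=None, threads=None):
    """Assemble a raser argument list (hypothetical helper, flags taken from the help text)."""
    cmd = ["raser", "-x", index]
    if read2 is None:
        cmd += ["-i", read1]                  # single-end input
    else:
        cmd += ["-i1", read1, "-i2", read2]   # paired-end input
    cmd += ["-o", output]
    # Optional flags are added only when a value was given,
    # so RASER's own defaults apply otherwise.
    if max_mismatch is not None:
        cmd += ["-m", str(max_mismatch)]
    if double_filter is not None:
        cmd += ["-d", str(double_filter)]
    if best_margin is not None:
        cmd += ["-b", str(best_margin)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd


if __name__ == "__main__":
    cmd = build_raser_command("INDEX_NAME", "pair1.fq", "pair2.fq",
                              output="result.sam", best_margin=0.03)
    print(shlex.join(cmd))
    # To actually run it (requires raser on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps file names with spaces intact when the command is later passed to `subprocess.run`.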

-m <float>  Maximum ratio of mismatches (including indels). Default: 0.05
-l <int>  Maximum length of insertion (usually equal to the maximum intron length). Default: 200000
-d <float>  Set this to apply double filtering. Double filtering is a mapping scheme in which a read (pair) should be uniquely mapped with less than the -m <float> threshold, and not mapped anywhere with more than the -d <float> threshold (refer to Bahn et al., 2012). It aims to predict SNP or RNA editing sites accurately by minimizing false positives. The recommended value is 0.09, and it should be in (m, 0.3). If -d is set, RASER reports the best (unique) result (ignoring -p <int>). Default: not set
-b <float>  (Highly recommended for SNP or editing site detection.) Set this to find the obviously best mapping of a read (pair), whose mismatch score is less than -m <float> and also less than the mismatch scores of all other mappings by -b <float>. It aims to maximize the mapping rate as well as to minimize the false positive rate. The recommended value is 0.03, and it should be in (0, 0.2) (refer to Ahn et al., 2015). If -b is set, RASER reports the best (unique) result (ignoring -p <int>). Default: not set
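The difference between -d and -b is easier to see with numbers. The sketch below is only one reading of the option descriptions above, not RASER's actual implementation. It assumes each candidate alignment of a read is summarized by a mismatch ratio, that double filtering keeps a read only if exactly one candidate falls below m and no further candidate appears when the limit is relaxed to d, and that the -b rule requires the runner-up to be worse than the best candidate by at least b. The function names and the strict or non-strict comparisons are assumptions.

```python
def double_filter(ratios, m=0.05, d=0.09):
    """Assumed -d behaviour: one hit below m, and no additional hit below d."""
    strict = [r for r in ratios if r < m]
    relaxed = [r for r in ratios if r < d]
    if len(strict) == 1 and len(relaxed) == 1:
        return strict[0]
    return None  # read is discarded


def obviously_best(ratios, m=0.05, b=0.03):
    """Assumed -b behaviour: best hit below m and ahead of the runner-up by at least b."""
    if not ratios:
        return None
    ordered = sorted(ratios)
    best = ordered[0]
    if best >= m:
        return None
    if len(ordered) > 1 and ordered[1] - best < b:
        return None  # runner-up is too close, mapping is ambiguous
    return best


if __name__ == "__main__":
    # Hypothetical read with two candidate locations.
    candidates = [0.01, 0.07]
    print(double_filter(candidates))   # None: a second hit exists below d = 0.09
    print(obviously_best(candidates))  # 0.01: the runner-up is clearly worse
```

Under these assumptions the read in the example is dropped by double filtering but kept by the -b rule, which fits the stated aim of -b to raise the mapping rate while still rejecting ambiguous mappings.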