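When RNA sequencing data are aligned to a genome, reads have to be aligned across introns, so aligners that can split a read are used (eukaryotic RNA-seq). Because an intron can span tens of kb, the maximum gap allowed for a split alignment usually defaults to a large value. That limit is often chosen arbitrarily, however, and unexpected alignments can produce discordant read pairs. In addition, because reads span introns, errors are common near splice junctions, for example when the sequence at the exon-intron boundary is ambiguous or when a sequence matches by chance. This is supported by the observation that false calls tend to appear near splice junctions when RNA editing sites are detected. One countermeasure is to align against a predefined database of splice sites, but this approach handles splice variants poorly, and RNA alignment still has open problems today.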
RASER is an RNA aligner with higher precision than competitors such as STAR and GSNAP. It is also reported to detect RNA editing sites with higher sensitivity than those competitors.
Official page (includes the manual)
https://www.ibp.ucla.edu/research/xiao/RASER.html
Installation
Download the binaries from the official site, make them executable, and add them to your PATH (macOS is not supported).
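The installation step above can be sketched in shell as follows. The directory `$HOME/tools/raser` is only a placeholder for wherever the downloaded files were saved; the binary names `raser` and `raserIdx` are the two commands used later in this post.

```shell
# Placeholder location of the downloaded binaries (an assumption; adjust as needed).
RASER_DIR="${RASER_DIR:-$HOME/tools/raser}"

# Make each binary executable if it is present in that directory.
for bin in raser raserIdx; do
  if [ -f "$RASER_DIR/$bin" ]; then
    chmod +x "$RASER_DIR/$bin"
  fi
done

# Prepend the directory to PATH unless it is already listed.
case ":$PATH:" in
  *":$RASER_DIR:"*) ;;
  *) export PATH="$RASER_DIR:$PATH" ;;
esac
```

To make the change permanent, the `export PATH` line would go into the shell's startup file. Afterwards, `raser -h` should print the usage message shown below.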
> raser # help
Usage:
raser [-x index] {-i read | -i1 read_end1 -i2 read_end2} [-o result] [options]*
index index file. If not specified, RASER finds 'a.ridx' in the current directory
read read file in fastq or fasta format (single end)
read_end1 read end1 file in fastq or fasta format (paired end)
read_end2 read end2 file in fastq or fasta format (paired end)
result result file in standard sam format. Default = a.sam
Options:
-m <float> maximum ratio of mismatches (including indels). Default = 0.05 (should be <= 0.3)
-g doesn't map spliced reads. Defaultly not set.
-p <int> reports <int> number of good alignments.
Set 0 to get all good alignments.
Set 1 to get best alignment. Default = 0
-l <int> maximum length of insertion (usually equal to maximum intron length).
Default = 200,000
-v prints version information and exit.
-h prints this usage message.
Advanced options:
-d <float> applies double filtering. Double filtering is mapping scheme that a read (pair)
should be uniquely mapped with less than -m <float> threshold, and not mapped to
anywhere with more than -d <float> threshold (refer to Bahn et al., 2012).
It aims to predict SNP or RNA editing sites accurately by minimizing false
positive mappings. Recommended value is 0.09, and should be in (m, 0.3).
If -d is set, RASER reports best(unique) result (ignoring -p <int>).
-b <float> finds the obiviously best mapping of a read (pair), whose mismatch score is less
than -m <float> and less than mismatch scores of all other mappings by -b <float>.
It aims to maximize the mapping rate and minimize the false positive mappings.
Recommanded value is 0.03, and should be in (0, 0.2) (refer to Ahn et al., 2015).
If -b is set, RASER reports best(unique) result (ignoring -p <int>).
--sanger if read quality score is based on Illumina v1.7 or less, this option will convert
quality them to Sanger or Illumina v1.8
-s <float> allows <float> * read_length of soft-clipping. Default = 0.25
--idop <int> indel open penalty. Default = 3
--idep <int> indel extend penalty. Default = 1
-t <int> number of thread. Default = 8 (use 1 for no multi-processing)
--nxm don't add XM/XJ fields to sam files. XM/XJ fields include mismatch information
and mapping range information. Defaultly not set.
--nxt don't add XT fields to sam files. TI includes the names of transcripts to which
a read (pair) was mapped, if a transcriptome was indexed. Defaultly not set.
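Since `-m` and `-s` are given as ratios of the read length, it can help to convert them into whole bases for the reads at hand. The sketch below does that arithmetic with the default values from the help text (0.05 and 0.25). The function name `raser_budget` is made up for this post, and rounding down is an assumption: the help text does not say how RASER rounds.

```shell
# raser_budget READ_LENGTH [M_RATIO] [S_RATIO]
# Prints how many mismatched bases (-m) and soft-clipped bases (-s) a read of
# the given length could carry, assuming the ratios are rounded down.
raser_budget() {
  awk -v len="$1" -v m="${2:-0.05}" -v s="${3:-0.25}" 'BEGIN {
    # The small epsilon guards against floating-point results such as 4.9999999.
    printf "max_mismatches=%d max_softclip=%d\n", int(len * m + 1e-9), int(len * s + 1e-9)
  }'
}

raser_budget 100   # prints: max_mismatches=5 max_softclip=25
```

With the defaults, a 100-base read may therefore carry up to 5 mismatches (indels included) and up to 25 soft-clipped bases.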
> raserIdx # help (the command that builds the index)
Usage:
raserIdx -r {reference_files} [-l mRNA_table_file] [-p gene_name_prefix] [-x index_file] [options]
reference_files reference sequence (fastq format) files
mRNA_table_file mRNA annotation table from ucsc genome browser. Refer to
the manual for the detailed format of this file
gene_name_prefix if mRNA names in mRNA_table_file and reference_file are different,
should set gene_name_prefix. For example, mRNA names in
mRNA_table_file and reference_file are ENSMUST00000086465 and
mm10_ensGene_ENSMUST00000086465 respectively, gene_name_prefix
should be set as 'ENSMUST'.
index_file index file. Default name is a.ridx
options:
-s <int> window size. range = 3~10. Default = 8
-k <int> window overlap size. range = 1~6. Default = 4
-v print version information and quit
-h print help message and quit
Run
Build a genome index
raserIdx -r input.fa -x INDEX_NAME
Build an index from RNA sequences only
raserIdx -r mRNA.fa -l ANNOTATION_FILE -x INDEX_NAME
The annotation file can be downloaded from UCSC. See the manual on the official site for details.
Align paired-end reads
raser -x index -i1 pair1.fq -i2 pair2.fq -o result
- -m <float> Maximum ratio of mismatches (including indels). Default: 0.05
- -l <int> Maximum length of insertion (usually equal to the maximum intron length). Default: 200000
- -d <float> Set this to apply double filtering. Double filtering is a mapping scheme in which a read (pair) should be uniquely mapped with less than the -m <float> threshold, and not mapped anywhere with more than the -d <float> threshold (refer to Bahn et al., 2012). It aims to predict SNP or RNA editing sites accurately by minimizing false positives. The recommended value is 0.09, and it should be in (m, 0.3). If -d is set, RASER reports the best (unique) result (ignoring -p <int>). Default: not set
- -b <float> (Highly recommended for SNP or editing site detection.) Set this to find the obviously best mapping of a read (pair), whose mismatch score is less than -m <float> and also less than the mismatch scores of all other mappings by -b <float>. It aims to maximize the mapping rate as well as to minimize the false positive rate. The recommended value is 0.03, and it should be in (0, 0.2) (refer to Ahn et al., 2015). If -b is set, RASER reports the best (unique) result (ignoring -p <int>). Default: not set
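The way -d and -b decide whether a read is kept can be illustrated with a toy shell function. This is only my reading of the option descriptions above, not RASER's actual code: the function name `classify_read`, the use of `-` for "not set", treating -m as an inclusive maximum, and reading -d as "no other mapping at or below the -d ratio" are all assumptions. The input is a list of mismatch ratios, one per candidate mapping of the same read.

```shell
# classify_read M D B RATIO [RATIO ...]
# Prints "report" if the best candidate mapping would be kept, else "discard".
# Pass "-" for D or B to leave that filter unset.
classify_read() {
  cr_m="$1"; cr_d="$2"; cr_b="$3"
  shift 3
  # A read with no candidate mapping is never reported.
  [ "$#" -gt 0 ] || { echo discard; return 0; }
  # Sort the ratios so that line 1 is the best and line 2 the runner-up.
  printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk -v m="$cr_m" -v d="$cr_d" -v b="$cr_b" '
    NR == 1 { best = $1 + 0 }
    NR == 2 { second = $1 + 0; has_second = 1 }
    END {
      verdict = "report"
      if (best > m + 0)
        verdict = "discard"   # best mapping exceeds the -m mismatch ratio
      else if (d != "-" && has_second && second <= d + 0)
        verdict = "discard"   # double filtering: another mapping within -d
      else if (b != "-" && has_second && second - best < b + 0)
        verdict = "discard"   # best mapping does not win by a margin of -b
      print verdict
    }'
}

classify_read 0.05 - 0.03 0.01 0.08    # -b 0.03, margin 0.07: report
classify_read 0.05 - 0.03 0.01 0.02    # -b 0.03, margin 0.01: discard
classify_read 0.05 0.09 - 0.02 0.07    # -d 0.09, second hit at 0.07: discard
```

In this reading, -d rejects any read that has a second plausible location at all, while -b keeps the read as long as the best location is clearly ahead of the rest, which would fit the statement that -b also aims to maximize the mapping rate.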
Citation
RASER: reads aligner for SNPs and editing sites of RNA.
Ahn J, Xiao X.
Bioinformatics. 2015 Dec 15;31(24):3906-13. doi: 10.1093/bioinformatics/btv505. Epub 2015 Aug 30.