deSALT: an aligner for error-prone long RNA-seq reads that handles splice junctions well

2019-12-17: added the published paper

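RNA sequencing has become a fundamental approach for characterizing transcriptomes. It can reveal precise gene structures and quantify gene/transcript expression [ref.1-5], and it also enables variant calling [ref.6], RNA editing analysis [ref.7-8], and gene fusion detection [ref.9, 10]. However, because of the drawbacks of short-read sequencing, such as limited read length and systematic biases from library preparation, accurately aligning reads [ref.11], reconstructing gene isoforms [ref.12], and quantifying transcripts remain bottlenecks in transcriptome studies.
Two groundbreaking long-read sequencing technologies have emerged, single-molecule real-time (SMRT) sequencing from Pacific Biosciences (PacBio) [ref.13] and nanopore sequencing from Oxford Nanopore Technologies (ONT) [ref.14], and they can get around the short-read bottlenecks in transcriptome analysis. Both produce much longer reads: the mean read length exceeds 10 kb and the maximum read length exceeds several hundred kb [ref.15, 16]. With this advantage, a full-length transcript can be sequenced in a single read, which is expected to substantially improve the accuracy of gene isoform reconstruction. In addition, there is less systematic bias in the sequencing procedure [ref.17], which also benefits the quantification of gene/transcript expression.
Alongside these advantages, PacBio and ONT reads have much higher sequencing error rates than short reads. For PacBio SMRT sequencing, the error rate of raw reads ("subreads") is about 10%-20% [ref.16]. For ONT nanopore sequencing, the error rates of 1D and 2D (also known as 1D2) reads are about 25% and 12%, respectively [ref.18, 19]. The PacBio SMRT platform can also produce reads of insert (ROI) by sequencing a circular fragment multiple times, which greatly reduces sequencing errors, but this lowers sequencing yield considerably and shortens the reads. Sequencing errors pose new technical challenges for RNA-seq data analysis. Read alignment is probably the step most affected, and because it underlies many downstream analyses, the effect may not be limited to read alignment itself.
Previous studies [ref.20-22] have shown that aligning noisy long DNA-seq reads is a non-trivial task. Many technical issues have to be handled well, such as serious sequencing errors, potential genome variants, and long read lengths. For RNA-seq long-read alignment the task is even harder, because in addition to the issues above the aligner has to deal with numerous splicing events. This requires the aligner to be highly capable of carrying out very complicated split alignment (also called "spliced alignment") in order to correctly recognize many splice junctions and map bases to the corresponding exons. Most proposed DNA-seq long-read alignment approaches can perform split alignment to handle genome structural variations (SVs) [ref.21-23], but splice junctions occur far more frequently, and exon lengths are much shorter and more divergent. Tailored algorithms are still in demand.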

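Several approaches support RNA-seq long-read alignment, such as BBMap [ref.24], GMAP [ref.25], STAR [ref.26], BLAT [ref.27], and Minimap2 [ref.28]. All of them are based on the commonly used seed-and-extend strategy, and they implement various seeding and extension methods to address the problems of RNA-seq long-read alignment. All of them can handle splice junctions. However, most of these algorithms are relatively slow [ref.28], because the seeding step produces a huge number of short matches and the extension step relies on time-consuming local alignment. Moreover, some of the algorithms also have low sensitivity [ref.29]; that is, many reads are left unaligned or only partially aligned. An outstanding algorithm is Minimap2. It simultaneously achieves speed tens of times faster than other state-of-the-art aligners and similar or higher sensitivity. This is mainly thanks to its well-designed minimizer-based indexing [ref.30] and SSE-based local alignment method [ref.31].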
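To make the idea of minimizer-based seeding a little more concrete, here is a minimal sketch in Python. It is a textbook-style simplification and not Minimap2's actual implementation: it assumes plain lexicographic ordering of k-mers on one strand only, whereas real tools typically hash k-mers and consider both strands. The function name and the toy values of k and w are made up for illustration.

```python
def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (the leftmost one on ties). Simplified illustration only.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal element, i.e. the leftmost on ties
        best = min(range(w), key=lambda j: window[j])
        picked.add((start + best, window[best]))
    return sorted(picked)


if __name__ == "__main__":
    for pos, kmer in minimizers("GATTACAGATTACAGGCT", k=5, w=4):
        print(pos, kmer)
```

Because neighbouring windows usually share the same smallest k-mer, only a subset of all k-mers is kept, which is what keeps the index and the number of seed hits small.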
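In absolute terms, the ultimate goal of this task is to correctly map every base of every read. In several respects, however, this may still be non-trivial for state-of-the-art aligners. One is aligning the bases of relatively short exons, for example exons of only a few tens of bp. With serious sequencing errors and potential variants, it is extremely hard to find seeds in the part of the read that derives from such a short exon, so that part of the read is usually left unaligned or misaligned. Another problem is that the bases near splice junctions are hard to align correctly. This problem also exists in short RNA-seq read alignment, but it becomes more serious when aligning noisy long RNA-seq reads. Furthermore, because of sequencing errors, the alignments of reads from the same gene isoform usually diverge from each other, which also misleads downstream analysis.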
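A rough back-of-the-envelope calculation shows why short exons are hard to seed. The sketch below assumes that errors hit each base independently with the same probability, which is a simplification of real PacBio/ONT error profiles, and it uses a 15% error rate and 15-mer seeds purely as example values (they match the CLR error rate and the default seeding k-mer length shown in the deSALT help output further down). The function names are hypothetical.

```python
def p_error_free_kmer(error_rate, k):
    """Probability that k consecutive read bases contain no error,
    assuming independent per-base errors (a simplifying assumption)."""
    return (1.0 - error_rate) ** k


def expected_clean_seeds(exon_len, error_rate, k):
    """Expected number of error-free k-mers among all overlapping k-mers
    that fit inside a read segment of exon_len bases."""
    n_kmers = max(exon_len - k + 1, 0)
    return n_kmers * p_error_free_kmer(error_rate, k)


if __name__ == "__main__":
    for exon_len in (30, 100, 1000):
        print(exon_len, round(expected_clean_seeds(exon_len, 0.15, 15), 2))
```

Under these assumptions a single 15-mer survives error-free with a probability of only about 9%, so a 30 bp exon is expected to contain only about one clean seed, and fewer still if seeds are sampled at an interval rather than at every position.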
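Here, deSALT (de Bruijn graph-based Spliced Aligner for Long Transcriptome reads) is proposed. deSALT is a fast and accurate RNA-seq long-read alignment approach that takes advantage of a novel two-pass read alignment strategy based on a de Bruijn graph-based index. It can handle complicated gene structures and serious sequencing errors well to produce more sensitive, accurate, and consensus alignments. For most reads, deSALT can produce full-length alignments that fully recover the exons and splice junctions along the entire read. In addition, deSALT's speed is equal to or faster than that of state-of-the-art approaches. We (the authors of the paper) believe it has the potential to play an important role in many forthcoming transcriptome studies.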
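As an illustration of what "recovering exons and splice junctions" means in practice, spliced aligners that write SAM record introns as N operations in the CIGAR string. The following sketch reads the exon and intron intervals off a single alignment. It relies only on the SAM conventions (1-based POS; M, D, N, = and X consume the reference) and assumes a mapped read with a valid CIGAR; the function name is made up for this example and is not part of deSALT.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance the reference position


def exons_and_introns(pos, cigar):
    """Split one spliced alignment into exon and intron intervals.

    pos   : 1-based leftmost reference position (SAM POS field)
    cigar : CIGAR string of a mapped read
    Returns (exons, introns) as lists of 1-based inclusive (start, end).
    """
    exons, introns = [], []
    ref = pos
    exon_start = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region on the reference = intron
            exons.append((exon_start, ref - 1))
            introns.append((ref, ref + length - 1))
            ref += length
            exon_start = ref
        elif op in REF_CONSUMING:
            ref += length
    exons.append((exon_start, ref - 1))
    return exons, introns


if __name__ == "__main__":
    print(exons_and_introns(100, "5S50M1000N30M2D20M"))
```

Comparing the intron intervals obtained this way across reads from the same isoform is one simple way to see whether an aligner's junction calls agree with each other.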

An example of the alignments of simulated reads by various aligners. Reproduced from the preprint.

Tweet about deSALT

Liu, Liu, Zang, Wang and co present deSALT, for aligning long RNA-seq reads. it takes a two-pass approach, using a graph-based index to match blocks between read and genome, and then relocating short matches between read and detected exons. https://t.co/HVtdirlDt0 pic.twitter.com/tj9PNjoTQm
- Genome Biology (@GenomeBiology) December 16, 2019

Installation

Dependencies

# On a fresh Ubuntu install
sudo apt update && sudo apt install git make gcc zlib1g zlib1g-dev

deSALT itself (GitHub)

git clone https://github.com/ydLiu-HIT/deSALT.git
cd deSALT/src
make

> ./deSALT

# ./deSALT

Program: deSALT (Third generation RNA sequence alignment)

Version: 1.0

Contact: Yadong Liu <hitliuyadong1994@163.com>

Usage: deSALT <command> [options]

Command:

index index reference sequence

aln align long RNA sequence to reference

> deSALT index

# deSALT index

Usage: deSALT index <ref.fa> <index_route>

build deBGA index file with default 22-kmer. You can get more deBGA information from https://github.com/HongzheGuo/deBGA

> deSALT aln

# deSALT aln

[Main] deSALT - De Bruijn graph-based Spliced Aligner for Long Transcriptome reads

Program: de Brijn Graph-based 3rd RNA sequence alignment

Usage: deSALT aln [options] <index_route> <read.fa/fq>

Algorithm options:

-t --thread [INT] Number of threads. [1]

-K --index-kmer [INT] K-mer length of RdBG-index. [22]

-k --seeding-kmer [INT] K-mer length of seeding process (no long than RdBG-index). [15]

-a --local-hash-kmer [INT] K-mer length of local hash process. [8]

-s --seed-step [INT] The interval of seeding. [5]

-B --batch-size [INT] The number of reads to be processed in one loop. [100000]

-n --max-uni-pos [INT] Maximum allowed number of hits per seed. [50]

-l --max-readlen [INT] Maximum allowed read length. [1000000]

-i --min-frag-dis [INT] Maximum allowed distance of two fragment can be connect. [20]

-I --max-intron-len [INT] maximum allowed intron length. [200000]

-c --min-chain-score [INT] minimal skeleton score(match bases minus gap penalty). [20]

-d --strand-diff [INT] The minimal difference of dp score by two strand to make sure the transcript strand. [20]

-g --max-read-gap [INT] Maximum allowed gap in read when chaining. [2000]

-p --secondary-ratio [FLOAT] Min secondary-to-primary score ratio. [0.90]

-e --e-shift [INT] The number of downstream (upstream) exons will be processed when left (right) extension. [5]

-G --gtf [STR] Provided an annotation file for precise intron donor and acceptor sites.

The release of annotation file and reference genome must the same!

-x --read-type [STR] Specifiy the type of reads and set multiple paramters unless overriden.

[null] default parameters.

ccs (PacBio SMRT CCS reads): error rate 1%

clr (PacBio SMRT CLR reads): error rate 15%

ont1d (Oxford Nanopore 1D reads): error rate > 20%

ont2d (Oxford Nanopore 2D reads): error rate > 12%

Scoring options

-O --open-pen [INT(,INT)]

Gap open penealty. [2,32]

-E --ext-pen [INT(,INT)]

Gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2}. [1,0]

-m --match-score [INT] Match score for SW-alginment. [1]

-M --mis-score [INT] Mismatch score for SW-alignment. [2]

-z --zdrop [INT(,INT)]

Z-drop score for splice/non-splice alignment. [400]

-w --band-width [INT] Bandwidth used in chaining and DP-based alignment. [500]

Output options

-N --top-num-aln [INT] Max allowed number of secondary alignment. [4]

-Q --without-qual Don't output base quality in SAM

-f --temp-file-perfix [STR] Route of temporary files after the first-pass alignment. [1pass_anchor]

If you run more than one tgs program in the same time,

you must point out different routes of temporary files for each program!!!

If no, every deSALT program will write temporary data to the same file which

will cause crash of program in 2-pass alignment due to inconsistent temporary data.

-o --output [STR] Output file (SAM format). [./aln.sam]
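The gap penalty described under -O and -E in the help output above, min{O1+k*E1, O2+k*E2}, is a two-piece affine cost. The sketch below simply evaluates that formula with the defaults printed in the help text (open penalties 2 and 32, extension penalties 1 and 0). It is a reading of the help message, not code taken from deSALT, and the function name is hypothetical.

```python
def gap_cost(k, open_pen=(2, 32), ext_pen=(1, 0)):
    """Cost of a gap of length k under the two-piece affine model
    min{O1 + k*E1, O2 + k*E2} quoted in the deSALT help text.

    Defaults are the values shown there: -O [2,32] and -E [1,0].
    """
    o1, o2 = open_pen
    e1, e2 = ext_pen
    return min(o1 + k * e1, o2 + k * e2)


if __name__ == "__main__":
    for k in (1, 10, 30, 1000):
        print(k, gap_cost(k))
```

With these defaults, short gaps are charged per base by the first piece, while any gap of 30 bases or more costs a flat 32 under the second piece, so very long gaps such as introns are not penalized in proportion to their length.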

> ./deBGA

# ./deBGA

Program: deBGA (De bruijn graph nucleotide alignment)

Version: 0.1

Contact: Hongzhe Guo <hzguo@hit.edu.cn>

Usage: deBGA <command> [options]

Command: index index sequences in the FASTA format

aln pair-end and single-end reads seed reduction and alignment based on De bruijn graph

> deBGA index

# deBGA index

Program: de Brijn Graph-based mapping system index building

Version: 0.1

Contact: Hongzhe Guo <hzguo@hit.edu>

Usage: deBGA index [options] reference.fasta <index_route>

Options: -k INT the k-mer length of the vertices of RdBG [20-28]

> deBGA aln

# deBGA aln

Program: de Brijn Graph-based mapping system seed reduction and alignment

Version: 0.1

Contact: Hongzhe Guo <hzguo@hit.edu>

Usage: deBGA aln [options] <index_route> <read pair-end1.fq> [read pair-end2.fq] <result_file.sam>

Options:

-k INT the minimum length of a valid Uni-MEM seed [21-28]

-s INT the number of iterations of re-seeding [4]

-i INT the minimum interval of seeding [5]

-n INT the maximum allowed number of hits per seed [300]

-c NUM the threshold on the edit distance for early stop [0.05]

--cl NUM the adjusted threshold on the edit distance for early stop [0.00]

--local the local alignment option for confident alignment

--local-match NUM the score for a matched base in the local alignment [1]

--local-mismatch NUM the penalty for a mismatched base in the local alignment [4]

--local-gap-open NUM the penalty for a gap open in the local alignment [6]

--local-gap-extension NUM the penalty for gap extension in the local alignment [1]

--stdout (default: not set) output alignments by stdout

-u INT the upper limit of insert size (only for pair-end reads) [700]

-f INT the lower limit of insert size (only for pair-end reads) [300]

-o INT the maximum number of alignment output [20]

-x INT the maximum number of alignment output for anchoring alignment [150]

-l INT the maximum allowed read length [512]

-e INT the budget for single-end alignment [100]

-p INT the number of threads [1]

Please refer to the following link for more detailed information about the options: https://github.com/HIT-Bioinformatics/deBGA

Usage

1. Indexing

deSALT index ref.fa index_dir

The index is written to index_dir/.

2. Mapping

deSALT aln index_dir long_read.fq -t 16 -o aln.sam

-t Number of threads. [1]
-o Output file (SAM format). [./aln.sam]
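The help text warns that several deSALT processes running at the same time must each get their own temporary-file route via -f, otherwise they overwrite each other's first-pass data. The sketch below builds one command line per sample with distinct -f and -o values. Only the flags -t, -f and -o and the `deSALT aln [options] <index_route> <read.fa/fq>` usage line come from the help output; the function name, sample names, and file naming scheme are assumptions for illustration, and the script only prints the commands rather than running them.

```python
import shlex


def desalt_aln_command(index_dir, reads, sample, threads=16):
    """Build a deSALT aln command with a per-sample temp prefix and output.

    The naming scheme (<sample>_1pass_anchor, <sample>.sam) is hypothetical.
    """
    return [
        "deSALT", "aln",
        "-t", str(threads),
        "-f", f"{sample}_1pass_anchor",  # unique temp route per run
        "-o", f"{sample}.sam",
        index_dir, reads,
    ]


if __name__ == "__main__":
    samples = {"sampleA": "sampleA.fq", "sampleB": "sampleB.fq"}
    for name, fq in samples.items():
        print(shlex.join(desalt_aln_command("index_dir", fq, name)))
```

The printed lines can be pasted into a shell or handed to a job scheduler; keeping the -f values distinct is the point of the example.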

Citation

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index
Bo Liu, Yadong Liu, Tianyi Zang, Yadong Wang

bioRxiv preprint first posted online Apr. 17, 2019

Addendum

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Bo Liu, Yadong Liu, Junyi Li, Hongzhe Guo, Tianyi Zang, Yadong Wang
Genome Biology volume 20, Article number: 274 (2019)
