高速かつ高感度なRNA/DNAのアライナー HPG Aligner

　ハイスループットシーケンサーの最新世代は、前例のないスケールでデータを生成し、関連するシーケンシングコストが連続的に減少している。特に、トランスクリプトームの包括的なプロファイルを提供するRNAシーケンシング（RNA-seq）技術（論文より ref.1）は、従来のマイクロアレイをますます置き換えている（ref.2）。RNA-seqにおける一次データプロセシング（ゲノムリシークエンシングを含む他の大規模シーケンシング実験）は、リファレンスゲノムへのリードのマッピングを含む。この段階は、計算上高価なプロセスであり、加えて感度が重大な関心事である（ref.3）。さまざまなプログラムが存在し、その多くは、検索スピードを大幅に高速化させるindex技術であるBWT（Burrows-Wheeler Transform）アルゴリズムを実装している。

　近年提案されているマッピングアルゴリズムは、DNAリードの約98〜99％をマップする（これまでの利用可能なマッパーで得られた96％に対してわずか2％のゲイン）。しかし、RNA-seqのシナリオはこれらの基準からは程遠い。転写物のシーケンシングリードのマッピングは、単純にゲノム配列にリードをマッピングするよりもはるかに複雑な問題である。真核生物のトランスクリプトームは複雑で、平均で１遺伝子あたり９つの転写物をもち（ref.12）、250bpのエキソンを遺伝子あたり12個以上持ち、その長さは何十万bpにもおよぶ。さらに新しい転写物アイソフォームが絶えず発見されている（ref.13)。シーケンシングエラーを伴う現実的なシナリオでは、 TopHatなど（ref.14）は、100 bpのリードの70％しか正確にマッピングせず、ようやくGSNAPによって最高の感度（82％）が達成されている（ref.15）。この新しい報告によると、RUM（ref.17）、STAR（ref.18）またはMapSplice（ref.19）のような他のRNA-seqマッパーは、シーケンシングリード精度がさらに劣る。しかし、より最近の報告から（ref.20）、STARやMapSpliceのパフォーマンスは他の方法と比較してより良好で、これは以前のベンチマークと一致している（ref.17）。リード長が長くなると（シーケンシングマシンの連続的なアップグレードの自然な傾向）感度が低下し、感度とランタイムの点で重大な問題を引き起こす。ほとんどのBWTベースのマッパーでは問題がある。なぜなら、BWTベースのマッパーは少数のミスマッチしか許さないためである（変化が少ないシナリオで考えても現在のリード長では不十分）。Bower2（ref.10）を使用するTopHat2（ref.14）のような新しいアライナーはマッピングステップを大幅に加速したが（感度のトレードオフが小さい）、ジャンクション検出の戦略は変わらない。この研究が検討されている間、BWTとFerragina-Manzini indexをスキームに組み合わせた新しいマッパーHISAT （ref.21）がpublishされた。他と同様に、スピードアップは感度とのトレードオフで達成されたようである。

　トランスクリプトーム解析のためにRNA-seqを使う関心の高さや定量のためのアルゴリズムの進歩、transcripts発見のためのアルゴリズムの進歩にも関わらず、基本的な問題であるリードのマッピング問題が解決されていない。

　ここでは、BWTによるマッピングとSmith-Waterman（SW）とのローカルアラインメントの組み合わせに基づいて、ショートリードとロングリード両方のハイクオリティなマッピングのための革新的なソリューションを提案する。これは、マッピング精度を大幅に高め（典型的なシナリオで現在のマッパーが60 ％に対して95 %）、ランタイムを大幅に短縮する（TopHat2と比べて約15倍、MapSpliceで8倍、STARで2倍）。さらに、提案された戦略は、indelとミスマッチに対して非常に堅牢であることも示されている。この提案は、多数のミスマッチやindelを含むものであっても、ほぼすべてのリードをマッピングするシンプルで高速で洗練されたソリューションを提供する。この解決策はマッピングステップにおいてかなりの時間を節約し、その結果、シーケンスデータ処理の現在のパイプライン高速化に決定的に寄与する。さまざまなハイパフォーマンスコンピューティング (HPC) テクノロジを組み合わせたHPG Alignerは、ショートリードとロングリード両方で優れたパフォーマンスを示す。

チュートリアル

https://github.com/opencb/hpg-aligner/wiki/Tutorial

インストール

cent os6でテストした。

Github

yum install gcc 
yum install scons 
yum install zlib-devel libcurl-devel libxml2-devel ncurses-devel gsl-devel check-devel


git clone https://github.com/opencb/hpg-aligner.git 
cd hpg-aligner
git submodule update --init 
cd lib/hpg-libs 
git checkout master 
cd ../.. 
git checkout master 
scons
cd bin/

> ./hpg-aligner -h

$ ./hpg-aligner -h

Usage:

hpg-aligner {dna | rna | build-bwt-index | build-sa-index} [-f|--fq|--fastq=<file>] [-j|--fq2|--fastq2=<file>] [-z|--gzip] [-i|--index=<file>] [-o|--outdir=<file>] [--filter-read-mappings=<int>] [--filter-seed-mappings=<int>] [--min-cal-size=<int>] [-t|--cpu-threads=<int>] [--read-batch-size=<int>] [--sw-match=<double>] [--sw-mismatch=<double>] [--sw-gap-open=<double>] [--sw-gap-extend=<double>] [--min-score=<int>] [--paired-min-distance=<int>] [--paired-max-distance=<int>] [--report-best] [--report-all] [--report-n-best=<int>] [--report-n-hits=<int>] [--report-only-paired] [--prefix=<string>] [-l|--log-level=<int>] [-h|--help] [--bam-format] [--indel-realignment] [--recalibration] [-a|--adapter=<string>] [-v|--version] [--num-seeds=<int>]

-f, --fq, --fastq=<file> Reads file input. For more than one file: f1.fq,f2.fq,...

-j, --fq2, --fastq2=<file> Reads file input #2 (for paired mode)

-z, --gzip FastQ input files are gzip

-i, --index=<file> Index directory name

-o, --outdir=<file> Output directory

--filter-read-mappings=<int> Reads that map in more than <n> locations are discarded

--filter-seed-mappings=<int> Seeds that map in more than <n> locations are discarded

--min-cal-size=<int> Minimum CAL size

-t, --cpu-threads=<int> Number of CPU Threads

--read-batch-size=<int> Batch Size

--sw-match=<double> Match value for Smith-Waterman algorithm

--sw-mismatch=<double> Mismatch value for Smith-Waterman algorithm

--sw-gap-open=<double> Gap open penalty for Smith-Waterman algorithm

--sw-gap-extend=<double> Gap extend penalty for Smith-Waterman algorithm

--min-score=<int> Minimum score for valid mappings (0 to 100 for RNA)

--paired-min-distance=<int> Minimum distance between pairs

--paired-max-distance=<int> Maximum distance between pairs

--report-best Report all alignments with best score

--report-all Report all alignments

--report-n-best=<int> Report the <n> best alignments

--report-n-hits=<int> Report <n> hits

--report-only-paired Report only the paired reads

--prefix=<string> File prefix name

-l, --log-level=<int> Log debug level

-h, --help Help option

--bam-format BAM output format (otherwise, SAM format. This option is only available for SA mode, BWT mode always report in BAM format), this option turn the process slow

--indel-realignment Indel-based realignment

--recalibration Base quality score recalibration

-a, --adapter=<string> Adapter sequence in the read

-v, --version Display the HPG Aligner version

--num-seeds=<int> Number of seeds

> hpg-aligner build-bwt-index -h

$ hpg-aligner build-bwt-index -h

Usage:

hpg-aligner {build-sa-index | build-bwt-index} [-i|--index=<file>] [-g|--ref-genome=<file>] [-h|--help] [-v|--version] [-r|--index-ratio=<int>]

-i, --index=<file> Index directory name

-g, --ref-genome=<file> Reference genome

-h, --help Help option

-v, --version Display HPG Aligner version

-r, --index-ratio=<int> BWT index compression ratio

実行方法

RNAのマッピング。

1、index作成。Burrows-Wheeler変換してメモリー使用量を抑える。

mkdir index_dir #indexディレクトリの作成
hpg-aligner build-bwt-index -g ref.fasta -i index_dir -r 8

2、マッピング。デフォルトでは全スレッドが使用される。

hpg-aligner rna -i index_dir -f pair1.fq -j pair2.fq -o output-dir

output-dir/にbam出力される。

パフォーマンス比較。RNAはページの一番下にあります。

https://github.com/opencb/hpg-aligner/wiki/Results-and-benchmarks

引用

Highly sensitive and ultrafast read mapping for RNA-seq analysis

I. Medina, J. Tárraga, H. Martínez, S. Barrachina, M. I. Castillo, J. Paschall, J. Salavert-Torres, I. Blanquer-Espert, V. Hernández-García, E. S. Quintana-Ortí, and J. Dopazo

DNA Res. 2016 Apr; 23(2): 93–100.