Hercules: a hybrid error correction tool for long reads

2018-10-15: typo fixes

2019-05-23: changed to "make -j 8"; added docker help

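High-throughput sequencing (HTS) technologies have revolutionized genomics, but they have two fundamental limitations. First, no platform can yet produce chromosome-length reads; depending on the platform, the average read length ranges from 100 bp to 20 kb. Second, reads are not error-free. Illumina, the most widely used platform, produces the most accurate (~0.1% error rate) but also the shortest (100-150 bp) reads (ref.1). Short read lengths pose challenges for accurate and reproducible analyses (ref.2,3) and for building reliable assemblies (ref.4-6). Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing, by contrast, produces reads longer than 10 kb on average, but with a substantially higher error rate (~15%) (ref.7). Similarly, the Oxford Nanopore Technologies (ONT) platform can produce even longer reads (up to ~900 kb), but its error rate is also higher (>15%) (ref.8). High base-pair accuracy can be achieved with PacBio or ONT reads alone (ref.9), but this requires very high coverage.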

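The complementary strengths and weaknesses of these platforms make combining them attractive. Such a combination lets researchers exploit the long reads produced by PacBio and ONT while obtaining accuracy comparable to Illumina reads. The drawback is that errors cannot be corrected in genomic regions where no Illumina reads were obtained. Nevertheless, several hybrid error correction methods have been developed, and they fall into two main categories. The first approach is implemented in tools such as PacBioToCA (ref.10), LSC (ref.11), proovread (ref.12) and Colormap (ref.13). These start by aligning short reads to long reads generated from the same sample. They then exploit the accuracy of the relatively high-coverage short reads, correcting errors in a long read by computing the consensus of the short reads that span the same segment. The second approach aligns long reads to a de Bruijn graph built from the short reads; the k-mers of the graph that a long read connects to are merged into a new, corrected long read. Examples are LoRDEC (ref.14), Jabba (ref.15) and HALC (ref.16). LoRMA (ref.17), although it is a de Bruijn graph-based algorithm, is not a hybrid tool because it uses only long reads for correction.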
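The consensus idea behind the first category can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the short reads are placed without gaps at known start positions, only substitutions are corrected, and the function name and toy sequences are made up for this example. Real tools such as PacBioToCA or proovread also handle insertions and deletions taken from the alignment.

```python
from collections import Counter

def consensus_correct(long_read, alignments):
    """Correct substitutions in long_read by per-position majority vote.

    alignments is a list of (start, short_read) pairs. Gap-free placement
    is an assumption of this sketch, not something real aligners guarantee.
    """
    votes = [Counter() for _ in long_read]
    for start, short in alignments:
        for offset, base in enumerate(short):
            pos = start + offset
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1
    corrected = []
    for original, counter in zip(long_read, votes):
        if counter:
            # Take the base supported by the most short reads
            corrected.append(counter.most_common(1)[0][0])
        else:
            # No short-read coverage here, so the base cannot be corrected
            corrected.append(original)
    return "".join(corrected)

long_read = "ACGTTCGATTA"  # toy long read with one substitution at index 4
alignments = [(0, "ACGTAC"), (2, "GTACGA"), (4, "ACGATT")]
print(consensus_correct(long_read, alignments))  # ACGTACGATTA
```

The last base of the toy read has no short-read coverage and is left unchanged, which mirrors the limitation noted above for regions without Illumina reads.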

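Both approaches work well in some cases, but each has drawbacks. Alignment-based approaches depend heavily on the performance of the aligner, so the aligner's accuracy, run time and memory usage directly affect the downstream correction tool. The de Bruijn graph approach removes the dependence on an external aligner and implicitly moves the consensus computation into graph construction. However, even though short reads have a very low error rate and the graph is built from k-mers that are shorter still, the resulting de Bruijn graph contains bulges and tips, which are usually treated as errors and removed (ref.18). Whether such graph elements can be removed accurately depends on having coverage high enough to reliably distinguish true edges from erroneous k-mers (ref.19,20).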
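How a single short-read error creates a tip, and why coverage matters for removing it, can be seen in a toy de Bruijn graph. The sketch below is an assumption-laden illustration: the reads, k=4 and the count threshold are invented, and flagging k-mers by a simple count cutoff is a simplification of the graph cleaning that tools like LoRDEC, Jabba or HALC actually perform.

```python
from collections import Counter, defaultdict

def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def build_de_bruijn(counts):
    """Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for kmer in counts:
        graph[kmer[:-1]].add(kmer[1:])
    return graph

def weak_kmers(counts, min_count):
    """k-mers seen fewer than min_count times (candidate sequencing errors)."""
    return {kmer for kmer, count in counts.items() if count < min_count}

# Four error-free toy reads plus one read with a substitution (ACGTTC)
reads = ["ACGTAC", "ACGTAC", "CGTACG", "CGTACG", "ACGTTC"]
counts = kmer_counts(reads, k=4)
graph = build_de_bruijn(counts)

print(sorted(graph["CGT"]))            # ['GTA', 'GTT']: the error opens a side branch
print(sorted(weak_kmers(counts, 2)))   # ['CGTT', 'GTTC']: low-count k-mers on that branch
```

With only one or two reads over a region, a true k-mer would have the same count as an erroneous one, which is why the threshold only separates them reliably at high coverage.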

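This paper introduces Hercules, a new alignment-based Hybrid ERror Correction algorithm that improves the base-pair accuracy of long reads. Hercules is the first machine learning-based long-read error correction algorithm. It models each erroneous long read as a template profile hidden Markov model (profile HMM, or pHMM). It then uses the short reads as observations to train the model with the Forward-Backward algorithm (ref.21), learning posterior transition and emission probabilities. Finally, Hercules decodes the most probable sequence of each profile HMM with the Viterbi algorithm (ref.22). HMMs have been used before for short-read error correction (ref.23,24), but this is their first use for long-read error correction.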
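For readers unfamiliar with Viterbi decoding, here is a minimal log-space implementation on a textbook-style two-state HMM. It is not Hercules's pHMM, which has match, insertion and deletion states for each long-read position; the states "H" and "L" and all probabilities below are illustrative values chosen for this sketch.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations (log-space)."""
    # score[s] is the best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_score, pointer = {}, {}
        for s in states:
            # Best predecessor of s at this step
            best_prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[best_prev] + math.log(trans_p[best_prev][s])
                            + math.log(emit_p[s][symbol]))
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]

states = ["H", "L"]  # toy hidden states, e.g. GC-rich and GC-poor regions
start_p = {"H": 0.5, "L": 0.5}
trans_p = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit_p = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
          "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

print("".join(viterbi("GGCACTGAA", states, start_p, trans_p, emit_p)))  # HHHLLLLLL
```

In Hercules the transition and emission probabilities are not fixed like this but are updated from the short reads by Forward-Backward before decoding.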

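The main advantages of Hercules over other alignment-based tools come from its use of pHMMs: (i) it depends less on the aligner's performance, and (ii) it can directly incorporate the experimentally observed error profile of the long reads and be adapted directly to new sequencing platforms. Alignment-based methods rely on the aligner's full CIGAR string for every base pair they correct. They resolve disagreements among short reads by majority vote, under the assumption that all error types are equally likely at correction time. Even though an aligner can take the likelihood of different error types into account, the correction step still depends on the choice of aligner. Hercules, in contrast, uses the start position reported by the aligner but no other information from it. Instead of treating each base pair independently in a majority vote, it accounts for the evidence provided by each short read sequentially and probabilistically. In addition, the HMM prior probabilities for the error types can be configured according to the error profile of the platform being processed. Because the priors need not be uniform, the algorithm is better placed to predict the posterior outcome and can be adapted to the long-read technology in use.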

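The authors compared Hercules with other methods on the following datasets: (i) BAC clones of two complex regions of human chromosome 17, CH17-157L1 and CH17-227A2 (ref.7), and (ii) the human hydatidiform mole cell line CHM1 (ref.25). As ground truth they used, for (i), finished sequences assembled from Sanger sequencing data of the same samples and, for (ii), the CHM1_1.1 assembly (ref.25).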

On the BAC clones, Hercules showed the highest mapping rate when short-read coverage was high, and produced the most accurate read set (i.e. >95% accuracy) when most (90%) of each long read was covered by short reads. (Part omitted.)

Hercules was also shown to be one of only two algorithms that scale to problems of this size. Even with moderate short-read coverage (below 40x), Hercules produces the most accurate reads (i.e. >95%), a 128% improvement over LoRDEC.

Overview of the Hercules algorithm. Reproduced from the paper.


Installation

Tested on macOS 10.13.

Build

Make sure you have a compiler that has support for C++14

Dependencies

SAMtools
Bowtie2

Hercules itself (GitHub)

git clone https://github.com/BilkentCompGen/hercules.git
cd hercules/src/
make -j 8
cd ../bin/

$ ./hercules -h

Hercules: A Profile HMM-based hybrid error correction algorithm for long reads
==============================================================================

SYNOPSIS

DESCRIPTION

OPTIONS
    -h, --help
          Display the help message.
    --version
          Display version information.
    -1, --preprocess
          Compresses the required reads and creates an proper fasta files and an index file for the long read. Created reads should be provided to a aligner
    -2, --correct
          Corrects the long reads using the alignment and index file created in the preprocess step

VERSION
    Last update: November 2017
    Hercules: A Profile HMM-based hybrid error correction algorithm for long reads version: 0.1
    SeqAn version: 2.4.0

$ ./hercules -1 -h

Hercules: A Profile HMM-based hybrid error correction algorithm for long reads
==============================================================================

SYNOPSIS

DESCRIPTION

OPTIONS
    -h, --help
          Display the help message.
    --version
          Display version information.
    -li, --longInputFile INPUT_FILE
          fast{a,q} file which contains original long reads
    -si, --shortRead List of INPUT_FILE's
          Short reads file to align to the long reads. You may define as many short reads file as you wish with multiple -si options.
    -o, --outputDir OUTPUT_FILE
          Preprocessing directory where the resulting files will be written. This directory **MUST** exist beforehand.
    -nonN, --nonN INPUT_FILE
          **Compressed** short read should have at least nonN many non-N characters not to be filtered out for the alignment phase. Default: 40.
    -b, --bloomFilter
          Apply bloom filter to remove duplicates.
    -nc, --noCompression
          Do not compress short reads. Reported short reads will only be filtered out according to nonN value and bloom filter if it is set. You need to set the same option in the correction phase as well.

VERSION
    Last update: November 2017
    Hercules: A Profile HMM-based hybrid error correction algorithm for long reads version: 0.1
    SeqAn version: 2.4.0

$ ./hercules -2 -h

Hercules: A Profile HMM-based hybrid error correction algorithm for long reads
==============================================================================

SYNOPSIS

DESCRIPTION

OPTIONS
    -h, --help
          Display the help message.
    --version
          Display version information.
    -li, --longInputFile INPUT_FILE
          fast{a,q} file which contains original long reads
    -ai, --alignmentFile INPUT_FILE
          {s,b}am file which contains alignments of short reads to long reads.
    -si, --shortRead INPUT_FILE
          **Uncompressed** short read file created during the preprocessing step
    -o, --outputFile OUTPUT_FILE
          Output file to write the resulting reads
    -c, --outputCoverage
          If specified, Hercules creates another file within the same folder of corrected reads, which reports how much of a read is covered by short reads
    -q, --mapQ INTEGER
          Minimum mapping quality for a long-short reads alignment to use in correction. Note that if multiple alignment specified, aligners report these mapping quality as 0 In range [0..255]. Default: 0.
    -mc, --maxCoverage INTEGER
          Maximum short read coverage per position of a long read. If provided, short reads are removed based on edit distance value. Otherwise, short reads are removed randomly. Setting 0 will use all short reads. In range [0..inf]. Default: 1.
    -mf, --filterSize INTEGER
          Filter size that allows calculation of at most mf many most probable transitions in each time step. This parameter is directly proportional to running time. In range [1..inf]. Default: 100.
    -mi, --maxInsertion INTEGER
          Maximum number of insertions in a row. This parameter is directly proportional to the running time. In range [0..inf]. Default: 3.
    -md, --maxDeletion INTEGER
          Maximum number of deletions in a row. This parameter is directly proportional to the running time. In range [0..inf]. Default: 10.
    -trm, --matchTransition DOUBLE
          Initial transition probability to a match state. See --insertionTransition as well. In range [0..1]. Default: 0.7.
    -tri, --insertionTransition DOUBLE
          Initial transition probability to a insertion state. Note that: deletion transition probability = 1 - (matchTransition + insertionTransition) In range [0..1]. Default: 0.25.
    -df, --deletionTransitionFactor DOUBLE
          Factor of the polynomial distribution to calculate each deletion transition. Higher value favors less deletions. In range [0..inf]. Default: 2.5.
    -emm, --matchEmission DOUBLE
          Initial emission probability of a match to a reference. Note that: mismatch emission probability = (1-matchEmission)/3 In range [0..1]. Default: 0.97.
    -t, --thread INTEGER
          Number of threads to use In range [1..inf]. Default: 1.
    -nc, --noCompression
          Set this option if short and long reads are not compressedin the preprocessing step.
    -nv, --noVerbose
          Hercules runs quitely with no informative output

VERSION
    Last update: November 2017
    Hercules: A Profile HMM-based hybrid error correction algorithm for long reads version: 0.1
    SeqAn version: 2.4.0
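The two formulas quoted in the -tri and -emm help text fix the remaining initial probabilities once the defaults are known. The short calculation below works them out; the function name is made up for this example, and only the formulas and default values come from the help output. How -df shapes the individual deletion transitions is not spelled out in the help text, so it is left out.

```python
import math

def derived_priors(match_transition=0.7, insertion_transition=0.25, match_emission=0.97):
    """Apply the two formulas from the help text to the -trm, -tri and -emm values."""
    if match_transition + insertion_transition > 1:
        raise ValueError("matchTransition + insertionTransition must not exceed 1")
    # deletion transition probability = 1 - (matchTransition + insertionTransition)
    deletion_transition = 1 - (match_transition + insertion_transition)
    # mismatch emission probability = (1 - matchEmission) / 3, one share per other base
    mismatch_emission = (1 - match_emission) / 3
    return deletion_transition, mismatch_emission

deletion, mismatch = derived_priors()
print(f"deletion transition = {deletion:.2f}")           # 0.05
print(f"mismatch emission per base = {mismatch:.2f}")    # 0.01
```

With the defaults, a match state therefore starts out emitting the long-read base with probability 0.97 and each of the three other bases with probability 0.01.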

Alternatively, build the Docker image.

cd hercules/docker/
docker build -t hercules:latest . 

#help
docker run hercules -h

Usage

1. Preprocessing

# Create the output directory first; Hercules fails if it does not exist
mkdir preprocessing

# Run
hercules -1 -li long.fasta -si short_1.fastq -si short_2.fastq -o preprocessing/

2. Mapping

The shell script utils/runBowtieRmDup.sh automates building the bowtie2 index and running the bowtie2 mapping.

./utils/runBowtieRmDup.sh preprocessing/compressed_long.fasta preprocessing/compressed_short.fasta bowtie 30

3. Creating the BAM

Remove redundant reads and create a sorted BAM, using the shell script utils/afteralignment.sh.

./utils/afteralignment.sh bowtie/alignment.bam output_alignment.bam 30 8G

The sorted BAM is written to bowtie/.

4. Error correction with Hercules

Specify the original long reads, the short reads output in step 1, and the .bam output in step 3.

hercules -2 -li long.fasta -ai alignment.bam -si preprocessing/short.fasta -t 30 -o corrected_long.fasta

corrected_long.fasta is written.
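The four steps can be collected into one small Python helper that only assembles the command lines, so they can be reviewed before anything is run. This is a sketch under assumptions: the function and parameter names are invented, the trailing "30" and "8G" arguments of the utility scripts are treated as thread count and sort memory (the post does not explain them), and the path of the sorted BAM passed to -ai is left as a required parameter because the post does not state it unambiguously.

```python
import shlex

def hercules_commands(long_reads, short_reads, sorted_bam,
                      prep_dir="preprocessing", align_dir="bowtie",
                      corrected="corrected_long.fasta", threads=30, sort_mem="8G"):
    """Return the commands from this post as argv lists. Nothing is executed."""
    preprocess = ["hercules", "-1", "-li", long_reads]
    for path in short_reads:
        preprocess += ["-si", path]  # one -si option per short-read file
    preprocess += ["-o", f"{prep_dir}/"]
    return [
        ["mkdir", "-p", prep_dir],  # the output directory must exist beforehand
        preprocess,
        ["./utils/runBowtieRmDup.sh", f"{prep_dir}/compressed_long.fasta",
         f"{prep_dir}/compressed_short.fasta", align_dir, str(threads)],
        ["./utils/afteralignment.sh", f"{align_dir}/alignment.bam",
         "output_alignment.bam", str(threads), sort_mem],
        ["hercules", "-2", "-li", long_reads, "-ai", sorted_bam,
         "-si", f"{prep_dir}/short.fasta", "-t", str(threads), "-o", corrected],
    ]

commands = hercules_commands("long.fasta", ["short_1.fastq", "short_2.fastq"],
                             sorted_bam="alignment.bam")
for argv in commands:
    print(shlex.join(argv))  # shlex.join requires Python 3.8 or later
```

Keeping each command as a list makes it straightforward to hand it to subprocess.run later without shell quoting problems.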

Citation

Hercules: a profile HMM-based hybrid error correction algorithm for long reads
Firtina C, Bar-Joseph Z, Alkan C, Cicek AE

Nucleic Acids Res. 2018 Aug 16
