SGA-ICE (IterativeErrorCorrection): repeated error correction with changing k-mer sizes

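A single run on an Illumina MiSeq produces 15 gigabases (Gb) of 300 bp paired-end data. The Illumina HiSeq 2500 can produce up to 300 Gb of paired-end reads of up to 250 bp. This high throughput is attractive for genome assembly. The error rate of Illumina data is below 1%, but the probability that a single read contains no error at all is low, especially for long reads of 250 or 300 bp. Genome assembly benefits from reads that are as accurate as possible, so correcting sequencing errors is an essential preprocessing step. Because errors are rare and sequencing coverage is usually high enough, an error can be corrected using the other reads that cover the same genomic locus. Error correction tools fall into two classes: k-mer-based correction, which mainly handles base substitutions, and overlap-based correction, which also corrects insertions and deletions. A detailed overview of each approach is given by Laehnemann et al. [ref. 1 in the paper].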
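To put a number on why error-free reads become rare as reads get longer, here is a minimal sketch. It assumes that errors are independent and occur at a uniform per-base rate of 1% (the upper bound quoted above); real Illumina error rates vary along the read, so the values are only indicative. The function name is made up for this illustration.

```python
def error_free_probability(read_length, per_base_error_rate):
    """Probability that a read of the given length has no error,
    assuming independent errors at a uniform per-base rate."""
    return (1.0 - per_base_error_rate) ** read_length


for length in (100, 250, 300):
    print(length, round(error_free_probability(length, 0.01), 3))
# 100 0.366
# 250 0.081
# 300 0.049
```

Under these assumptions, fewer than one in ten 250 bp reads would be completely error-free.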

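The idea behind k-mer-based error correction is that sequencing errors are not present in the genome and therefore give rise to low-frequency k-mers. Detecting and replacing these low-frequency k-mers corrects the erroneous bases. K-mer-based correction depends on the choice of k-mer size and on the count above which a k-mer is no longer considered rare. The idea behind overlap-based correction is to build a multiple alignment of similar reads, that is, reads that probably originate from the same locus. Sequencing errors are detected as rare differences in the alignment and are corrected to the consensus of the same alignment column. Overlap-based correction depends on the minimum similarity between reads, the threshold for counting a difference as rare, and the minimum number of reads supporting the consensus.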
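The k-mer idea can be sketched in a few lines of Python. This is a toy illustration and not how SGA implements it: the genome, the reads, `k = 5`, and the threshold `min_count = 3` are all invented for the example, and the helper names are hypothetical. It counts all k-mers, lists the low-frequency k-mers in one read, and reports the position that every low-frequency k-mer covers as the likely error site.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmer_starts(read, counts, k, min_count):
    """Start positions of k-mers seen fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


# Toy data: five error-free copies of a 20 bp "genome" and one read
# with a single substitution (G -> A at index 10).
genome = "ACGTTGCAAGGCTTACGATC"
bad = genome[:10] + "A" + genome[11:]
reads = [genome] * 5 + [bad]

k, min_count = 5, 3
counts = count_kmers(reads, k)
starts = weak_kmer_starts(bad, counts, k, min_count)

# A position covered by every weak k-mer is the most likely error site.
suspects = [p for p in range(len(bad))
            if starts and all(s <= p < s + k for s in starts)]

print(starts)    # [6, 7, 8, 9, 10]
print(suspects)  # [10]
```

A single substitution makes exactly the k k-mers that cover it rare, which is why both the k-mer size and the count threshold matter.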

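Most tools perform well on the data sets they were tested on and can correct nearly all errors in small genomes [ref. 20]. In complex, repeat-rich genomes such as the human genome, however, a considerable fraction of reads still contains errors after correction. For example, even after correction with the best-performing tool (Table 2 of [ref. 20]), 15 to 20% of errors remain in human 100 bp HiSeq data. Performance degrades further as reads get longer: more than half of the 250 bp MiSeq reads of the E. coli genome still contain errors (Table 3 of [ref. 20]). There are several reasons why sequencing errors remain uncorrected. First, reads with many errors are dissimilar to other reads and therefore hard to correct. Such reads are easy to discard, however, because they contain many infrequent k-mers. Second, reads from low-coverage loci may go uncorrected for lack of solid k-mers (k-mers that occur more often than a given threshold). Third, if an erroneous k-mer occurs frequently elsewhere in the genome, it may go undetected as a sequencing error; this happens in particular in repeat regions. The first two cases are computationally hard to address, but the third can be reduced with longer reads of 250 or 300 bp.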

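The paper shows that many of the reads that still carry uncorrected sequencing errors after error correction overlap repeats. In these reads, the short erroneous k-mer occurs identically in another repeat copy and is therefore mistaken for a correct one. To improve error correction of long Illumina reads, the authors developed an iterative error correction pipeline that uses modules of the String Graph Assembler (SGA) and runs k-mer-based correction several times with increasing k-mer size. They show that this iterative strategy effectively corrects errors in repeat regions and reduces the total number of erroneous reads, and that the higher read accuracy yields contigs that are 2 to 3 times longer.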
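The effect of increasing k can be reproduced on a toy example. The sketch below is not the SGA-ICE implementation: it uses a naive single-substitution corrector, an invented 51 bp genome with two 15 bp repeat copies that differ at one base, and made-up parameters (`min_count = 3`, k-mer sizes 5 and 13). All function names are hypothetical. The erroneous read comes from the first repeat copy, and its error happens to reproduce the second copy.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def has_weak_kmer(read, counts, k, min_count):
    """True if any k-mer of the read occurs fewer than min_count times."""
    return any(counts[read[i:i + k]] < min_count
               for i in range(len(read) - k + 1))


def correct_read(read, counts, k, min_count):
    """Return the first single-base substitution that leaves no weak k-mer."""
    if not has_weak_kmer(read, counts, k, min_count):
        return read  # nothing looks wrong at this k
    for pos in range(len(read)):
        for base in "ACGT":
            if base != read[pos]:
                candidate = read[:pos] + base + read[pos + 1:]
                if not has_weak_kmer(candidate, counts, k, min_count):
                    return candidate
    return read  # no single substitution helps; leave the read unchanged


def iterative_correction(reads, kmer_sizes, min_count):
    """One correction round per k; each round's output feeds the next."""
    for k in kmer_sizes:
        counts = count_kmers(reads, k)
        reads = [correct_read(r, counts, k, min_count) for r in reads]
    return reads


# Toy genome: two 15 bp repeat copies that differ only at their middle base.
left, right = "TGATGAG", "GTAGTGA"
genome = ("AACCCGT" + left + "A" + right + "CTCTCAC"
          + left + "T" + right + "GGCGGAA")
read_len, coverage = 25, 5
clean = [genome[i:i + read_len]
         for i in range(len(genome) - read_len + 1)] * coverage

# A read from the first copy whose error (A -> T at index 14)
# makes it look like the second copy.
truth = genome[:read_len]
bad = truth[:14] + "T" + truth[15:]
reads = clean + [bad]

after_small_k = iterative_correction(reads, [5], min_count=3)[-1]
after_increasing_k = iterative_correction(reads, [5, 13], min_count=3)[-1]
print(after_small_k == bad)         # True: the error survives k = 5
print(after_increasing_k == truth)  # True: fixed once k = 13 is added
```

With k = 5, every k-mer that covers the error also occurs in the second repeat copy, so the read looks clean. With k = 13, some k-mers reach into the unique flanking sequence, become rare, and expose the error.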

Installation

Source (GitHub)

https://github.com/hillerlab/IterativeErrorCorrection

git clone https://github.com/hillerlab/IterativeErrorCorrection.git
cd IterativeErrorCorrection/
python SGA-ICE.py

$ python SGA-ICE.py

usage: SGA-ICE.py [-h] [-k KMERS] [-t THREADS] [--noOvlCorr] [--noCleanup]
                  [--scriptName SCRIPTNAME] [--errorRate ERRORRATE]
                  [--minOverlap MINOVERLAP]
                  inputDir

SGA-ICE produces a shell script that contains all commands to run iterative
error correction of the given read data with the given parameters. Read data
must be in fastq format and files need to have the ending .fastq or .fq.

positional arguments:
  inputDir              Path to directory with the *.fastq or *.fq files. The
                        produced shell script will be located here.

optional arguments:
  -h, --help            show this help message and exit
  -k KMERS, --kmers KMERS
                        List of k-mers for k-mer correction; values should be
                        comma-separated. If -k is not provided, SGA-ICE does 3
                        rounds of k-mer correction with k-mer sizes determined
                        based on the length of the read from the first file in
                        inputDir. We advise the user to choose k-mer values
                        manually if the sequences in the *.fastq files have
                        different read lengths.
  -t THREADS, --threads THREADS
                        Number of threads used. Default is 1. Set to higher
                        values if you have more than one core and want to
                        reduce the runtime.
  --noOvlCorr           If set, do not run a final overlap-based correction
                        round.
  --noCleanup           If set, keep all intermediate files in the temporary
                        directory.
  --scriptName SCRIPTNAME
                        Name of the shell script containing the error
                        correction commands. By default, script is called
                        runMe.sh
  --errorRate ERRORRATE
                        sga correct -e parameter for overlap correction.
                        Maximum error rate allowed between two sequences to
                        consider them overlapped. Default is 0.01
  --minOverlap MINOVERLAP
                        sga correct -m parameter for overlap correction.
                        Minimum overlap required between two reads. Default is

Running

Run the script with the directory that contains the FASTQ files as its argument (files ending in .fq or .fastq are recognized).

SGA-ICE.py /path/to/fastq/data/ -k 40,60,100,125,150,200 --noCleanup --noOvlCorr --scriptName correctMyData.sh

-k List of k-mers for k-mer correction; values should be comma-separated. If -k is not provided, SGA-ICE does 3 rounds of k-mer correction with k-mer sizes determined based on the length of the read from the first file in inputDir. We advise the user to choose k-mer values manually if the sequences in the *.fastq files have different read lengths.
--noCleanup If set, keep all intermediate files in the temporary directory.
--noOvlCorr If set, do not run a final overlap-based correction round.
--scriptName Name of the shell script containing the error correction commands. By default, script is called runMe.sh
-t THREADS, --threads THREADS Number of threads used. Default is 1. Set to higher values if you have more than one core and want to reduce the runtime.

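With the command above, k-mer correction is repeated six times with increasing k-mer size. The command itself only generates the shell script correctMyData.sh; executing that script then performs the error correction automatically.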
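The two steps (generate the script, then execute it) can also be driven from Python. This is a sketch under several assumptions that the source does not confirm: SGA-ICE.py sits in the current working directory and is started with `python`, `bash` and the `sga` binary are available on the PATH, and the generated script can be started from any working directory. The helper names `build_sga_ice_command` and `run_correction` are invented for this example. Only options listed in the help output are used, and the script is looked up in the input directory because the help text says it is written there.

```python
import subprocess
from pathlib import Path


def build_sga_ice_command(input_dir, kmers, script_name="runMe.sh",
                          threads=1, overlap_correction=True, cleanup=True):
    """Assemble the SGA-ICE.py command line from the documented options."""
    cmd = ["python", "SGA-ICE.py", str(input_dir),
           "-k", ",".join(str(k) for k in kmers),
           "-t", str(threads),
           "--scriptName", script_name]
    if not overlap_correction:
        cmd.append("--noOvlCorr")
    if not cleanup:
        cmd.append("--noCleanup")
    return cmd


def run_correction(input_dir, kmers, script_name="runMe.sh", **options):
    """Step 1: generate the shell script. Step 2: execute it."""
    subprocess.run(
        build_sga_ice_command(input_dir, kmers, script_name, **options),
        check=True)
    # Per the help text, the script is written into the input directory.
    script = Path(input_dir) / script_name
    subprocess.run(["bash", str(script)], check=True)


# Example call mirroring the command above (not executed here):
# run_correction("/path/to/fastq/data/", [40, 60, 100, 125, 150, 200],
#                script_name="correctMyData.sh",
#                overlap_correction=False, cleanup=False)
```

`check=True` makes the call stop with an exception if either step exits with a non-zero status, so a failed script generation is not followed by an attempt to run a missing script.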

Citation

Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly.

Sameith K, Roscito JG, Hiller M.

Briefings in Bioinformatics. 2017 Jan;18(1):1-8.