HG-CoLoR: a hybrid error correction tool for error-prone long reads

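2019 2/9 Title corrected

2019 5/24 Added notes on conda installation; parameters updated to follow HG-CoLoR option changes

2019 7/22 Typos and commands corrected

2019 7/23 Title corrected; confusing command fixed

2020 3/2 Commands updated

2020 3/9 Installation steps corrected

2020 6/10, 6/11 Supplementary explanation added

Recent long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore make it possible to solve assembly problems for genomes that are larger and more complex than short-read technologies could handle. However, these long reads are very noisy, with error rates of about 10-15% for Pacific Biosciences and up to 30% for Oxford Nanopore. Error correction has been tackled either by self-correcting the long reads or by hybrid methods that complement them with short reads, but most methods focus only on Pacific Biosciences data and do not apply to Oxford Nanopore reads. Moreover, although recent Oxford Nanopore chemistries promise to lower the error rate below 15%, in practice it remains higher (at the time the paper was written), and correcting such noisy long reads is still a challenge.
HG-CoLoR is a hybrid error correction method built around a seed-and-extend approach, based on aligning short reads to the long reads and traversing a variable-order de Bruijn graph. The authors' experiments show that HG-CoLoR efficiently corrects Oxford Nanopore long reads with error rates as high as 44%. Comparing it with other state-of-the-art long-read error correction methods, the authors show that HG-CoLoR offers the best trade-off between runtime and quality of results, and that it is the only method that scales efficiently to eukaryotic genomes.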

Installation

Installed on Ubuntu 16.04.

Dependencies

A Linux based operating system.
Python3.
Emboss binaries accessible through your PATH environment variable (http://emboss.sourceforge.net/download/).
KMC3 binaries accessible through your PATH environment variable (https://github.com/refresh-bio/KMC).
QuorUM binary accessible through your PATH environment variable (https://github.com/gmarcais/Quorum).

# Not needed if HG-CoLoR itself is installed with conda
conda install -c bioconda -y emboss kmc QuorUM

HG-CoLoR itself (GitHub)

# In an Anaconda environment, HG-CoLoR can be installed with conda together with its dependencies such as KMC (Linux only)
conda install -c bioconda hg-color

#from source
git clone https://github.com/pierre-morisse/HG-CoLoR
cd HG-CoLoR/
git submodule init 
git submodule update
cd KMC/ && make -j 
cd ../PgSA/ && make build CONF=pgsalib 
cd .. && make

> ./HG-CoLoR --help

# HG-CoLoR --help

Usage: /root/.pyenv/versions/miniconda3-4.3.14/bin/HG-CoLoR [options] --longreads LR.fasta --shortreads SR.fastq --out result.fasta --tmpdir tmp_directory

Note: HG-CoLoR default parameters are adapted for a 50x coverage set of short reads with a 1% error rate.

Please modify the parameters, in particular the --solid and --bestn ones, as indicated below if using a set of short reads with a much higher coverage and/or a highly different error rate.

Input:

LR.fasta: fasta file of long reads, one sequence per line.

SR.fastq: fastq file of short reads.

Warning: only one file must be provided.

If using paired reads, please concatenate them into one single file.

It is recommended to run HG-CoLoR with a 50x coverage of short reads.

Results quality tends to drop with a higher coverage.

result.fasta: fasta file where to output the corrected long reads.

tmp_directory: directory where to store the temporary files.

Options:

--kmer: k-mer size for the graph construction (default: 64).

--solid: Minimum number of occurrences to consider a k-mer as solid, after short reads correction (default: 5).

This parameter should be raised accordingly to the short reads coverage and accuracy.

Its default value is adapted for a 50x coverage of short reads with a 1% error rate.

--seedsoverlap: Minimum overlap length to allow the merging of two overlapping seeds (default: k-1).

--minoverlap: Minimum overlap length to allow the exploration of an edge of the graph (default: k-5).

--backtracks: Maximum number of backtracks (default: 1,000).

Raising this parameter will result in less fragmented corrected long reads.

However, it will also increase the runtime, and may create chimeric linkings between the seeds.

--seedskips: Maximum number of seed skips (default: 5).

--bestn: Top alignments to be reported by BLASR (default: 30).

This parameter should be raised accordingly to the short reads coverage.

Its default value is adapted for a 50x coverage of short reads.

--kmcmem: Maximum amount of RAM for KMC, in GB (default: 12).

--nproc: Number of processes to run in parallel (default: number of cores).

--help: Print this help message.
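
The help text above says the defaults assume about 50x short-read coverage, and that --solid and --bestn should be raised for much higher coverage. To decide which case applies, it helps to estimate the coverage first. The following is only a minimal sketch: the function names estimate_coverage and fraction_for_target are hypothetical helpers (not part of HG-CoLoR), and it assumes an uncompressed FASTQ with four lines per record and a genome size you already know.

```shell
# Sketch only: helper names are hypothetical, not part of HG-CoLoR.
# Assumes an uncompressed FASTQ with exactly 4 lines per record.

# estimate_coverage <reads.fq> <genome_size_bp>
# Sums the lengths of all sequence lines and divides by the genome size.
estimate_coverage() {
    awk -v g="$2" 'NR % 4 == 2 { bases += length($0) }
                   END { printf "%.1f\n", bases / g }' "$1"
}

# fraction_for_target <coverage> [target_coverage, default 50]
# Prints the fraction of reads to keep to reach the target (capped at 1).
fraction_for_target() {
    awk -v c="$1" -v t="${2:-50}" \
        'BEGIN { f = t / c; if (f > 1) f = 1; printf "%.3f\n", f }'
}

# Example (file name and genome size are placeholders):
#   cov=$(estimate_coverage paired.fq 4600000)
#   fraction_for_target "$cov"
```

If the estimated coverage is far above 50x, the printed fraction could be given to a subsampling tool (for example seqtk sample), or --solid and --bestn could be raised instead, as the help message suggests.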

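Usage

Run HG-CoLoR by specifying the short reads and the long reads. Paired-end short reads must be concatenated beforehand and passed as a single file. FASTQ files must be decompressed before use.

# Merge the paired-end FASTQ files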
seqtk mergepe R1.fq R2.fq > paired.fq
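
If the FASTQ files are still gzip-compressed, decompression and concatenation can be done in one step. This is a rough sketch under the assumption that plain concatenation is enough, since the help message only asks for the paired reads to be concatenated into a single file. The function names concat_fastq and count_fastq_records are hypothetical helpers, and the file names in the example are placeholders.

```shell
# Sketch only: helper names are hypothetical, not part of HG-CoLoR or seqtk.

# concat_fastq <out.fq> <in1> [in2 ...]
# Writes all inputs into one uncompressed FASTQ; *.gz inputs are decompressed.
concat_fastq() {
    local out="$1"; shift
    : > "$out"                          # create or truncate the output file
    local f
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" >> "$out" ;;
            *)    cat "$f" >> "$out" ;;
        esac
    done
}

# count_fastq_records <file.fq>
# Prints the number of records, assuming 4 lines per record.
# A non-integer result means the file is truncated or malformed.
count_fastq_records() {
    awk 'END { print NR / 4 }' "$1"
}

# Example (placeholder file names):
#   concat_fastq paired.fq R1.fq.gz R2.fq.gz
#   count_fastq_records paired.fq
```

Unlike seqtk mergepe, this does not interleave the two mates; it simply appends the second file after the first.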

# Run HG-CoLoR. Depending on the version, specify "--kmer 64" instead of "-K 64".
HG-CoLoR --longreads long_reads.fasta --shortreads paired.fq \
 --out HG-color_result.fasta --tmpdir tmp -K 64

--shortreads fastq file of short reads. Warning: only one file must be provided. If using paired reads, please concatenate them into one single file.
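
The help message also describes the long-read input as a "fasta file of long reads, one sequence per line". If the long-read FASTA wraps sequences over several lines, it can be unwrapped beforehand. The following is a minimal sketch, assuming a standard FASTA in which header lines start with ">"; linearize_fasta is a hypothetical helper name and the file names in the example are placeholders.

```shell
# Sketch only: linearize_fasta is a hypothetical helper, not part of HG-CoLoR.

# linearize_fasta <in.fasta>
# Prints each record as one header line followed by ONE sequence line.
linearize_fasta() {
    awk '/^>/ { if (seq != "") print seq   # flush the previous sequence
                print                      # print the header as-is
                seq = ""
                next }
         { seq = seq $0 }                  # join wrapped sequence lines
         END { if (seq != "") print seq }' "$1"
}

# Example (placeholder file names):
#   linearize_fasta raw_long_reads.fasta > long_reads.fasta
```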

Citation
Hybrid correction of highly noisy long reads using a variable-order de Bruijn graph
Morisse P, Lecroq T, Lefebvre A

Bioinformatics. 2018 Dec 15;34(24):4213-4222