HERRO: haplotype-aware error correction for long reads

2024/04/19 Title corrected

2024/08/05 Citation error corrected

Note: the paper title says HERO, but the repository uses HERRO. This post uses HERRO throughout.

Addendum

I had confused HERO with HERRO. Many thanks to the commenter who pointed this out. This post introduces Haplotype-aware ERRor cOrrection: HERRO (GitHub).

Ready for >Q30 ONT reads? Herro error correction now supports both R9.4.1 and R10.4.1. simplex reads w/ @domstanojevic @DehuiLin (CHM13 -R9.4.1) #nanopore #longreads pic.twitter.com/LgijFkgxsq
- Mile Sikic (@msikic), April 19, 2024

Installation

Dependencies

Linux OS (tested on RHEL 8.6 and Ubuntu 22.04)
Zstandard
Python (and conda) for data preprocessing

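Note: HERRO uses a lot of GPU VRAM. For datasets such as a human genome, multiple GPUs with 80 GB of VRAM each are recommended.

Following the repository instructions, I cloned the repository and installed the dependencies (needed for preprocessing steps 1 and 2), then built the Singularity image. Finally, I downloaded the model and ran HERRO. Note that it does not work with old Singularity versions; I used v3.9.5.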
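As a rough aid for the version caveat above, the following sketch compares the installed Singularity version with the one used in this post. It assumes GNU `sort -V` is available and that `singularity --version` prints the version as its last field. 3.9.5 is only the version that worked here, not a documented minimum, and `version_ge` is a helper name made up for this example.

```bash
#!/usr/bin/env bash
# Returns success (0) if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort), which is assumed to be available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# 3.9.5 is only the version used in this post, not a documented minimum.
tested_version="3.9.5"

if command -v singularity >/dev/null 2>&1; then
    # Assumption: the version string is the last whitespace-separated field,
    # e.g. "singularity version 3.9.5".
    installed_version="$(singularity --version | awk '{print $NF}')"
    if version_ge "$installed_version" "$tested_version"; then
        echo "Singularity $installed_version: at least as new as $tested_version"
    else
        echo "Singularity $installed_version is older than $tested_version; HERRO may not run"
    fi
else
    echo "singularity not found in PATH"
fi
```

If `singularity` is provided by Apptainer, the version numbering differs and this comparison is not meaningful.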

GitHub

https://github.com/lbcb-sci/herro

#1
git clone https://github.com/dominikstanojevic/herro.git
cd herro
mamba env create --file scripts/herro-env.yml

#2 Build the Singularity image
sudo singularity build herro.sif herro-singularity.def
# Or download the prebuilt image (3.5 GB)
wget http://complex.zesoi.fer.hr/data/downloads/herro.sif

> scripts/preprocess.sh

Please place porechop_with_split.sh and no_split.sh in the same directory as this script.

This script requires 4 arguments:

1. The input sequence file. e.g. input.fastq

2. The output prefix. e.g. 'preprocessed' or 'output_dir/preprocessed'

3. The number of threads to be used.

4. The number of parts to split the inputs into for porechop (since RAM usage may be high).

> scripts/create_batched_alignments.sh

Please place batch.py in the same directory as this script.

This script requires 4 arguments:

1. The path to the preprocessed reads.

2. The path to the read ids of these reads e.g. from seqkit seq -n -i.

3. The number of threads to be used.

4. The directory to output the batches of alignments.

> singularity run --nv herro.sif inference -h

Subcommand used for error-correcting reads

Usage: herro inference [OPTIONS] -m <MODEL> -b <BATCH_SIZE> <READS> <OUTPUT>

Arguments:

<READS> Path to the fastq reads (can be gzipped)

<OUTPUT> Path to the corrected reads

Options:

--read-alns <READ_ALNS> Path to the folder containing *.oec.zst alignments

--write-alns <WRITE_ALNS> Path to the folder where *.oec.zst alignments will be saved

-w <WINDOW_SIZE> Size of the window used for target chunking (default 4096) [default: 4096]

-t <FEAT_GEN_THREADS> Number of feature generation threads per device (default 1) [default: 1]

-m <MODEL> Path to the model file

-d <DEVICES> List of cuda devices in format d0,d1... (e.g 0,1,3) (default 0) [default: 0]

-b <BATCH_SIZE> Batch size per device. B=64 recommended for 40 GB GPU cards.

-h, --help Print help

Downloading the model

wget http://complex.zesoi.fer.hr/data/downloads/model_v0.1.pt

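Usage

HERRO consists of multiple stages. Steps 1 and 2 are preprocessing steps run in the conda environment created above (they do not use HERRO itself). Step 3, the herro inference command, uses the Singularity image.

1. Preprocess reads

Preprocess the ONT reads with Porechop. Here 10 threads are specified. The last argument is the number of parts into which the reads are temporarily split to reduce memory usage. If splitting is not needed, set <parts_to_split_job_into> to 1, at the cost of higher memory usage. Splitting is recommended for large ONT datasets.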

conda activate herro
scripts/preprocess.sh input_fastq out_prefix 10 <parts_to_split_job_into>

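Because Dorado v0.5 added adapter trimming, adapter trimming and splitting with Porechop and duplex tools will probably become unnecessary in the future (from the repository).

The output is the prefix followed by .fastq.gz, written as outprefix.fastq.gz below.

2. minimap2 alignment and batching

Specify the long reads output by step 1. The read ID file (reads.id) can be generated with seqkit. Here 20 threads are specified.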

#1 seqkit seq
seqkit seq -ni outprefix.fastq.gz > reads.id

#2 minimap2 alignment and batching
scripts/create_batched_alignments.sh outprefix.fastq.gz reads.id 20 outdir

The outdir/ directory is passed to step 3.
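Before moving on, it can be worth confirming that reads.id really lists one ID per read in the preprocessed file. The sketch below is an optional sanity check and not part of HERRO. It assumes a standard FASTQ with four lines per record, and the function names are made up for this example.

```bash
#!/usr/bin/env bash
# Count records in a gzipped FASTQ, assuming the standard 4-lines-per-record layout.
count_fastq_records() {
    gzip -cd "$1" | awk 'END { print int(NR / 4) }'
}

# Count non-empty lines in the read ID file.
count_ids() {
    awk 'NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Compare the two counts; return non-zero on mismatch.
check_read_ids() {
    local reads="$1" ids="$2" n_reads n_ids
    n_reads="$(count_fastq_records "$reads")"
    n_ids="$(count_ids "$ids")"
    if [ "$n_reads" -eq "$n_ids" ]; then
        echo "OK: $n_reads reads, $n_ids IDs"
    else
        echo "MISMATCH: $n_reads reads, $n_ids IDs" >&2
        return 1
    fi
}

# Usage (file names as in the steps above):
# check_read_ids outprefix.fastq.gz reads.id
```

A mismatch would suggest that reads.id was generated from a different file than the one passed to create_batched_alignments.sh.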

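3. Error-correction

Use the Singularity image. Specify the model file, the batch size, the reads, and the output. The recommended batch size is 64 for GPUs with 40 GB of VRAM (32 GB is also possible) and 128 for GPUs with 80 GB of VRAM.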

singularity run --nv --bind <host_path>:<dest_path> herro.sif inference --read-alns outdir -t <feat_gen_threads_per_device> -d <gpus> -m model_v0.1.pt -b <batch_size> <preprocessed_reads> <fasta_output>
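The batch size recommendation can be written as a small lookup. This is a minimal sketch, not something HERRO provides: `suggest_batch_size` is a hypothetical helper, the VRAM figure (in GB, per GPU) is supplied by the user, and treating everything from 32 GB up to just under 80 GB as "64" is an assumption that goes beyond the two data points quoted above.

```bash
#!/usr/bin/env bash
# Map per-GPU VRAM (in GB) to a batch size, following the recommendations quoted above:
# 64 for 40 GB cards (32 GB is reported to work as well), 128 for 80 GB cards.
# Sizes below 32 GB are not covered by the source, so the function refuses to guess.
suggest_batch_size() {
    local vram_gb="$1"
    if [ "$vram_gb" -ge 80 ]; then
        echo 128
    elif [ "$vram_gb" -ge 32 ]; then
        echo 64
    else
        echo "no recommendation for ${vram_gb} GB" >&2
        return 1
    fi
}

suggest_batch_size 80   # prints 128
suggest_batch_size 40   # prints 64
```

The result would be passed as the -b value; if inference runs out of GPU memory, lowering it is the obvious first thing to try.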

(From the repository) GPUs are specified by ID. For example, if -d is set to 0,1,3, herro uses the first, second, and fourth GPU cards. The -t parameter applies per device: for example, if -t is 8 and three GPUs are used, herro creates 24 feature generation threads in total.
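The thread arithmetic is easy to check before launching a job, for example to avoid asking for more feature generation threads than the machine has cores. The helpers below are illustrative only (the names are not part of HERRO) and simply multiply the -t value by the number of IDs in the -d list.

```bash
#!/usr/bin/env bash
# Count the GPU IDs in a comma-separated -d value such as "0,1,3".
count_devices() {
    local IFS=','
    # Unquoted expansion is intentional: split the list on commas.
    set -- $1
    echo "$#"
}

# Total feature generation threads = (-t value) x (number of devices).
total_feat_threads() {
    local devices="$1" threads_per_device="$2"
    echo $(( threads_per_device * $(count_devices "$devices") ))
}

total_feat_threads "0,1,3" 8   # the example from the text; prints 24
```

Comparing the result with the output of `nproc` gives a quick idea of whether the chosen -t value fits the host.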

Citation

https://github.com/lbcb-sci/herro

Reference

https://nanoporetech.com/ja/resource-centre/herro-haplotype-aware-error-correction-of-ultra-long-nanopore-reads