VeChat: haplotype-aware error correction for long reads

2022/04/17: installation steps revised

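Error correction is the standard first step in the analysis of long-read sequencing data. The current standard approach uses a consensus sequence as the template. In mixed samples such as metagenomes or organisms of higher ploidy, however, the bias introduced by the consensus can cause lower-frequency haplotypes to be mistaken for errors and masked. The novelty presented in this paper is to use a graph-based consensus, rather than a sequence-based one, as the template for identifying errors. The advantage is that a graph-based reference system also captures low-frequency variants, so they are not mistakenly masked as errors. The authors present VeChat as a new approach that implements this idea. VeChat is based on variation graphs and distinguishes true haplotype-specific variants from errors. After an ad hoc variation graph is first constructed from the raw input reads, an iterative procedure based on the principles of frequent itemset mining prunes the nodes and edges that stem from errors. As a result, only nodes and edges that reflect true sequence phenomena remain in the graph. Finally, re-aligning the raw reads against the pruned graph indicates where and how they need to be corrected. Extensive benchmarking experiments demonstrate that reads corrected with VeChat contain 4 to 15 times (PacBio) and 2 to 10 times (ONT) fewer errors than reads corrected with existing methods. VeChat is implemented as an easy-to-use open-source tool and is publicly available at https://github.com/HaploKit/vechat.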
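To give a feel for what pruning by support and confidence can look like, here is a toy sketch. It is not VeChat's code, and the definitions are assumptions borrowed from generic association-rule mining: each input line is one read written as a path of node IDs, the support of an edge is the number of times it is traversed divided by the number of reads, and its confidence is the same count divided by the number of edge traversals leaving its source node. The helper name `prune_edges` is hypothetical.

```shell
# Toy illustration only: generic support/confidence filtering of graph edges.
# This is NOT VeChat's implementation; the definitions below are assumptions.
# $1 = minimum support, $2 = minimum confidence; read paths come from stdin,
# one read per line, as space-separated node IDs.
prune_edges() {
  awk -v min_sup="$1" -v min_conf="$2" '
    {
      n_reads++
      for (i = 1; i < NF; i++) {
        edge[$i " " $(i + 1)]++   # times edge u -> v is traversed
        out[$i]++                 # times any edge leaves node u
      }
    }
    END {
      for (e in edge) {
        split(e, uv, " ")
        support = edge[e] / n_reads
        confidence = edge[e] / out[uv[1]]
        # keep only edges that pass both thresholds
        if (support >= min_sup && confidence >= min_conf)
          printf "%s\t%.2f\t%.2f\n", e, support, confidence
      }
    }
  ' | sort
}
```

For example, piping four reads `A B C` and one read `A X C` into `prune_edges 0.3 0.3` keeps the edges `A B` and `B C` and drops the two edges through `X`, which are supported by only one read in five.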

Installation

Following the repository's instructions, I created a conda environment, installed the dependencies, and built vechat (Ubuntu 18).

GitHub

# create the environment and install yacrd, minimap2, and fpa
mamba create -n vechat python=3.9 -y
conda activate vechat
mamba install -c bioconda minimap2 yacrd fpa=0.5 -y
# vechat itself
git clone https://github.com/HaploKit/vechat.git
cd vechat
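mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Dspoa_optimize_for_portability=ON ..
make
# add the scripts directory to PATH as an absolute path so vechat is found from any directory
export PATH="$(cd ../scripts && pwd):$PATH"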
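After the build, it can help to confirm that everything is reachable from PATH before running anything. The sketch below defines a small hypothetical helper, `check_tools`, which is not part of vechat. It assumes the executables carry the same names as the packages installed above (minimap2, yacrd, fpa), and that has not been verified here.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent command and
# returns non-zero if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (executable names are assumed to match the package names):
# check_tools minimap2 yacrd fpa vechat
```

If the call prints nothing and returns 0, all four commands were found in the current shell.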

# conda (not yet available as of 2/3)
mamba create -n vechat -y
conda activate vechat
mamba install -c bioconda vechat -y

> vechat

usage: vechat [-h] [-o OUTFILE] [--platform PLATFORM] [--split] [--split-size SPLIT_SIZE] [--scrub] [-u] [--base] [--min-identity MIN_IDENTITY] [--linear] [-d MIN_CONFIDENCE] [-s MIN_SUPPORT] [--min-ovlplen-cns MIN_OVLPLEN_CNS]
              [--min-identity-cns MIN_IDENTITY_CNS] [-w WINDOW_LENGTH] [-q QUALITY_THRESHOLD] [-e ERROR_THRESHOLD] [-t THREADS] [-m MATCH] [-x MISMATCH] [-g GAP] [--cudaaligner-batches CUDAALIGNER_BATCHES] [-c CUDAPOA_BATCHES] [-b]
              sequences
vechat: error: the following arguments are required: sequences

(base) kazu@kazu:~/Documents/vechat/build$ vechat -h

Haplotype-aware Error Correction for Noisy Long Reads Using Variation Graphs

positional arguments:
  sequences             input file in FASTA/FASTQ format (can be compressed with gzip) containing sequences used for correction

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        output file (default: reads.corrected.fa)
  --platform PLATFORM   sequencing platform: pb/ont (default: pb)
  --split               split target sequences into chunks (recommend for FASTQ > 20G or FASTA > 10G) (default: False)
  --split-size SPLIT_SIZE
                        split target sequences into chunks of desired size in lines, only valid when using --split (default: 1000000)
  --scrub               scrub chimeric reads (default: False)
  -u, --include-unpolished
                        output unpolished target sequences (default: False)
  --base                perform base level alignment when computing read overlaps in the first iteration (default: False)
  --min-identity MIN_IDENTITY
                        minimum identity used for filtering overlaps, only works combined with --base (default: 0.8)
  --linear              perform linear based fragment correction rather than variation graph based fragment correction (default: False)
  -d MIN_CONFIDENCE, --min-confidence MIN_CONFIDENCE
                        minimum confidence for keeping edges in the graph (default: 0.2)
  -s MIN_SUPPORT, --min-support MIN_SUPPORT
                        minimum support for keeping edges in the graph (default: 0.2)
  --min-ovlplen-cns MIN_OVLPLEN_CNS
                        minimum read overlap length in the consensus round (default: 1000)
  --min-identity-cns MIN_IDENTITY_CNS
                        minimum sequence identity between read overlaps in the consensus round (default: 0.99)
  -w WINDOW_LENGTH, --window-length WINDOW_LENGTH
                        size of window on which POA is performed (default: 500)
  -q QUALITY_THRESHOLD, --quality-threshold QUALITY_THRESHOLD
                        threshold for average base quality of windows used in POA (default: 10.0)
  -e ERROR_THRESHOLD, --error-threshold ERROR_THRESHOLD
                        maximum allowed error rate used for filtering overlaps (default: 0.3)
  -t THREADS, --threads THREADS
                        number of threads (default: 1)
  -m MATCH, --match MATCH
                        score for matching bases (default: 5)
  -x MISMATCH, --mismatch MISMATCH
                        score for mismatching bases (default: -4)
  -g GAP, --gap GAP     gap penalty (must be negative) (default: -8)
  --cudaaligner-batches CUDAALIGNER_BATCHES
                        number of batches for CUDA accelerated alignment (default: 0)
  -c CUDAPOA_BATCHES, --cudapoa-batches CUDAPOA_BATCHES
                        number of batches for CUDA accelerated polishing (default: 0)
  -b, --cuda-banded-alignment
                        use banding approximation for polishing on GPU. Only applicable when -c is used. (default: False)
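The help text recommends `--split` for large inputs (FASTQ > 20G or FASTA > 10G). As a rough sketch, a small hypothetical helper, `split_flag`, can decide whether to pass the flag based on the on-disk file size. It is not part of vechat. The help does not say whether the thresholds refer to compressed or uncompressed size, so the byte threshold used here is an assumption to adjust as needed.

```shell
# Print "--split" if the file is larger than a threshold in bytes, else print nothing.
# $1 = input file, $2 = threshold in bytes (hypothetical helper, not part of vechat)
split_flag() {
  size=$(wc -c < "$1" | tr -d ' ')
  if [ "$size" -gt "$2" ]; then
    echo "--split"
  fi
}

# Example call (20000000000 bytes as a stand-in for "20G" is an assumption):
# vechat reads.fq.gz $(split_flag reads.fq.gz 20000000000) -t 20 -o reads.corrected.fa
```

The command substitution is left unquoted on purpose, so that an empty result adds no argument to the vechat command line.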

Test run

Error correction of CLR reads

cd vechat/example/
vechat reads.fq.gz -t 20 --platform pb -o reads.corrected.fa

The corrected reads are written to reads.corrected.fa.
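A quick sanity check on the result is to compare the number of reads before and after correction. The two helpers below are hypothetical and not part of vechat. They assume an uncompressed FASTA output and a gzip-compressed FASTQ input with exactly four lines per record.

```shell
# Count FASTA records (header lines starting with ">") in an uncompressed FASTA file.
count_fasta_records() {
  grep -c '^>' "$1" || true   # grep exits non-zero on 0 matches but still prints 0
}

# Count FASTQ records in a gzip-compressed file, assuming 4 lines per record.
count_fastq_gz_records() {
  gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}

# Example calls (file names taken from the test run above):
# count_fastq_gz_records reads.fq.gz
# count_fasta_records reads.corrected.fa
```

According to the help text, unpolished sequences are not written unless `-u` is given, so the corrected file may contain fewer records than the input.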

Error correction of ONT reads

vechat reads.fq.gz -t 20 --platform ont -o reads.corrected.fa

Citation

VeChat: Correcting errors in long reads using variation graphs
Xiao Luo, Xiongbin Kang, Alexander Schönhuth

bioRxiv, Posted February 01, 2022

Update

VeChat: correcting errors in long reads using variation graphs

Xiao Luo, Xiongbin Kang & Alexander Schönhuth

Nature Communications volume 13, Article number: 6657 (2022)
