Nanopolishのcall-methylationおよびeventalignモジュールを最適化して再実装した f5c

　ナノポアシーケンスにより、ポイントオブケア診断や現場でのジェノタイピングなど、携帯可能なリアルタイムシーケンスアプリケーションが可能になる。このような成果を得るためには、生のナノポアシグナルデータを解析するための効率的なバイオインフォマティクスアルゴリズムが必要になる。しかし、生のナノポアシグナルを生物学的参照配列と比較することは、計算上複雑なタスクである。ABEA（Adaptive Banded Event Alignment）と呼ばれる動的プログラミングアルゴリズムは、シーケンスデータを研磨し、DNAメチル化の測定など、非標準のヌクレオチドを特定するのに重要なステップである。ここでは、ABEAアルゴリズムの実装（f5cと呼ぶ）を並列化し、ヘテロジニアスCPU-GPU アーキテクチャ上で効率的に動作するように最適化した。
メモリ、計算、CPUとGPU間の負荷分散を最適化することにより、f5cはNanopolishソフトウェアパッケージのオリジナルのCPUのみのABEA実装を最適化したバージョンよりも約3-5倍高速に動作することを実証した。また、GPUを搭載した組込みSoC（System on Chip）を用いて、f5cがオンザフライでDNAメチル化検出を可能にすることも示した。
本研究は、複雑なゲノム解析が軽量なコンピューティングシステムで実行できることを示すだけでなく、高性能コンピューティング（HPC）にも貢献するものである。f5cのソースコードとGPU最適化ABEAは、https://github.com/hasindu2008/f5c で公開されている。

f5c-v1.1 is released and features a new module called resquiggle that aligns raw-signals to basecalled reads [contributed by @hiruna72].https://t.co/MPOmKjfxfT
— Hasindu Gamaarachchi (@Hasindu2008) September 10, 2022

Full Documentation

https://hasindu2008.github.io/f5c/docs/overview

Commnad reference

https://hasindu2008.github.io/f5c/docs/commands

インストール

ubuntu18にてCPU版のバイナリを試した。

Github

VERSION=v1.1
wget "https://github.com/hasindu2008/f5c/releases/download/$VERSION/f5c-$VERSION-binaries.tar.gz" && tar xvf f5c-$VERSION-binaries.tar.gz && cd f5c-$VERSION/

> ./f5c_x86_64_linux -h

Usage: f5c <command> [options]

command:

index Build an index for accessing the base sequence and raw signal for a given read ID (optimised nanopolish index)

call-methylation Classify nucleotides as methylated or not (optimised nanopolish call-methylation)

meth-freq Calculate methylation frequency at genomic CpG sites (optimised nanopolish calculate_methylation_frequency.py)

eventalign Align nanopore events to reference k-mers (optimised nanopolish eventalign)

freq-merge Merge calculated methylation frequency tsv files

resquiggle Align raw signals to basecalled reads

See the manual page for details (`man ./docs/f5c.1' or https://f5c.page.link/man).

> ./f5c_x86_64_linux index

[f5c index] not enough arguments

Usage: f5c index [OPTIONS] -d fast5_directory reads.fastq

f5c index [OPTIONS] --slow5 signals.blow5 reads.fastq

Build an index for accessing the base sequence (fastq/fasta) and raw signal (fast5/slow5) for a given read ID. f5c index is an extended and optimised version of nanopolish index by Jared Simpson

-h display this help and exit

-d STR path to the directory containing fast5 files. This option can be given multiple times.

-s STR the sequencing summary file

-f STR file containing the paths to the sequencing summary files (one per line)

-t INT number of threads used for bgzf compression (makes indexing faster)

--iop INT number of I/O processes to read fast5 files (makes fast5 indexing faster)

--slow5 FILE slow5 file containing raw signals

--skip-slow5-idx do not build the .idx for the slow5 file (useful when a slow5 index is already available)

--verbose INT verbosity level

--version print version

See the manual page for details (`man ./docs/f5c.1' or https://f5c.page.link/man).

> ./f5c_x86_64_linux call-methylation

[meth_main::INFO] Default methylation tsv output format is changed from f5c v0.7 onwards to match latest nanopolish output. Set --meth-out-version=1 to fall back to the old format.

Usage: f5c call-methylation [OPTIONS] -r reads.fa -b alignments.bam -g genome.fa

f5c call-methylation [OPTIONS] -r reads.fa -b alignments.bam -g genome.fa --slow5 signals.blow5

basic options:

-r FILE fastq/fasta read file

-b FILE sorted bam file

-g FILE reference genome

-w STR limit processing to a genomic region specified as chr:start-end or a list of regions in a .bed file

-t INT number of processing threads [8]

-K INT batch size (max number of reads loaded at once) [512]

-B FLOAT[K/M/G] max number of bases loaded at once [5.0M]

-h help

-o FILE output to file [stdout]

-x STR parameter profile to be used for better performance (always applied before other options)

e.g., laptop, desktop, hpc; see https://f5c.page.link/profiles for the full list

--iop INT number of I/O processes to read fast5 files [1]

--slow5 FILE read from a slow5 file

--min-mapq INT minimum mapping quality [20]

--secondary=yes|no consider secondary mappings or not [no]

--verbose INT verbosity level [0]

--version print version

advanced options:

--skip-ultra FILE skip ultra long reads and write those entries to the bam file provided as the argument

--ultra-thresh INT threshold to skip ultra long reads [100000]

--skip-unreadable=yes|no skip any unreadable fast5 or terminate program [yes]

--kmer-model FILE custom nucleotide k-mer model file (format similar to test/r9-models/r9.4_450bps.nucleotide.6mer.template.model)

--meth-model FILE custom methylation k-mer model file (format similar to test/r9-models/r9.4_450bps.cpg.6mer.template.model)

--meth-out-version INT methylation tsv output version (set 2 to print the strand column) [2]

--min-recalib-events INT minimum number of events to recalbrate (decrease if your reads are very short and could not calibrate) [200]

developer options:

--print-events=yes|no prints the event table

--print-banded-aln=yes|no prints the event alignment

--print-scaling=yes|no prints the estimated scalings

--print-raw=yes|no prints the raw signal

--debug-break INT break after processing the specified no. of batches

--profile-cpu=yes|no process section by section (used for profiling on CPU)

--write-dump=yes|no write the fast5 dump to a file or not

--read-dump=yes|no read from a fast5 dump file or not

See the manual page for details (`man ./docs/f5c.1' or https://f5c.page.link/man).

テストラン

https://hasindu2008.github.io/f5c/docs/example-usageに従う。公式チュートリアルではマッピングのコマンドや計算資源がリッチな環境でのコマンド例も書かれている。

１，Download

wget -O f5c_na12878_test.tgz "https://f5c.page.link/f5c_na12878_test"
tar xf f5c_na12878_test.tgz

２、indexingとmethylation-call

index作成には、fast5（かもしくはslow5 format）のファイルを含むディレクトリとbasecallされたfastqを指定する。次のmethylation-callでは、basecallされたfastqをリファレンスにマッピングして得たbamファイル、fasta形式のリファレンス配列、basecallされたfastqを指定する。

#1 indexing (nanopolish indexの出力と同等)
f5c index -d chr22_meth_example/fast5_files chr22_meth_example/reads.fastq

#2 ゲノムCpG部位のメチル化有無の分類を行う（Nanopolishの出力と同等）
f5c call-methylation -b chr22_meth_example/reads.sorted.bam -g chr22_meth_example/humangenome.fa -r chr22_meth_example/reads.fastq > chr22_meth_example/result.tsv

#3 f5c call-methylationで生成されたメチル化コールを含むtsvファイルから、ゲノムCpG部位のメチル化頻度を計算する
f5c meth-freq -i chr22_meth_example/result.tsv > chr22_meth_example/freq.tsv

#4複数のメチル化頻度tsvファイル（f5c meth-freqの出力）がある場合、1つのtsvファイルにマージする
f5c freq-merge

出力

３、event alignment

f5c eventalignは、生のナノポアシグナル（イベント）を参照k-merにアラインさせる。

f5c eventalign -b chr22_meth_example/reads.sorted.bam -g chr22_meth_example/humangenome.fa -r chr22_meth_example/reads.fastq > chr22_meth_example/events.tsv

Nanopolishの出力と同等。

４、f5c resquiggle

f5c resquiggleはrawシグナルをbasecalled readsにアライメントする（現在開発中で、将来のバージョンで出力は変更される可能性がある）。

f5c_x86_64_linux resquiggle chr22_meth_example/reads.fastq chr22_meth_example/reads.blow5

デフォルトの出力はTSV形式

https://hasindu2008.github.io/f5c/docs/output

引用

GPU accelerated adaptive banded event alignment for rapid comparative nanopore signal analysis
Hasindu Gamaarachchi, Chun Wai Lam, Gihan Jayatilaka, Hiruna Samarakoon, Jared T. Simpson, Martin A. Smith & Sri Parameswaran
BMC Bioinformatics volume 21, Article number: 343 (2020)