Helixer: predicting primary eukaryotic gene models by combining deep learning and a hidden Markov model

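Gene structural annotation is a critical step in obtaining biological knowledge from genome sequences, yet it remains a major challenge in genomics projects. Current de novo hidden Markov models are limited in their capacity to model biological complexity, while current pipelines are resource-intensive and their results vary in quality with the available extrinsic data. Here, the authors build on their previous work applying deep learning to gene calling and present a fully applicable, fast and user-friendly tool for predicting primary gene models from DNA sequence alone. The quality is state-of-the-art: by most measures, its predictions score closer to the references than predictions from other de novo tools do. Helixer's predictions can be used as is, or integrated into pipelines to boost quality further. Moreover, there is still potential for further improvements and advancements in gene calling with deep learning. Helixer is open source and available at https://github.com/weberlab-hhu/Helixer. A web interface is available at https://www.plabipd.de/helixer_main.html.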

Web

https://www.plabipd.de/helixer_main.html

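Upload a genome FASTA file. The minimum sequence length per record is 25 kbp and the maximum file size is 1 GByte. The compression formats '.gz', '.zip' and '.bz2' are supported. Then select one of "land plant", "vertebrate", "invertebrate" or "fungi".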
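Before uploading, it can be handy to check locally which records fall below the 25 kbp minimum. The following is a minimal sketch in Python, not part of Helixer: the function names are hypothetical, only plain and gzip-compressed FASTA are handled, and how the web service actually treats short records is not stated on the page.

```python
import gzip
import io

MIN_RECORD_BP = 25_000  # minimum record length stated for the web form


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def record_lengths(handle):
    """Yield (record_id, sequence_length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Record id = first whitespace-delimited token after ">".
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def short_records(handle, min_bp=MIN_RECORD_BP):
    """Return the records that are shorter than min_bp."""
    return [(n, l) for n, l in record_lengths(handle) if l < min_bp]


# Toy input (made-up sequences) standing in for a real genome file.
toy = io.StringIO(">chrA demo\n" + "ACGT" * 7000 + "\n>contigB\nACGT\nAC\n")
print(short_records(toy))  # [('contigB', 6)]
```

For a real file, something like `with open_fasta("genome.fa.gz") as fh: print(short_records(fh))` would list the records to look at before uploading (the file name here is a placeholder).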

Optionally, specify a label to be used as the prefix in the GFF3 gene annotation output. If you also enter an e-mail address, a link to the GFF3 gene annotation result is sent once the job has finished.

Here I selected the demo, which uses chr8 of A. lyrata.

It takes a while for the results to come back.

Example output

Arabidopsis_lyrata_helixer.gff

Installation

Hardware

For realistically sized datasets, a GPU is needed to get acceptable performance. All of the provided models should run on an Nvidia GPU with 11 GB of memory (e.g. GTX 1080 Ti) or with 8 GB of memory (e.g. GTX 1080).

The following GPU driver versions are known to work with Helixer, although you do not need to install these specific versions:

NVIDIA driver 495
NVIDIA driver 510
NVIDIA driver 525
NVIDIA driver 555

GitHub

git clone https://github.com/weberlab-hhu/Helixer.git
cd Helixer/

#docker
docker pull gglyptodon/helixer-docker:helixer_v0.3.4_cuda_12.2.2-cudnn8

> usage: Helixer.py [-h] [--config-path CONFIG_PATH] [--compression {gzip,lzf}] [--no-multiprocess] [--version] --fasta-path FASTA_PATH --gff-output-path GFF_OUTPUT_PATH [--species SPECIES] [--temporary-dir TEMPORARY_DIR] [--subsequence-length SUBSEQUENCE_LENGTH] [--write-by WRITE_BY] [--lineage {vertebrate,land_plant,fungi,invertebrate}] [--model-filepath MODEL_FILEPATH] [--batch-size BATCH_SIZE] [--no-overlap] [--overlap-offset OVERLAP_OFFSET] [--overlap-core-length OVERLAP_CORE_LENGTH] [--debug] [--window-size WINDOW_SIZE] [--edge-threshold EDGE_THRESHOLD] [--peak-threshold PEAK_THRESHOLD] [--min-coding-length MIN_CODING_LENGTH]

options:

-h, --help show this help message and exit

--version show program's version number and exit

Data input and output:

--config-path CONFIG_PATH

Config in form of a YAML file with lower priority than parameters given on the command line.

--fasta-path FASTA_PATH

FASTA input file.

--gff-output-path GFF_OUTPUT_PATH

Output GFF3 file path.

--species SPECIES Species name.

--temporary-dir TEMPORARY_DIR

use supplied (instead of system default) for temporary directory

Data generation parameters:

--compression {gzip,lzf}

Compression algorithm used for the intermediate .h5 output files with a fixed compression level of 4. (Default is "gzip", which is much slower than "lzf".)

--no-multiprocess Whether to not parallize the numerification of large sequences. Uses half the memory but can be much slower when many CPU cores can be utilized.

--subsequence-length SUBSEQUENCE_LENGTH

How to slice the genomic sequence. Set moderately longer than length of typical genic loci. Tested up to 213840. Must be evenly divisible by the timestep width of the used model, which is typically 9. (Default is lineage dependent from 21384 to 213840).

--write-by WRITE_BY convert genomic sequence in super-chunks to numerical matrices with this many base pairs, which will be rounded to be divisible by subsequence-length; needs to be equal to or larger than subsequence length; for lower memory consumption, consider setting a lower number

--lineage {vertebrate,land_plant,fungi,invertebrate}

What model to use for the annotation.

--model-filepath MODEL_FILEPATH

set this to override the default model for any given lineage and instead take a specific model

--no-overlap Switches off the overlapping after predictions are made. Predictions without overlapping will be faster, but will have lower quality towards the start and end of each subsequence. With this parameter --overlap-offset and --overlap-core-length will have no effect.

Prediction parameters:

--batch-size BATCH_SIZE

The batch size for the raw predictions in TensorFlow. Should be as large as possible on your GPU to save prediction time. (Default is 8.)

--overlap-offset OVERLAP_OFFSET

Offset of the overlap processing. Smaller values may lead to better predictions but will take longer. The subsequence_length should be evenly divisible by this value. (Default is subsequence_length / 2).

--overlap-core-length OVERLAP_CORE_LENGTH

Predicted sequences will be cut to this length to increase prediction quality if overlapping is enabled. Smaller values may lead to better predictions but will take longer. Has to be smaller than subsequence_length (Default is subsequence_length * 3 / 4)

--debug add this to quickly run the code through without loading/predicting on the full file

Post-processing parameters:

--window-size WINDOW_SIZE

width of the sliding window that is assessed for intergenic vs genic (UTR/Coding Sequence/Intron) content

--edge-threshold EDGE_THRESHOLD

threshold specifies the genic score which defines the start/end boundaries of each candidate region within the sliding window

--peak-threshold PEAK_THRESHOLD

threshold specifies the minimum peak genic score required to accept the candidate region; the candidate region is accepted if it contains at least one window with a genic score above this threshold

--min-coding-length MIN_CODING_LENGTH

output is filtered to remove genes with a total coding length shorter than this value
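The help text states a few numeric constraints: --subsequence-length must be divisible by the model's timestep width (typically 9), --overlap-offset should divide the subsequence length (default: half of it), and --overlap-core-length has to be smaller than the subsequence length (default: three quarters of it). The sketch below only restates those rules in Python so a parameter set can be sanity-checked before starting a long run. The function name is hypothetical, it is not Helixer's own validation code, and it treats the "should be evenly divisible" rule as a hard error, which is an assumption.

```python
def check_overlap_settings(subsequence_length, overlap_offset=None,
                           overlap_core_length=None, timestep_width=9):
    """Apply the defaults from the help text and check the stated constraints.

    Returns (overlap_offset, overlap_core_length); raises ValueError otherwise.
    """
    if subsequence_length <= 0 or subsequence_length % timestep_width != 0:
        raise ValueError(
            f"subsequence length {subsequence_length} is not a positive "
            f"multiple of the timestep width {timestep_width}")

    # Defaults quoted in the help text.
    if overlap_offset is None:
        overlap_offset = subsequence_length // 2
    if overlap_core_length is None:
        overlap_core_length = subsequence_length * 3 // 4

    if overlap_offset <= 0 or subsequence_length % overlap_offset != 0:
        raise ValueError(
            f"subsequence length {subsequence_length} is not evenly "
            f"divisible by overlap offset {overlap_offset}")
    if not 0 < overlap_core_length < subsequence_length:
        raise ValueError(
            f"overlap core length {overlap_core_length} must be smaller "
            f"than subsequence length {subsequence_length}")
    return overlap_offset, overlap_core_length


# Smallest and largest default subsequence lengths mentioned in the help text.
print(check_overlap_settings(21384))   # (10692, 16038)
print(check_overlap_settings(213840))  # (106920, 160380)
```

A value such as 20000 is rejected because it is not a multiple of 9.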

Test run

https://github.com/gglyptodon/helixer-docker

docker run --gpus all -it --rm gglyptodon/helixer-docker:helixer_v0.3.4_cuda_12.2.2-cudnn8

Helixer/scripts/fetch_helixer_models.py

cd shared/out/
curl -L ftp://ftp.ensemblgenomes.org/pub/plants/release-47/fasta/arabidopsis_lyrata/dna/Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz --output Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz

Once everything is ready, run Helixer.

Helixer.py --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz --lineage land_plant --gff-output-path Arabidopsis_lyrata_chromosome8_helixer.gff3
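When annotating several genomes, the same call can be assembled from Python. This is a minimal sketch, assuming Helixer.py is on the PATH (as inside the container). The helper name is hypothetical, and it only uses options that appear in the help output above.

```python
import shlex

# Lineage choices as listed in the help output.
LINEAGES = ("vertebrate", "land_plant", "fungi", "invertebrate")


def helixer_command(fasta_path, gff_output_path, lineage,
                    species=None, batch_size=None):
    """Build the Helixer.py argument list for one genome."""
    if lineage not in LINEAGES:
        raise ValueError(f"lineage must be one of {LINEAGES}, got {lineage!r}")
    cmd = ["Helixer.py",
           "--fasta-path", str(fasta_path),
           "--lineage", lineage,
           "--gff-output-path", str(gff_output_path)]
    if species is not None:
        cmd += ["--species", species]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    return cmd


cmd = helixer_command(
    "Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa.gz",
    "Arabidopsis_lyrata_chromosome8_helixer.gff3",
    "land_plant",
)
print(shlex.join(cmd))  # shows the command line without running it
# To launch it for real: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one shell string avoids quoting problems with unusual file names.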

Example output

The run on Arabidopsis_lyrata.v.1.0.dna.chromosome.8 took about 10 minutes without a GPU.
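A quick way to get a feel for the resulting GFF3 is to count the feature types in column 3. The sketch below relies only on the generic tab-separated, nine-column GFF3 layout. The function name is hypothetical and the toy records are made up for illustration, so they are not actual Helixer output.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values in column 3 (feature type) of GFF3 lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an optional embedded sequence section follows; stop here
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts


# Made-up records for illustration only.
toy_gff = [
    "##gff-version 3\n",
    "chr8\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "chr8\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=g1.1;Parent=g1\n",
    "chr8\ttoy\tCDS\t150\t400\t.\t+\t0\tID=c1;Parent=g1.1\n",
    "chr8\ttoy\tCDS\t500\t800\t.\t+\t0\tID=c2;Parent=g1.1\n",
]
print(dict(count_feature_types(toy_gff)))  # {'gene': 1, 'mRNA': 1, 'CDS': 2}
```

For the run above, `with open("Arabidopsis_lyrata_chromosome8_helixer.gff3") as fh: print(count_feature_types(fh))` would give the counts for the predicted annotation.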

Citation

Helixer–de novo Prediction of Primary Eukaryotic Gene Models Combining Deep Learning and a Hidden Markov Model

Felix Holst, Anthony Bolger, Christopher Günther, Janina Maß, Sebastian Triesch, Felicitas Kindel, Niklas Kiel, Nima Saadat, Oliver Ebenhöh, Björn Usadel, Rainer Schwacke, Marie Bolger

bioRxiv, Posted February 09, 2023.

See also

TogoTV