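DeepNOG: fast and accurate orthologous protein assignment with deep learning

Protein orthologous group databases are powerful tools for evolutionary analysis, functional annotation, and metabolic pathway modeling across lineages. Sequences are usually assigned to orthologous groups with alignment-based methods such as profile hidden Markov models, and this step has become a computational bottleneck.
Here the authors present DeepNOG, a very fast and accurate alignment-free orthology assignment method based on deep convolutional networks. DeepNOG was compared against state-of-the-art alignment-based methods (HMMER, DIAMOND) and an alignment-free method (DeepFam) on two orthology databases (COG, eggNOG5). DeepNOG scales to large orthology databases such as eggNOG, where it outperforms DeepFam by a wide margin in precision and recall. Alignment-based methods still give the most accurate assignments among the methods investigated, but DeepNOG's computing time on CPUs is an order of magnitude shorter. Optional GPU usage increases throughput considerably further. A command-line tool allows users to adopt it quickly.
Source code and packages are freely available at https://github.com/univieCUBE/deepnog. The platform-independent Python program can be installed with pip install deepnog.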

Installation

I created a conda virtual environment, installed DeepNOG into it, and tested the CPU version (OS: Ubuntu 18.04 LTS).

Dependencies

Main program: GitHub (https://github.com/univieCUBE/deepnog)

# Install from Bioconda with conda
conda create -n deepnog python=3.8 -y
conda activate deepnog
conda install -c bioconda deepnog
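
To confirm from a script that the installation worked, one option is to check that the deepnog executable is on PATH and ask it for its version. The following is a minimal sketch, not part of DeepNOG itself: the helper name deepnog_version is hypothetical, and the only DeepNOG feature it relies on is the --version flag listed in the help output.

```python
import shutil
import subprocess
from typing import Optional


def deepnog_version(executable: str = "deepnog") -> Optional[str]:
    """Return the text printed by `<executable> --version`, or None if it is not on PATH.

    Hypothetical helper for illustration; it only relies on the --version flag.
    """
    path = shutil.which(executable)
    if path is None:
        # The executable is not visible in the current environment
        # (for example, the conda environment has not been activated).
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Depending on the Python version, the version string may go to stdout or stderr.
    return (result.stdout or result.stderr).strip()
```

Calling deepnog_version() inside the activated environment should return a version string, and None usually means the environment is not active.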

> deepnog -h

usage: deepnog [-h] [-v] {train,infer} ...

Assign protein sequences to orthologous groups with deep learning.

positional arguments:
  {train,infer}
    train        Train a model for a custom database.
    infer        Infer protein orthologous groups

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

> deepnog infer -h

usage: deepnog infer [-h] [-ff FILEFORMAT] [-V VERBOSE] [-d {auto,cpu,gpu}] [-nw NUM_WORKERS] [-a {deepnog,deepencoding,deepfam,deepfam_light}] [-w WEIGHTS_FILE] [-bs BATCH_SIZE] [-o OUT_FILE]
                     [-db {eggNOG5,cog2020}] [-t TAXONOMIC_LEVEL] [--test_labels TEST_LABELS_FILE] [-of {csv,tsv,legacy}] [-c CONFIDENCE]
                     SEQUENCE_FILE

positional arguments:
  SEQUENCE_FILE         File containing protein sequences for orthology inference.

optional arguments:
  -h, --help            show this help message and exit
  -ff FILEFORMAT, --fformat FILEFORMAT
                        File format of protein sequences. Must be supported by Biopythons Bio.SeqIO class.
  -V VERBOSE, --verbose VERBOSE
                        Define verbosity of DeepNOGs output written to stdout or stderr. 0 only writes errors to stderr which cause DeepNOG to abort and exit. 1 also writes warnings to stderr if
                        e.g. a protein without an ID was found and skipped. 2 additionally writes general progress messages to stdout. 3 includes a dynamic progress bar of the prediction stage
                        using tqdm.
  -d {auto,cpu,gpu}, --device {auto,cpu,gpu}
                        Define device for calculating protein sequence classification. Auto chooses GPU if available, otherwise CPU.
  -nw NUM_WORKERS, --num-workers NUM_WORKERS
                        Number of subprocesses (workers) to use for data loading. Set to a value <= 0 to use single-process data loading. Note: Only use multi-process data loading if you are
                        calculating on a gpu (otherwise inefficient)!
  -a {deepnog,deepencoding,deepfam,deepfam_light}, --architecture {deepnog,deepencoding,deepfam,deepfam_light}
                        Network architecture to use for classification.
  -w WEIGHTS_FILE, --weights WEIGHTS_FILE
                        Custom weights file path (optional)
  -bs BATCH_SIZE, --batch-size BATCH_SIZE
                        The batch size determines how many sequences are processed by the network at once. If 1, process the protein sequences sequentially (recommended on CPUs). Larger batch
                        sizes speed up the inference and training on GPUs. Batch size can influence the learning process.
  -o OUT_FILE, --out OUT_FILE
                        Store orthologous group predictions to outputfile. Per default, write predictions to stdout.
  -db {eggNOG5,cog2020}, --database {eggNOG5,cog2020}
                        Orthologous group/family database to use.
  -t TAXONOMIC_LEVEL, --tax TAXONOMIC_LEVEL
                        Taxonomic level to use in specified database, e.g. 1 = root, 2 = bacteria
  --test_labels TEST_LABELS_FILE
                        Measure model performance on a test set. If provided, this file must contain the ground-truth labels for the provided sequences. Otherwise, only perform inference.
  -of {csv,tsv,legacy}, --outformat {csv,tsv,legacy}
                        Output file format
  -c CONFIDENCE, --confidence-threshold CONFIDENCE
                        If provided, predictions below the threshold are discarded.By default, any confidence threshold stored in the model is applied, if present.
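
When inference is run from a pipeline rather than typed by hand, the options above can be assembled programmatically. The sketch below is one possible wrapper, not something DeepNOG ships: the function names build_infer_command and run_infer are hypothetical. It uses only flags listed in the help output (-db, -t, -d, -o, -c), and the allowed values for database and device are copied from the same output.

```python
import subprocess
from typing import List, Optional

# Choices copied from the `deepnog infer -h` output above.
DATABASES = ("eggNOG5", "cog2020")
DEVICES = ("auto", "cpu", "gpu")


def build_infer_command(
    sequence_file: str,
    out_file: str,
    database: str = "eggNOG5",
    tax_level: int = 2,
    device: str = "auto",
    confidence: Optional[float] = None,
) -> List[str]:
    """Build the argument list for `deepnog infer` (hypothetical helper)."""
    if database not in DATABASES:
        raise ValueError(f"database must be one of {DATABASES}")
    if device not in DEVICES:
        raise ValueError(f"device must be one of {DEVICES}")
    cmd = [
        "deepnog", "infer", sequence_file,
        "-db", database,
        "-t", str(tax_level),
        "-d", device,
        "-o", out_file,
    ]
    if confidence is not None:
        # Only pass -c when the caller wants to override the stored threshold.
        cmd += ["-c", str(confidence)]
    return cmd


def run_infer(sequence_file: str, out_file: str, **options) -> None:
    """Run the command; raises CalledProcessError if deepnog exits with an error."""
    subprocess.run(build_infer_command(sequence_file, out_file, **options), check=True)
```

For example, run_infer("proteins.faa", "prediction.csv", device="cpu") would run inference on the CPU with the default database and taxonomic level of this sketch.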

> deepnog train -h

usage: deepnog train [-h] [-ff FILEFORMAT] [-V VERBOSE] [-d {auto,cpu,gpu}] [-nw NUM_WORKERS] [-a {deepnog,deepencoding,deepfam,deepfam_light}] [-w WEIGHTS_FILE] [-bs BATCH_SIZE] -o OUT_DIR -db
                     DATABASE_NAME -t TAXONOMIC_LEVEL [-e N_EPOCHS] [-s] [-lr LEARNING_RATE] [-g LEARNING_RATE_DECAY] [-l2 λ] [-r RANDOM_SEED] [--save-each-epoch]
                     TRAIN_SEQUENCE_FILE VAL_SEQUENCE_FILE TRAIN_LABELS_FILE VAL_LABELS_FILE

positional arguments:
  TRAIN_SEQUENCE_FILE   File containing protein sequences training set.
  VAL_SEQUENCE_FILE     File containing protein sequences validation set.
  TRAIN_LABELS_FILE     Orthologous group labels for training set protein sequences.
  VAL_LABELS_FILE       Orthologous group labels for training and validation set protein sequences. Both training and validation labels Must be in CSV files that are parseable by
                        pandas.read_csv(..., index_col=1). The first column must be a numerical index. The other columns should be named 'protein_id' and 'eggnog_id', or be in order
                        sequence_identifier first, label_identifier second.

optional arguments:
  -h, --help            show this help message and exit
  -ff FILEFORMAT, --fformat FILEFORMAT
                        File format of protein sequences. Must be supported by Biopythons Bio.SeqIO class.
  -V VERBOSE, --verbose VERBOSE
                        Define verbosity of DeepNOGs output written to stdout or stderr. 0 only writes errors to stderr which cause DeepNOG to abort and exit. 1 also writes warnings to stderr if
                        e.g. a protein without an ID was found and skipped. 2 additionally writes general progress messages to stdout. 3 includes a dynamic progress bar of the prediction stage
                        using tqdm.
  -d {auto,cpu,gpu}, --device {auto,cpu,gpu}
                        Define device for calculating protein sequence classification. Auto chooses GPU if available, otherwise CPU.
  -nw NUM_WORKERS, --num-workers NUM_WORKERS
                        Number of subprocesses (workers) to use for data loading. Set to a value <= 0 to use single-process data loading. Note: Only use multi-process data loading if you are
                        calculating on a gpu (otherwise inefficient)!
  -a {deepnog,deepencoding,deepfam,deepfam_light}, --architecture {deepnog,deepencoding,deepfam,deepfam_light}
                        Network architecture to use for classification.
  -w WEIGHTS_FILE, --weights WEIGHTS_FILE
                        Custom weights file path (optional)
  -bs BATCH_SIZE, --batch-size BATCH_SIZE
                        The batch size determines how many sequences are processed by the network at once. If 1, process the protein sequences sequentially (recommended on CPUs). Larger batch
                        sizes speed up the inference and training on GPUs. Batch size can influence the learning process.
  -o OUT_DIR, --out OUT_DIR
                        Store training results to files in the given directory. Results include the trained model,training/validation loss and accuracy values,and the ground truth plus predicted
                        classes per training epoch, if requested.
  -db DATABASE_NAME, --database DATABASE_NAME
                        Orthologous group database name
  -t TAXONOMIC_LEVEL, --tax TAXONOMIC_LEVEL
                        Taxonomic level in specified database
  -e N_EPOCHS, --n-epochs N_EPOCHS
                        Number of training epochs, that is, passes over the complete data set.
  -s, --shuffle         Shuffle the training sequences. Note that a shuffle buffer is used in combination with an iterable dataset. That is, not all sequences have equal probability to be chosen.
                        If you have highly structured sequence files consider shuffling them in advance. Default buffer size = 65536
  -lr LEARNING_RATE, --learning-rate LEARNING_RATE
                        Initial learning rate, subject to adaptations by chosen optimizer and scheduler.
  -g LEARNING_RATE_DECAY, --gamma LEARNING_RATE_DECAY
                        Decay for learning rate step scheduler. (lr_epoch_t2 = gamma * lr_epoch_t1)
  -l2 λ, --l2-coeff λ   Regularization coefficient λ for L2 regularization. If None, L2 regularization is disabled.
  -r RANDOM_SEED, --random-seed RANDOM_SEED
                        Seed the random number generators of numpy and PyTorch during training for reproducibility. Also affects cuDNN determinism. Default: None (disables reproducibility)
  --save-each-epoch     Save the model after each epoch.
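
The help text above says the label files must be CSV with a numerical index in the first column, followed by columns named 'protein_id' and 'eggnog_id'. The sketch below shows one way to write such a file with the standard library. It is an illustration under assumptions rather than DeepNOG's own tooling: the function name write_labels is hypothetical, the protein and group identifiers are made up, and leaving the index column's header empty (as pandas does when writing an unnamed index) is an assumption that should be checked against the DeepNOG documentation.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_labels(pairs: Iterable[Tuple[str, str]], handle: TextIO) -> int:
    """Write (protein_id, group_id) pairs as a label CSV; return the number of rows.

    Layout: numerical index first, then 'protein_id' and 'eggnog_id',
    following the `deepnog train -h` description. The empty index header
    is an assumption.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["", "protein_id", "eggnog_id"])
    count = 0
    for index, (protein_id, group_id) in enumerate(pairs):
        writer.writerow([index, protein_id, group_id])
        count = index + 1
    return count


# Demonstration with made-up identifiers, written to an in-memory buffer.
buffer = io.StringIO()
write_labels([("protein_1", "COG0001"), ("protein_2", "COG0002")], buffer)
print(buffer.getvalue())
```

To write a real file, open it with open("train_labels.csv", "w", newline="") and pass the handle in place of the buffer.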

Usage

Specify a protein sequence file. By default, the bacteria level of the eggNOG5 database is used.

deepnog infer proteins.faa -db eggNOG5 -t 2 --out prediction.csv

-db {eggNOG5, cog2020} Orthologous group/family database to use.
-t Taxonomic level to use in specified database, e.g. 1 = root, 2 = bacteria
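
Once prediction.csv exists, a quick summary of how many sequences fall into each orthologous group can be useful. The sketch below uses only the standard library and rests on an assumption: that the CSV has a header row with columns named sequence_id, prediction, and confidence. Verify this against your own output file and adjust the names if they differ. The function name count_groups and the sample rows in the usage note are made up for illustration.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO


def count_groups(handle: TextIO, min_confidence: float = 0.0) -> Dict[str, int]:
    """Count predictions per orthologous group at or above min_confidence.

    Assumes header columns 'prediction' and 'confidence' (an assumption,
    not confirmed by the help output). Rows without a prediction or
    confidence value are skipped.
    """
    counts: Counter = Counter()
    for row in csv.DictReader(handle):
        group = (row.get("prediction") or "").strip()
        confidence = (row.get("confidence") or "").strip()
        if not group or not confidence:
            continue
        if float(confidence) >= min_confidence:
            counts[group] += 1
    return dict(counts)
```

A typical call would be count_groups(open("prediction.csv", newline=""), min_confidence=0.9). For a quick check without a real file, io.StringIO("sequence_id,prediction,confidence\np1,COG0001,0.99\n") can stand in for the file handle.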

Citation

DeepNOG: Fast and accurate protein orthologous group assignment
Roman Feldbauer, Lukas Gosch, Lukas Lüftinger, Patrick Hyden, Arthur Flexer, Thomas Rattei
Bioinformatics, Published: 26 December 2020