macでインフォマティクス (Informatics on Mac)

A collection of notes on informatics for HTS (NGS).

DeepNOG: fast and accurate orthologous protein group assignment with deep learning

 

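Protein orthologous group databases are powerful tools for evolutionary analysis, functional annotation, and metabolic pathway modeling across lineages. Sequences are typically assigned to orthologous groups with alignment-based methods such as profile hidden Markov models, and this step has become a computational bottleneck.
This paper presents DeepNOG, an extremely fast and accurate alignment-free orthology assignment method based on deep convolutional networks. DeepNOG was compared against state-of-the-art alignment-based methods (HMMER, DIAMOND) and an alignment-free method (DeepFam) on two orthology databases (COG, eggNOG5). DeepNOG scales to large orthology databases such as eggNOG, where it outperforms DeepFam by a wide margin in precision and recall. Alignment-based methods still give the most accurate assignments among the methods investigated, but DeepNOG's computing time on CPUs is an order of magnitude lower. Optional GPU use increases throughput substantially further. A command-line tool allows rapid adoption by users.
Source code and packages are freely available at https://github.com/univieCUBE/deepnog. The platform-independent Python program can be installed by running "pip install deepnog".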

 

Installation

I created a conda virtual environment, installed DeepNOG into it, and tested the CPU version (OS: Ubuntu 18.04 LTS).

Dependencies

Source: GitHub (https://github.com/univieCUBE/deepnog)

# conda (Bioconda package)
conda create -n deepnog python=3.8 -y
conda activate deepnog
conda install -c bioconda deepnog
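
After installing, it can be handy to confirm from a script that the `deepnog` executable is actually reachable, for example before launching a batch job. The following is a minimal sketch using only the Python standard library; the helper name `find_cli` is hypothetical and not part of DeepNOG.

```python
import shutil
from typing import Optional


def find_cli(name: str = "deepnog") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is not found."""
    return shutil.which(name)


path = find_cli()
if path is None:
    # Typical cause: the conda environment has not been activated in this shell.
    print("deepnog not found on PATH; is the conda environment activated?")
else:
    print(f"deepnog found at {path}")
```

If the check fails, activating the environment with `conda activate deepnog` and rerunning it is usually the first thing to try.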

$ deepnog -h
usage: deepnog [-h] [-v] {train,infer} ...

Assign protein sequences to orthologous groups with deep learning.

positional arguments:
  {train,infer}
    train        Train a model for a custom database.
    infer        Infer protein orthologous groups

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

$ deepnog infer -h
usage: deepnog infer [-h] [-ff FILEFORMAT] [-V VERBOSE] [-d {auto,cpu,gpu}] [-nw NUM_WORKERS] [-a {deepnog,deepencoding,deepfam,deepfam_light}] [-w WEIGHTS_FILE] [-bs BATCH_SIZE] [-o OUT_FILE]
                     [-db {eggNOG5,cog2020}] [-t TAXONOMIC_LEVEL] [--test_labels TEST_LABELS_FILE] [-of {csv,tsv,legacy}] [-c CONFIDENCE]
                     SEQUENCE_FILE

positional arguments:
  SEQUENCE_FILE         File containing protein sequences for orthology inference.

optional arguments:
  -h, --help            show this help message and exit
  -ff FILEFORMAT, --fformat FILEFORMAT
                        File format of protein sequences. Must be supported by Biopythons Bio.SeqIO class.
  -V VERBOSE, --verbose VERBOSE
                        Define verbosity of DeepNOGs output written to stdout or stderr. 0 only writes errors to stderr which cause DeepNOG to abort and exit. 1 also writes warnings to stderr if
                        e.g. a protein without an ID was found and skipped. 2 additionally writes general progress messages to stdout. 3 includes a dynamic progress bar of the prediction stage
                        using tqdm.
  -d {auto,cpu,gpu}, --device {auto,cpu,gpu}
                        Define device for calculating protein sequence classification. Auto chooses GPU if available, otherwise CPU.
  -nw NUM_WORKERS, --num-workers NUM_WORKERS
                        Number of subprocesses (workers) to use for data loading. Set to a value <= 0 to use single-process data loading. Note: Only use multi-process data loading if you are
                        calculating on a gpu (otherwise inefficient)!
  -a {deepnog,deepencoding,deepfam,deepfam_light}, --architecture {deepnog,deepencoding,deepfam,deepfam_light}
                        Network architecture to use for classification.
  -w WEIGHTS_FILE, --weights WEIGHTS_FILE
                        Custom weights file path (optional)
  -bs BATCH_SIZE, --batch-size BATCH_SIZE
                        The batch size determines how many sequences are processed by the network at once. If 1, process the protein sequences sequentially (recommended on CPUs). Larger batch
                        sizes speed up the inference and training on GPUs. Batch size can influence the learning process.
  -o OUT_FILE, --out OUT_FILE
                        Store orthologous group predictions to outputfile. Per default, write predictions to stdout.
  -db {eggNOG5,cog2020}, --database {eggNOG5,cog2020}
                        Orthologous group/family database to use.
  -t TAXONOMIC_LEVEL, --tax TAXONOMIC_LEVEL
                        Taxonomic level to use in specified database, e.g. 1 = root, 2 = bacteria
  --test_labels TEST_LABELS_FILE
                        Measure model performance on a test set. If provided, this file must contain the ground-truth labels for the provided sequences. Otherwise, only perform inference.
  -of {csv,tsv,legacy}, --outformat {csv,tsv,legacy}
                        Output file format
  -c CONFIDENCE, --confidence-threshold CONFIDENCE
                        If provided, predictions below the threshold are discarded.By default, any confidence threshold stored in the model is applied, if present.
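
The `-c` option discards predictions below a confidence threshold at inference time. The same kind of filtering can also be applied afterwards to a CSV result file, which makes it easy to try several cutoffs without rerunning the model. The sketch below assumes the CSV has columns named `sequence_id`, `prediction`, and `confidence`; these column names, the helper `filter_predictions`, and the example rows are assumptions for illustration and should be checked against a real output file.

```python
import csv
import io
from typing import Dict, Iterable, List


def filter_predictions(
    rows: Iterable[Dict[str, str]],
    min_confidence: float,
    confidence_col: str = "confidence",  # assumed column name, verify on real output
) -> List[Dict[str, str]]:
    """Keep rows whose confidence value is at least min_confidence."""
    kept = []
    for row in rows:
        value = row.get(confidence_col)
        if value in (None, ""):
            continue  # skip rows without a confidence value
        if float(value) >= min_confidence:
            kept.append(row)
    return kept


# Made-up example rows; a real file would come from `deepnog infer ... -of csv`.
example_csv = """sequence_id,prediction,confidence
protA,COG0001,0.98
protB,COG0002,0.41
protC,COG0003,0.87
"""

reader = csv.DictReader(io.StringIO(example_csv))
kept = filter_predictions(reader, min_confidence=0.8)
for row in kept:
    print(row["sequence_id"], row["prediction"], row["confidence"])
```

To use it on a real result, replace the `io.StringIO(example_csv)` handle with an opened file such as `open("prediction.csv", newline="")`.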

$ deepnog train -h
usage: deepnog train [-h] [-ff FILEFORMAT] [-V VERBOSE] [-d {auto,cpu,gpu}] [-nw NUM_WORKERS] [-a {deepnog,deepencoding,deepfam,deepfam_light}] [-w WEIGHTS_FILE] [-bs BATCH_SIZE] -o OUT_DIR -db
                     DATABASE_NAME -t TAXONOMIC_LEVEL [-e N_EPOCHS] [-s] [-lr LEARNING_RATE] [-g LEARNING_RATE_DECAY] [-l2 λ] [-r RANDOM_SEED] [--save-each-epoch]
                     TRAIN_SEQUENCE_FILE VAL_SEQUENCE_FILE TRAIN_LABELS_FILE VAL_LABELS_FILE

positional arguments:
  TRAIN_SEQUENCE_FILE   File containing protein sequences training set.
  VAL_SEQUENCE_FILE     File containing protein sequences validation set.
  TRAIN_LABELS_FILE     Orthologous group labels for training set protein sequences.
  VAL_LABELS_FILE       Orthologous group labels for training and validation set protein sequences. Both training and validation labels Must be in CSV files that are parseable by
                        pandas.read_csv(..., index_col=1). The first column must be a numerical index. The other columns should be named 'protein_id' and 'eggnog_id', or be in order
                        sequence_identifier first, label_identifier second.

optional arguments:
  -h, --help            show this help message and exit
  -ff FILEFORMAT, --fformat FILEFORMAT
                        File format of protein sequences. Must be supported by Biopythons Bio.SeqIO class.
  -V VERBOSE, --verbose VERBOSE
                        Define verbosity of DeepNOGs output written to stdout or stderr. 0 only writes errors to stderr which cause DeepNOG to abort and exit. 1 also writes warnings to stderr if
                        e.g. a protein without an ID was found and skipped. 2 additionally writes general progress messages to stdout. 3 includes a dynamic progress bar of the prediction stage
                        using tqdm.
  -d {auto,cpu,gpu}, --device {auto,cpu,gpu}
                        Define device for calculating protein sequence classification. Auto chooses GPU if available, otherwise CPU.
  -nw NUM_WORKERS, --num-workers NUM_WORKERS
                        Number of subprocesses (workers) to use for data loading. Set to a value <= 0 to use single-process data loading. Note: Only use multi-process data loading if you are
                        calculating on a gpu (otherwise inefficient)!
  -a {deepnog,deepencoding,deepfam,deepfam_light}, --architecture {deepnog,deepencoding,deepfam,deepfam_light}
                        Network architecture to use for classification.
  -w WEIGHTS_FILE, --weights WEIGHTS_FILE
                        Custom weights file path (optional)
  -bs BATCH_SIZE, --batch-size BATCH_SIZE
                        The batch size determines how many sequences are processed by the network at once. If 1, process the protein sequences sequentially (recommended on CPUs). Larger batch
                        sizes speed up the inference and training on GPUs. Batch size can influence the learning process.
  -o OUT_DIR, --out OUT_DIR
                        Store training results to files in the given directory. Results include the trained model,training/validation loss and accuracy values,and the ground truth plus predicted
                        classes per training epoch, if requested.
  -db DATABASE_NAME, --database DATABASE_NAME
                        Orthologous group database name
  -t TAXONOMIC_LEVEL, --tax TAXONOMIC_LEVEL
                        Taxonomic level in specified database
  -e N_EPOCHS, --n-epochs N_EPOCHS
                        Number of training epochs, that is, passes over the complete data set.
  -s, --shuffle         Shuffle the training sequences. Note that a shuffle buffer is used in combination with an iterable dataset. That is, not all sequences have equal probability to be chosen.
                        If you have highly structured sequence files consider shuffling them in advance. Default buffer size = 65536
  -lr LEARNING_RATE, --learning-rate LEARNING_RATE
                        Initial learning rate, subject to adaptations by chosen optimizer and scheduler.
  -g LEARNING_RATE_DECAY, --gamma LEARNING_RATE_DECAY
                        Decay for learning rate step scheduler. (lr_epoch_t2 = gamma * lr_epoch_t1)
  -l2 λ, --l2-coeff λ   Regularization coefficient λ for L2 regularization. If None, L2 regularization is disabled.
  -r RANDOM_SEED, --random-seed RANDOM_SEED
                        Seed the random number generators of numpy and PyTorch during training for reproducibility. Also affects cuDNN determinism. Default: None (disables reproducibility)
  --save-each-epoch     Save the model after each epoch.
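
The help text above says the label files must be CSV with a numerical index in the first column followed by columns named 'protein_id' and 'eggnog_id'. As a rough sketch of how such a file could be produced with the standard library, the code below writes that layout. It assumes an unnamed index column in the header (the layout pandas `DataFrame.to_csv` would produce); the helper name `write_labels_csv` and the example identifiers are made up, and the result has not been verified against `deepnog train`.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_labels_csv(pairs: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write (protein_id, group_id) pairs as CSV with a leading numerical index column."""
    writer = csv.writer(handle, lineterminator="\n")
    # Assumption: the index column header is left empty.
    writer.writerow(["", "protein_id", "eggnog_id"])
    for index, (protein_id, group_id) in enumerate(pairs):
        writer.writerow([index, protein_id, group_id])


# Made-up identifiers, written to an in-memory buffer for demonstration.
buffer = io.StringIO()
write_labels_csv([("protA", "COG0001"), ("protB", "COG0002")], buffer)
print(buffer.getvalue())
```

For real use, pass a file opened with `open("train_labels.csv", "w", newline="")` instead of the in-memory buffer, and write one file each for the training and validation sets.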

 

 

Usage

Specify a protein sequence file. By default, the bacteria level of the eggNOG5 database is used.

deepnog infer proteins.faa -db eggNOG5 -t 2 --out prediction.csv
  • -db {eggNOG5, cog2020}   Orthologous group/family database to use.
  • -t    Taxonomic level to use in specified database, e.g. 1 = root, 2 = bacteria 
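
When running inference over many files or from a pipeline, the command above can be assembled programmatically. The sketch below builds the argument list for `deepnog infer` using only flags that appear in the help output (`-db`, `-t`, `-d`, `--out`, `-c`); the function name `build_infer_command` and its defaults are hypothetical, and the command is only printed, not executed.

```python
import shlex
from typing import List, Optional


def build_infer_command(
    sequence_file: str,
    out_file: str,
    database: str = "eggNOG5",
    tax_level: int = 2,
    device: str = "auto",
    confidence: Optional[float] = None,
) -> List[str]:
    """Assemble the argument list for `deepnog infer`."""
    cmd = [
        "deepnog", "infer", sequence_file,
        "-db", database,
        "-t", str(tax_level),
        "-d", device,
        "--out", out_file,
    ]
    if confidence is not None:
        # Only pass a threshold when one was requested explicitly.
        cmd += ["-c", str(confidence)]
    return cmd


cmd = build_infer_command("proteins.faa", "prediction.csv")
print(shlex.join(cmd))
# To execute it (requires deepnog on PATH and an existing proteins.faa):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.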

 

 

Citation

DeepNOG: Fast and accurate protein orthologous group assignment
Roman Feldbauer, Lukas Gosch, Lukas Lüftinger, Patrick Hyden, Arthur Flexer, Thomas Rattei
Bioinformatics, Published: 26 December 2020

 

Related


Thank you to everyone who helped me this year. I look forward to your continued support next year.

I wish you all a happy New Year.