OrthoHMM: improving ortholog inference with hidden Markov models

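Accurate orthology inference is essential for comparative genomics and phylogenomics. However, orthology inference is hampered by sequence divergence, which is pronounced among anciently diverged organisms. OrthoHMM is an algorithm that infers orthologous gene groups using hidden Markov models parameterized from substitution matrices, which enables the detection of remote homologs. In benchmarks, OrthoHMM outperforms currently available methods. For example, on a curated set of Bilaterian orthogroups, OrthoHMM showed a 10.3 - 138.9% improvement in precision. In a rank-based benchmark using the Bilaterian orthogroups and a new dataset of orthogroups from three major eukaryotic kingdoms, OrthoHMM had the best overall performance (6.7 - 97.8% overall improvement). These results suggest that hidden Markov models improve orthogroup inference.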

NEW preprint!🥳 Orthogroups are a prerequisite for comparative genomics and Tree of Life inquiries

Introducing #OrthoHMM, software that improves orthogroup inference, which may refine our understanding of genome evolution and the Tree of Life

🔗 https://t.co/pd3NQFRe5X pic.twitter.com/rttz5kH3BU
— 🧬Jacob L Steenwyk (@jlsteenwyk) December 17, 2024

Here is the general OrthoHMM workflow. It implements many of the same measures as other algorithms, like OrthoFinder.

Panels b-e are managed by OrthoHMM. Thus, users need to minimally specify the path to a directory w/ the files they want to use — that's it. pic.twitter.com/7W7Ih84mZj
— 🧬Jacob L Steenwyk (@jlsteenwyk) December 17, 2024

OrthoHMM's advantage over other software is that it uses HMMER during all-by-all searches.

We don't generate HMMs from multiple sequence alignments to speed up computation. Instead, HMMs are built from single-sequences & substitution matrices. These HMMs often outperform BLAST
— 🧬Jacob L Steenwyk (@jlsteenwyk) December 17, 2024
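
As a rough illustration of the idea in the last tweet (a profile built from a single sequence plus a substitution matrix), the sketch below turns one row of log-odds substitution scores into emission probabilities using p(b|a) ∝ f_b · exp(λ · s_ab). This is a generic textbook conversion, not code taken from OrthoHMM or HMMER. The three-letter alphabet, the scores, the uniform background frequencies, and the value of `lam` are all made up for illustration.

```python
import math


def emission_probs(scores, background, lam):
    """Convert one row of log-odds scores into emission probabilities.

    scores:     dict residue -> substitution score against the query residue
    background: dict residue -> background frequency
    lam:        scale factor of the score matrix (hypothetical value here)
    """
    # Weight each residue by its background frequency times exp(lambda * score).
    weights = {b: background[b] * math.exp(lam * s) for b, s in scores.items()}
    # Normalize so that the probabilities sum to 1.
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


# Toy 3-letter alphabet; the scores are invented and are NOT BLOSUM62 values.
toy_scores_for_A = {"A": 4, "B": 0, "C": -2}
toy_background = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

probs = emission_probs(toy_scores_for_A, toy_background, lam=0.3)
for residue, p in probs.items():
    print(residue, round(p, 3))
```

A query residue "A" then emits "A" with the highest probability and the low-scoring "C" with the lowest, which is the sense in which a single sequence plus a matrix can stand in for a profile built from an alignment.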

Documentation

https://jlsteenwyk.com/orthohmm/

FAQ

https://jlsteenwyk.com/orthohmm/frequently_asked_questions/index.html

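Installation

I created a Python 3.11 environment with mamba and installed OrthoHMM with pip (under Python 3.12, the installation failed with an error about the missing distutils module).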

GitHub

# PyPI
mamba create -n orthohmm python=3.11 -y
conda activate orthohmm
pip install orthohmm
mamba install -c bioconda mcl  # MCL is required; alternatively, pass the path to mcl with -m
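
According to the help text, OrthoHMM expects both `phmmer` and `mcl` to be callable from PATH unless their paths are given with -p and -m. A small pre-flight check can be sketched as below. The function name `find_missing_tools` is a hypothetical helper, not part of OrthoHMM.

```python
import shutil


def find_missing_tools(tools=("phmmer", "mcl")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("phmmer and mcl are both available.")
```

If a tool is reported as missing, either install it into the active environment or pass its full path to orthohmm with -p (phmmer) or -m (mcl).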

> orthohmm -h

$ orthohmm -h

  ____       _   _           _    _ __  __ __  __
 / __ \     | | | |         | |  | |  \/  |  \/  |
| |  | |_ __| |_| |__   ___ | |__| | \  / | \  / |
| |  | | '__| __| '_ \ / _ \|  __  | |\/| | |\/| |
| |__| | |  | |_| | | | (_) | |  | | |  | | |  | |
 \____/|_|   \__|_| |_|\___/|_|  |_|_|  |_|_|  |_|

Version: 0.0.0
Citation: Steenwyk et al. YEAR, JOURNAL. doi: DOI
LINK

HMM-based inference of orthologous groups.

Usage: orthohmm <input> [optional arguments]

required argument:
<input_directory>                           Directory of FASTA files ending in
                                            .fa, .faa, .fas, .fasta, .pep, or .prot
                                            (must be the first argument)

optional arguments:
-o, --output_directory <path>               output directory name
                                            (default: same directory as
                                            directory of FASTA files)

-p, --phmmer <path>                         path to phmmer from HMMER suite
                                            (default: phmmer)

-e, --evalue <float>                        e-value threshold to use for
                                            phmmer search results
                                            (default: 0.0001)

-x, --substitution_matrix <subs. matrix>    substitution matrix to use for
                                            residue probabilities
                                            (default: BLOSUM62)

-c, --cpu <integer>                         number of parallel CPU workers
                                            to use for multithreading
                                            (default: auto detect)

-s, --single_copy_threshold <float>         taxon occupancy threshold
                                            for single-copy orthologs
                                            (default: 0.5)

-m, --mcl <path>                            path to mcl software
                                            (default: mcl)

-i, --inflation_value <float>               mcl inflation parameter
                                            (default: 1.5)

--stop <prepare, infer, write>              options for stopping
                                            an analysis at a specific
                                            intermediate step

--start <search_res>                        start analysis from
                                            completed all-vs-all
                                            search results

-------------------------------------
| Detailed explanation of arguments |
-------------------------------------
Input Directory (first argument)
    A directory that contains FASTA files of protein sequences that
    also have the extensions .fa, .faa, .fas, .fasta, .pep, or .prot.
    OrthoHMM will automatically identify files with these extensions
    and use them for analyses.

Output Directory (-o, --output_directory)
    Output directory name to store OrthoHMM results. This directory
    should already exist. By default, results files will be written
    to the same directory as the input directory of FASTA files.

Phmmer (-p, --phmmer)
    Path to phmmer executable from HMMER suite. By default, phmmer
    is assumed to be in the PATH variable; in other words, phmmer
    can be evoked by typing `phmmer`.

E-value Threshold (-e, --evalue)
    E-value threshold to use when filtering phmmer results. E-value
    thresholds are applied after searches are made. This is done so
    that users can change the e-value threshold if they are using
    the --start argument.

Substitution Matrix (-x, --substitution_matrix)
    Residue alignment probabilities will be determined from the
    specified substitution matrix. Supported substitution matrices
    include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90,
    PAM30, PAM70, PAM120, and PAM240. The default is BLOSUM62.

CPU (-c, --cpu)
    Number of CPU workers for multithreading during sequence search.
    This argument is used by phmmer during all-by-all comparisons.
    By default, the number of CPUs available will be auto-detected.

Single-Copy Threshold (-s, --single_copy_threshold)
    Taxon occupancy threshold when identifying single-copy orthologs.
    By default, the threshold is 50% taxon occupancy, which is specified
    as a fraction - that is, 0.5.

MCL (-m, --mcl)
    Path to mcl executable from MCL software. By default, mcl
    is assumed to be in the PATH variable; in other words,
    mcl can be evoked by typing `mcl`.

Inflation Value (-i, --inflation_value)
    MCL inflation parameter for clustering genes into orthologous groups.
    Lower values are more permissive resulting in larger orthogroups.
    Higher values are stricter resulting in smaller orthogroups.
    The default value is 1.5.

Stop (--stop)
    Similar to other ortholog calling algorithms, different steps in the
    OrthoHMM workflow can be cpu or memory intensive. Thus, users may
    want to stop OrthoHMM at certain steps, to facilitate more
    practical resource allocation. There are three choices for when to
    stop the analysis: prepare, infer, and write.
    - prepare: Stop after preparing input files for the all-by-all search
    - infer: Stop after inferring the orthogroups
    - write: Stop after writing sequence files for the orthogroups.
             Currently, this is synonymous with not specifying a step
             to stop the analysis at.

Start (--start)
    Start analysis from a specific intermediate step. Currently, this
    can only be applied to the results from the all-by-all search.
    - search_res: Start analysis from all-by-all search results.

-------------------
| OrthoHMM output |
-------------------
All OrthoHMM outputs have the prefix `orthohmm` so that they are easy to find.

orthohmm_gene_count.txt
    A gene count matrix per taxa for each orthogroup. Space delimited.

orthohmm_orthogroups.txt
    Genes present in each orthogroup. Space delimited.

orthohmm_single_copy_orthogroups.txt
    A single-column list of single-copy orthologs.

orthohmm_orthogroups
    A directory of FASTA files wherein each file is an orthogroup.

orthohmm_single_copy_orthogroups
    A directory of FASTA files wherein each file is a single-copy ortholog.
    Headers are modified to have taxon names come before the gene identifier.
    Taxon names are the file name excluding the extension. Taxon name and gene
    identifier are separated by a pipe symbol "|". This aims to help streamline
    phylogenomic workflows wherein sequences will be concatenated downstream
    based on taxon names.

orthohmm_working_res
    Various intermediate results files that help OrthoHMM start analyses
    from an intermediate step in the analysis. This includes outputs
    from phmmer searches, initial edges inputted to MCL, and the
    output from MCL clustering.
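
The header convention described in the help text (taxon name taken from the input file name without its extension, then a pipe, then the gene identifier) can be sketched in Python as follows. The function names and the example file name are hypothetical and only illustrate the convention. This is not OrthoHMM's own code.

```python
from pathlib import Path


def taxon_from_filename(fasta_path):
    """Taxon name = file name excluding the extension."""
    return Path(fasta_path).stem


def make_header(fasta_path, gene_id):
    """Build a 'taxon|gene' header as described in the help text."""
    return f"{taxon_from_filename(fasta_path)}|{gene_id}"


def split_header(header):
    """Split a 'taxon|gene' header (with or without a leading '>')."""
    taxon, gene_id = header.lstrip(">").split("|", 1)
    return taxon, gene_id


# Hypothetical file and gene names.
print(make_header("proteomes/Homo_sapiens.faa", "geneA"))  # Homo_sapiens|geneA
print(split_header(">Homo_sapiens|geneA"))  # ('Homo_sapiens', 'geneA')
```

Splitting on the first pipe only keeps gene identifiers that themselves contain a pipe intact, which matters when sequences are later concatenated by taxon name.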

Usage

To run with the default settings, specify the directory of FASTA files. The input directory must be the first argument.

orthohmm <path_to_directory_of_FASTA_files> -o outdir

Input directory (first argument): a directory of protein FASTA files with the extension .fa, .faa, .fas, .fasta, .pep, or .prot. OrthoHMM automatically identifies files with these extensions and uses them for the analysis.
-o  output directory name (default: same directory as the directory of FASTA files)
-c  number of parallel CPU workers to use for multithreading (default: auto detect)
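
For scripted runs, the command can be assembled and launched from Python. The sketch below uses only the flags listed in the help text (-o, -c, -e) and creates the output directory first, because the help text says it should already exist. The function names and the directory names `proteomes` and `outdir` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_orthohmm_command(input_dir, output_dir=None, cpu=None, evalue=None):
    """Assemble an orthohmm command line as a list of arguments."""
    cmd = ["orthohmm", str(input_dir)]  # the input directory must come first
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if cpu is not None:
        cmd += ["-c", str(cpu)]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    return cmd


def run_orthohmm(input_dir, output_dir, **options):
    """Create the output directory, then run orthohmm and raise on failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_orthohmm_command(input_dir, output_dir, **options), check=True)


# Only print the command here; call run_orthohmm(...) to actually execute it.
print(" ".join(build_orthohmm_command("proteomes", "outdir", cpu=8)))
# prints: orthohmm proteomes -o outdir -c 8
```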

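Example output

Summary files and a FASTA file for each orthogroup are written. These files are intended to facilitate downstream analyses, such as depicting gene family gain and loss or phylogenetic inference (from the paper).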
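
To make the -s (taxon occupancy) option from the help text concrete, here is a minimal sketch of one way such a filter could work. It assumes that an orthogroup counts as single-copy when every taxon that is present contributes exactly one gene, and that occupancy is the fraction of all input taxa that are present. OrthoHMM's exact rule may differ. The function name and the taxon names are hypothetical.

```python
def is_single_copy(counts, threshold=0.5):
    """Decide whether an orthogroup passes a single-copy occupancy filter.

    counts:    dict taxon -> number of genes in the orthogroup (0 if absent)
    threshold: minimum fraction of taxa that must be present
    """
    present = [n for n in counts.values() if n > 0]
    # Reject empty groups and groups where any taxon has more than one copy.
    if not present or any(n != 1 for n in present):
        return False
    occupancy = len(present) / len(counts)
    return occupancy >= threshold


# Hypothetical gene counts for one orthogroup across four taxa.
example = {"taxon_a": 1, "taxon_b": 1, "taxon_c": 0, "taxon_d": 0}
print(is_single_copy(example, threshold=0.5))
```

Raising the threshold toward 1.0 keeps only orthogroups present in nearly all taxa, which gives fewer but more complete markers for phylogenomic data matrices.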

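From the paper

Approximately 30% of bioinformatics tools fail to install (ref. 28). To ensure the long-term stability of OrthoHMM, the authors followed software design and development practices in line with industry standards and with other software they have developed (for example, ClipKIT and PhyKIT). Specifically, the OrthoHMM codebase has a modular design that makes debugging and integrating new features easier. Numerous unit and integration tests were written to test the validity of the codebase. Finally, a continuous integration pipeline was implemented that automatically tests OrthoHMM across different Python versions to confirm that the expected output is produced. Together, these features are meant to help ensure that OrthoHMM remains a long-lasting bioinformatics tool for the research community.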

Citation

OrthoHMM: Improved Inference of Ortholog Groups using Hidden Markov Models

Jacob L Steenwyk, Thomas J. Buida, Antonis Rokas, Nicole King

bioRxiv, Posted December 12, 2024.