HMMerge: merging multiple profile HMMs into one

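Despite decades of progress in methods for multiple sequence alignment, aligning datasets whose sequences differ widely in length is still not a fully solved problem, particularly when the input contains very short sequences (produced by the sequencing technology, or by large deletions during evolution).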

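HMMerge is a method for computing alignments of datasets with high sequence length heterogeneity; it works by adding short sequences to a given "backbone" alignment. It builds on the techniques of its predecessors, the alignment methods UPP and WITCH: it constructs an ensemble of profile HMMs to represent the backbone alignment, then uses that ensemble to add the remaining sequences to the backbone alignment. Unlike UPP and WITCH, HMMerge builds a new "merged" HMM from the ensemble and uses that merged HMM to align the query sequences. The authors show that HMMerge is competitive with WITCH and has an advantage over WITCH when adding very short sequences to a backbone alignment.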
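To make the backbone/query split concrete, here is a minimal sketch of how one might separate roughly full-length sequences (backbone candidates) from everything else (queries, mostly fragments) before any alignment is run. This is not part of HMMerge: the function names `read_fasta` and `split_by_length` and the 25% tolerance around the median length are assumptions chosen for illustration only.

```python
from statistics import median


def read_fasta(lines):
    """Parse FASTA text lines into a dict of {name: sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def split_by_length(records, tolerance=0.25):
    """Split records into (backbone_candidates, queries).

    A sequence is a backbone candidate when its length lies within
    `tolerance` of the median length; all other sequences become queries.
    The tolerance is a hypothetical default, not an HMMerge setting.
    """
    med = median(len(seq) for seq in records.values())
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    backbone, queries = {}, {}
    for name, seq in records.items():
        target = backbone if low <= len(seq) <= high else queries
        target[name] = seq
    return backbone, queries


example = """>a
ACGTACGTAC
>b
ACGTACGTAA
>c
ACGTACGTCC
>short
ACG
""".splitlines()

backbone, queries = split_by_length(read_fasta(example))
print(sorted(backbone))  # ['a', 'b', 'c']
print(sorted(queries))   # ['short']
```

The backbone candidates would still have to be aligned with a separate tool to obtain the backbone alignment; the sketch only covers the length-based partition.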

Installation

I use conda, so I created a conda environment and installed the dependencies into it.

Dependencies

biopython
click
dendropy (if using a backbone tree)
numpy
pyhmmer-sepp
pytest (for testing)
scipy

GitHub

https://github.com/MinhyukPark/HMMerge

mamba create -n HMMerge python=3 -y
conda activate HMMerge 
pip install biopython click dendropy numpy pyhmmer-sepp pytest scipy pathos

#pyhmmer-sepp is a fork of pyhmmer. The original pyhmmer can be installed as follows
mamba install -c bioconda pyhmmer

#the example uses trimAl
mamba install -c bioconda trimal 

#HMMerge itself
git clone https://github.com/MinhyukPark/HMMerge.git
cd HMMerge/

python main.py --help

Usage: main.py [OPTIONS]

Options:
  --input-dir PATH                The input temp root dir of sepp that
                                  contains all the HMMs  [required]
  --backbone-alignment PATH       The input backbone alignment  [required]
  --query-sequence-file PATH      The input query sequence file  [required]
  --output-prefix PATH            Output prefix  [required]
  --input-type [custom|sepp|upp]  The type of input
  --num-processes INTEGER         Number of Processes
  --support-value FLOAT RANGE     the weigt support of Top HMMs to choose for
                                  merge, 1.0 for all HMMs  [0.0<=x<=1.0]
  --equal-probabilities BOOLEAN   Whether to have equal enty/exit
                                  probabilities
  --model [DNA|RNA|amino]         DNA, RNA, or amino acid analysis  [required]
  --output-format [FASTA|A3M]     FASTA or A3M format for the output alignment
                                  [required]
  --debug                         Whether to run in debug mode or not
  --verbose                       Whether to run in verbose mode or not
  --help                          Show this message and exit.
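The help text says little about --support-value beyond "1.0 for all HMMs". One plausible reading, which I have not checked against the HMMerge source, is that the HMMs are ranked by weight and the top ones are kept until their normalized weights add up to the requested support. The sketch below illustrates only that reading; `select_top_hmms` and the example weights are hypothetical.

```python
def select_top_hmms(weights, support_value=1.0):
    """Return the names of the highest-weight HMMs whose normalized
    weights sum to at least `support_value`.

    This is an assumed interpretation of --support-value, not code
    taken from HMMerge. At least one HMM is always returned.
    """
    if not 0.0 <= support_value <= 1.0:
        raise ValueError("support_value must be between 0.0 and 1.0")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Rank HMMs from the highest weight to the lowest.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

    chosen, cumulative = [], 0.0
    for name, weight in ranked:
        # Stop once the requested support is covered (small tolerance
        # guards against floating-point rounding near 1.0).
        if chosen and cumulative >= support_value - 1e-12:
            break
        chosen.append(name)
        cumulative += weight / total
    return chosen


weights = {"hmm_a": 0.6, "hmm_b": 0.3, "hmm_c": 0.1}
print(select_top_hmms(weights, 0.5))  # ['hmm_a']
print(select_top_hmms(weights, 0.8))  # ['hmm_a', 'hmm_b']
print(select_top_hmms(weights, 1.0))  # ['hmm_a', 'hmm_b', 'hmm_c']
```

Under this reading, a value of 1.0 keeps every HMM in the ensemble, which matches the help text; how the tool actually computes the weights is outside the scope of the sketch.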

Testing

cd HMMerge/
pytest test.py

Usage

To run HMMerge you need a backbone alignment in FASTA format, a backbone tree in NEWICK format, and the sequences to be aligned in FASTA format (obtained by decomposing the MSA).

python main.py --input-dir decomposed_alignments_dir/ --backbone-alignment backbone_alignment --query-sequence-file Query_sequences --output-prefix outdir --num-processes 10 --model DNA --output-format FASTA

--input-dir The input temp root dir of sepp that contains all the HMMs [required]
--backbone-alignment The input backbone alignment [required]
--query-sequence-file The input query sequence file [required]
--model [DNA|RNA|amino] DNA, RNA, or amino acid analysis
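When running HMMerge over many datasets it can help to assemble the command line in a script and catch invalid option values before launching. The following is a minimal sketch that uses only the flags and choices shown in the help output above; the helper name `build_hmmerge_command` and the example paths are placeholders of my own.

```python
import shlex

# Choices copied from the --help output of main.py.
MODELS = ("DNA", "RNA", "amino")
OUTPUT_FORMATS = ("FASTA", "A3M")


def build_hmmerge_command(input_dir, backbone_alignment, query_sequence_file,
                          output_prefix, model, output_format,
                          num_processes=1):
    """Return the HMMerge invocation as an argument list.

    Raises ValueError for values the help text does not list, so that
    mistakes are caught before the (possibly long) run starts.
    """
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}, got {model!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(
            f"output_format must be one of {OUTPUT_FORMATS}, got {output_format!r}")
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return [
        "python", "main.py",
        "--input-dir", str(input_dir),
        "--backbone-alignment", str(backbone_alignment),
        "--query-sequence-file", str(query_sequence_file),
        "--output-prefix", str(output_prefix),
        "--num-processes", str(num_processes),
        "--model", model,
        "--output-format", output_format,
    ]


cmd = build_hmmerge_command(
    "decomposed_alignments_dir/", "backbone_alignment", "Query_sequences",
    "outdir", model="DNA", output_format="FASTA", num_processes=10)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the cloned HMMerge directory. Building a list instead of a single string avoids shell quoting problems when paths contain spaces.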