PC_ali: a tool for evolutionary inference based on a hybrid protein sequence and structure similarity score

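Evolutionary inference depends heavily on the quality of multiple sequence alignments (MSAs), which is problematic for distantly related proteins. Because protein structure is more conserved than sequence, using structure alignments for distant homologs seems natural. However, structure alignments may not be suitable for inferring evolutionary relationships. The authors examined four protein similarity measures that depend on sequence and structure: the fraction of aligned residues, sequence identity, the fraction of superimposed residues, and contact overlap. Based on their main principal component, they propose a new hybrid protein sequence and structure similarity score, PC_sim. The corresponding divergence measure, PC_div, correlates most strongly with the divergences obtained from the individual similarity measures, which suggests that it infers accurate evolutionary divergences. They developed the program PC_ali, which uses PC_sim-based similarity matrices to construct protein MSAs either de novo or by modifying an input MSA. The program builds a starting MSA from the maximal cliques of the graph of pairwise alignments (PAs), then refines it by progressive alignment along the tree reconstructed with PC_div. Compared with eight state-of-the-art multiple structure alignment and sequence alignment tools, PC_ali achieves higher or comparable aligned fractions and structural scores.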
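To make the idea of a principal-component score concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the five rows of similarity values are made up for illustration, the function names `principal_weights` and `pc_score` are hypothetical, and the normalization (dividing by the sum of the weights so that four similarities of 1 give a score of 1) is an assumption rather than something stated in this post.

```python
import math

def principal_weights(rows, n_iter=200):
    """Return the unit-length first principal axis of `rows` via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the d similarity measures.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Fix the sign so that higher similarities give a higher score.
    if sum(v) < 0:
        v = [-x for x in v]
    return v

def pc_score(row, weights):
    """Weighted combination of the similarities (assumed normalization)."""
    return sum(w * x for w, x in zip(weights, row)) / sum(weights)

# Made-up values for five protein pairs: [aligned fraction, SI, TM, CO].
toy_pairs = [
    [0.95, 0.80, 0.90, 0.85],
    [0.90, 0.55, 0.80, 0.70],
    [0.80, 0.30, 0.65, 0.55],
    [0.70, 0.15, 0.50, 0.40],
    [0.60, 0.10, 0.40, 0.30],
]

weights = principal_weights(toy_pairs)
scores = [pc_score(row, weights) for row in toy_pairs]
for row, score in zip(toy_pairs, scores):
    print(row, round(score, 3))
```

Because the four toy measures rise and fall together, all four weights come out positive and the combined score ranks the pairs in the same order as each individual measure.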

Installation

Built in a WSL environment on Windows 11.

GitHub

https://github.com/ugobas/PC_ali

git clone https://github.com/ugobas/PC_ali.git
cd PC_ali/
make

> ./PC_ali

Starting ./PC_ali

help of program ./PC_ali

Author Ugo Bastolla Centro de Biologia Molecular Severo Ochoa (CSIC-UAM), Madrid, Spain

Email: <ubastolla@cbm.csic.es>

PC_ali performs hybrid multiple structure and sequence alignments based on the structure+sequence similarity score PC_sim, prints pairwise similarity scores and divergence scores and neighbor-joining phylogenetic tree obtained with the hybrid evolutionary divergence measure based on PC_sim. Optionally, it computes violations of the molecular clock for each pair of proteins.

It takes as input either not aligned sequences (option -seq) or MSA (option -ali). PDB file names must be specified as sequence name

It includes a modification of the needlemanwunsch aligner programmed by Dr. Andrew C. R. Martin in the Profit suite of programs, (c) SciTech Software 1993-2007

Usage:

PC_ali -seq <sequences in FASTA format, with names of PDB files>

-ali <MSA file in FASTA format, with names of PDB files>

# The pdb code is optionally followed by the chain index

# Ex: >1opd.pdb A or >1opdA or >1opd_A

-pdbdir <directory of pdb files> (default: current directory)

-pdbext <extension of pdb files> (default: none)

Computed similarity measures:

(1) Aligned fraction ali,

(2) Sequence identity SI,

(3) Contact overlap CO,

(4) TM-score TM (Zhang & Skolnick Proteins 2004 57:702)

(5) PC_sim, based on the main Principal Component of the four above similarity scores

They are printed in <>.prot.sim for all pairs of protein sequences, and also for multiple conformations of the same sequence (if present) if required with -print_sim

Computed divergence measures:

(1) Tajima-Nei divergence TN=-log((SI-S0)/(1-S0)) with S0=0.06 (Tajima F & Nei 1984, Mol Biol Evol 1:269),

(2) Contact_divergence CD=-log((q-q0(L))/(1-q0(L))) (Pascual-Garcia et al Proteins 2010 78:181-96),

(3) TM_divergence=-log((TM-TM0)/(1-TM0)), TM0=0.167.

(4) PC_divergence=-log((PC-PC0)/(1-PC0)), PC0 linear combination of S0, TM0, CO(L) and nali0=0.5.

They are printed in <>.prot.div for all pairs of protein sequences, and also for multiple conformations of the same sequence (if present) if required with -print_div

Flux of the program:

(1) In the modality -ali, the program starts from the pairwise alignments obtained from the input MSA. In the modality -seq the starting pairwise alignments are built internally.

(2) The program then modifies the pairwise alignments by targeting PC_sim. The similarity matrix is constructed recursively, using the input pairwise alignment for computing the shared contacts and the distance after optimal superimposition (maximizing the TM score) for all pairs of residues and obtaining a new alignment. Two iterations are usually enough for getting good results. Optionally, for the sake of comparison, the program can target the TM score (-ali_tm), the Contact Overlap (-ali_co) and the secondary structure superposition (-ali_ss).

(3) Then, the program builds the multiple alignment based on the maximal cliques of the pairwise alignments. This computation requires neither a guide tree nor gap penalty parameters and in most cases it is faster than the progressive multiple alignment.

(4) Finally, the program runs iteratively progressive multiple alignments using as guide tree the average linkage tree obtained with the PC_Div divergence measure of the previous step and using as starting alignment the previous multiple alignment. The best MSA is selected as the one with the maximum value of the average PC similarity score.

(5) The program prints the optimal MSA and the Neighbor Joining tree obtained from the corresponding PC_Div divergence measure.

(6) Optionally, if -print_pdb is set, the program prints the multiple superimposition obtained by maximizing the TM score

(7) Furthermore, if -print_cv is set, the program computes and prints for all four divergence measures the violations of the molecular clock averaged over all possible outgroups identified with the Neighbor-Joining criterion, and the corresponding significance score.

In the first pairwise phase, the program computes similarity and divergence scores for all pairs of protein structures. It then clusters all conformations of the same protein and computes the structural similarity (divergence) between two proteins as the maximum (minimum) across all the examined conformations.

The similarity and divergence scores are computed for the starting alignment, for the modified pairwise alignments that target different similarity scores (TM score, contact overlap and PC_sim) and for the best multiple alignment.

COMPILE:

>unzip PC_ali.zip

>make

>cp PC_ali ~/bin/ (or whatever path directory you like)

RUN:

>PC_ali -seq <sequence file> -pdbdir <path to PDB files>

EXAMPLE: PC_ali -seq 50044_Mammoth.aln -pdbdir <PDBPATH>

(all PDB files named in 50044_Mammoth.aln must be in <PDBPATH>)

OUTPUT (for each pair of proteins):

-------

MSA (.msa),

NJ tree (.tree),

structure similarity (.sim) and structure divergence scores (.div) for each protein pair,

correlations between different types of sequence and structure identity (.id), MSA of secondary structure (.ss.msa)

-------

==========================================================

Options:

-seq <sequences in FASTA format, with names of PDB files>

-ali <MSA file in FASTA format, with names of PDB files>

# The pdb code is optionally followed by the chain index

# Ex: >1opd.pdb A or >1opdA or >1opd_A

-pdbdir <directory of pdb files> (default: current directory)

-pdbext <extension of pdb files> (default: none)

#### Optional parameters:

-out <Name of output files> (default: alignment file)

-ali_tm ! Make pairwise alignments that target TM score

-ali_co ! Make pairwise alignments that target Cont Overlap

-ali_ss ! Make alignments that target sec.structure

-ss_mult ! target sec.structure with multiple alignment

-shift_max <Maximum shift for targeting sec.str.>

-print_pdb ! Print multiple structure superimposition

-print_sim ! Print similarity measures for all pairs

-print_div ! Print divergence measures for all pairs

-print_cv ! Print clock violations

-func <file with function similarity for pairs of proteins>
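The divergence measures listed in the help text share one functional form, -log((S - S0)/(1 - S0)), where S is a similarity and S0 its baseline value. The sketch below writes that form out in Python. The values S0=0.06 and TM0=0.167 are taken from the help text. The function names are hypothetical, and returning infinity when the similarity does not exceed the baseline is an assumption (the help text does not say how PC_ali handles that case). The length-dependent q0(L) and the linear combination PC0 are not given in this post, so they are left as plain arguments.

```python
import math

S0 = 0.06     # baseline sequence identity, from the help text
TM0 = 0.167   # baseline TM score, from the help text

def divergence(sim, sim0):
    """Generic form -log((sim - sim0) / (1 - sim0)).

    Assumption: similarities at or below the baseline map to infinity.
    """
    if sim <= sim0:
        return math.inf
    return -math.log((sim - sim0) / (1.0 - sim0))

def tajima_nei(seq_identity):
    """Tajima-Nei divergence TN from the sequence identity SI."""
    return divergence(seq_identity, S0)

def tm_divergence(tm_score):
    """TM divergence from the TM score."""
    return divergence(tm_score, TM0)

def contact_divergence(q, q0):
    """Contact divergence CD; q0 stands for the length-dependent q0(L)."""
    return divergence(q, q0)

print(tajima_nei(1.0))     # identical sequences: zero divergence
print(tajima_nei(0.53))    # (0.53 - 0.06) / 0.94 = 0.5, so this is about log(2)
print(tm_divergence(0.5))
```

With this form, a similarity of 1 gives zero divergence, and the divergence grows without bound as the similarity approaches its baseline.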

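Test run

PC_ali performs hybrid multiple structure and sequence alignments based on the structure+sequence similarity score PC_sim. In addition to the MSA, it outputs pairwise similarity scores, divergence scores, and a neighbor-joining phylogenetic tree obtained with the hybrid evolutionary divergence measure based on PC_sim, as well as a PDB file containing the multiple superimposed structures. Optionally, it computes violations of the molecular clock for each pair of proteins.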
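The help text says that the multiple alignment is first built from the maximal cliques of the pairwise alignments. As a rough illustration of that idea only, the sketch below treats each residue as a node, each aligned residue pair as an edge, and enumerates maximal cliques with the Bron-Kerbosch algorithm. The toy alignments, the function names, and the choice to treat cliques that contain one residue from every protein as candidate MSA columns are all assumptions made for illustration; how PC_ali orders cliques or resolves conflicts is not described in this post.

```python
from collections import defaultdict

def build_graph(aligned_pairs):
    """Undirected graph whose nodes are (protein, residue index) tuples."""
    adj = defaultdict(set)
    for u, v in aligned_pairs:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch and pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Toy pairwise alignments among proteins A, B and C (made-up data).
toy_pairs = [
    (("A", 1), ("B", 1)), (("A", 1), ("C", 1)), (("B", 1), ("C", 1)),  # consistent
    (("A", 2), ("B", 2)), (("B", 2), ("C", 3)), (("A", 2), ("C", 2)),  # inconsistent
]

cliques = maximal_cliques(build_graph(toy_pairs))
n_proteins = 3
full_columns = [c for c in cliques if len(c) == n_proteins]
print(sorted(sorted(c) for c in full_columns))
```

In the toy data, residue 1 of all three proteins is aligned consistently in every pairwise alignment and forms one clique of size three, while the second group disagrees about which residue of C is involved and only yields cliques of size two.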

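Specify the input alignment file. The PDB files for the sequences it names must be in the directory given by -pdbdir (here, the current directory). The test data set 50044_Mammoth.aln contains 16 sequences, and their PDB files are also placed in the repository root.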

./PC_ali -seq 50044_Mammoth.aln -pdbdir ./

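The run outputs the MSA (PCAli.fas), the NJ tree (PCAli.tree), the structure similarity (.sim) and structure divergence (.div) scores for each protein pair, the correlations between different types of sequence and structure identity (.id), and the MSA of secondary structure (_ss.msa).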
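As a quick way to inspect the resulting alignment, the following sketch parses a FASTA-format MSA and reports, for one pair of sequences, how many columns are aligned and how many of those are identical. It assumes that the output MSA is plain FASTA with `-` as the gap character. The helper names and the toy sequences are hypothetical, and the normalizations used here (aligned columns divided by the shorter ungapped length, identity divided by the number of aligned columns) are assumptions, since this post does not state how PC_ali normalizes its own scores.

```python
def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}; name is the first header word."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def pair_stats(row_a, row_b, gap="-"):
    """Aligned fraction and sequence identity for two rows of the same MSA."""
    if len(row_a) != len(row_b):
        raise ValueError("MSA rows must have the same length")
    aligned = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    identical = sum(x.upper() == y.upper() for x, y in aligned)
    shorter = min(len(row_a) - row_a.count(gap), len(row_b) - row_b.count(gap))
    return {
        "aligned_fraction": len(aligned) / shorter if shorter else 0.0,
        "seq_identity": identical / len(aligned) if aligned else 0.0,
    }

# Made-up two-sequence alignment standing in for a real output file.
toy_fasta = """\
>protA
MKT-LLV
>protB
MRTALL-
"""

msa = read_fasta(toy_fasta.splitlines())
stats = pair_stats(msa["protA"], msa["protB"])
print(stats)
```

For a real run, the same functions could be applied to the file on disk, for example `with open("PCAli.fas") as fh: msa = read_fasta(fh)`, using the output file name reported above.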

Output

Citation

PC_ali: a tool for improved multiple alignments and evolutionary inference based on a hybrid protein sequence and structure similarity score
Ugo Bastolla, David Abia, Oscar Piette
Bioinformatics, Volume 39, Issue 11, November 2023

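*1: (SI-S0)/(1-S0), the argument of the logarithm in the Tajima-Nei divergence TN

*2: q-q0(L), the numerator of the argument of the logarithm in the contact divergence CD

*3: (TM-TM0)/(1-TM0), the argument of the logarithm in TM_divergence

*4: (PC-PC0)/(1-PC0), the argument of the logarithm in PC_divergence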