AlphaFastPPi - macでインフォマティクス

Updated 2024/10/31; paper citation added; updated 11/02

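Discovering new protein-protein interactions (PPIs) across entire proteomes offers great potential for understanding novel protein functions and for elucidating system properties within and between organisms. Recent advances in computational structural biology, AlphaFold-Multimer in particular, have made this task easier, but scaling up to large screens remains a challenge that demands substantial computational resources.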

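The authors assessed how reducing the number of models generated by AlphaFold-Multimer from five to one affects the method's ability to distinguish true PPIs from false ones. The evaluation used a dataset of intra- and inter-species PPIs that includes proteins from bacteria and eukaryotes. They show that the reduced sampling does not compromise the accuracy of the method, which gives a five-times faster, efficient, and environmentally friendly solution for PPI prediction. The code used in the paper is available at https://github.com/MIDIfactory/AlphaFastPPi. The same can be done with the latest version of AlphaPulldown (https://github.com/KosinskiLab/AlphaPulldown).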

AlphaFastPPi is a Python package designed to streamline large-scale protein-protein interaction analysis with AlphaFold-Multimer. For each protein combination tested, AlphaFastPPi returns a single model. The same result can be obtained with recent versions of AlphaPulldown. To get a single model in the pulldown mode of AlphaPulldown v.1.0.4, pass the options --num_predictions_per_model=1, --model_names=model_1_multimer_v3, --num_cycle=1, --nopair_ms when running run_multimer_job.py.

Databases

AlphaFastPPi requires the AlphaFold databases. They come in two sizes: full (~2.62 TB) and reduced (~820 GB).

git clone https://github.com/kalininalab/alphafold_non_docker
cd alphafold_non_docker
./download_db.sh -d AF2_DB_dir

Downloading the full-size databases took about a day and a half, starting on a Saturday night.

> ls -lh --color=auto

Installation

Following the manual, I created a Python 3.10 environment and installed everything into it.

GitHub

mamba create -n AlphaFastPPi -c omnia -c bioconda -c conda-forge python==3.10 openmm==8.0 pdbfixer==1.9 kalign2 cctbx-base pytest importlib_metadata hhsuite
conda activate AlphaFastPPi
python3 -m pip install alphapulldown==1.0.4

pip install jax==0.4.23 jaxlib==0.4.23 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
mamba install -c anaconda cudnn -y # only if not already installed
mamba install -c nvidia -c conda-forge cuda-toolkit==11 -y # only if not already installed (check with "nvcc -V")
mamba install -c nvidia cuda-nvcc -y # only if not already installed


# AlphaFastPPi itself
git clone https://github.com/MIDIfactory/AlphaFastPPi.git
cd AlphaFastPPi/
create_individual_features.py --helpfull

> python3 AlphaFastPPi.py --help

flags:

AlphaFastPPi.py:
  -b,--bait_list: File containing a list of the names of the bait proteins (should have the same names used for the msa) [REQUIRED in pulldown mode]
    (a comma separated list)
  -d,--data_dir: Path to Alphafold databases directory [REQUIRED]
  --mode: <pulldown|all_vs_all>: Choose the mode of running multimer jobs [REQUIRED]
    (default: 'pulldown')
  -m,--monomer_objects_dir: A list of directories where monomer objects are stored [REQUIRED]
    (a comma separated list)
  --[no]no_pair_msa: do not pair the MSAs when constructing multimer objects
    (default: 'true')
  -o,--output_path: Folder where the data will be stored [REQUIRED]
  -l,--protein_list: File containing a list of the names of the proteins [REQUIRED]
    (a comma separated list)
  --[no]relaxation_step: Enable final relaxation step
    (default: 'false')
  -n,--seq_index: Index (number) of sequence in the fasta file to start from
    (an integer)

Try --helpfull to get a list of all flags.

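AlphaFastPPi supports two different modes.

pulldown: tests a list of bait proteins against a second list of candidate proteins.
all_vs_all: models every pair of proteins in a single list.

In both modes, the MSAs are built with AlphaPulldown, which computes and stores the features required for each protein.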
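
To get a feel for how many predictions each mode produces, the pairings can be sketched in a few lines of Python. This is only an illustration of the two modes as described here, not AlphaFastPPi's own code. The function names are hypothetical, and it is an assumption that all_vs_all skips self-pairs (homodimers) and treats A-B and B-A as the same pair.

```python
from itertools import combinations, product


def pulldown_pairs(baits, candidates):
    """Pair every bait with every candidate (bait x candidate)."""
    return list(product(baits, candidates))


def all_vs_all_pairs(proteins):
    """Every unordered pair of distinct proteins from a single list.

    Assumption: self-pairs are excluded and order does not matter.
    """
    return list(combinations(proteins, 2))


# Sequence names taken from the test run in this post.
baits = ["MPN394"]
candidates = ["MPN263", "MPN066", "Q0VC48"]

print(len(pulldown_pairs(baits, candidates)), "pulldown pairs")
print(len(all_vs_all_pairs(baits + candidates)), "all_vs_all pairs")
```

Under these assumptions, pulldown grows with the product of the two list sizes, while all_vs_all grows quadratically with the size of the single list.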

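Test run

Each pairing between bait.fasta (1 sequence) and candidates.fasta (3 sequences) is tested for interaction.

1. AlphaPulldown - For each query protein sequence, the preinstalled databases are searched with HMMER, and a multiple sequence alignment (MSA) is computed over all homologs found. Homologous structures that serve as templates for feature generation are also searched, and the results are stored in monomer feature files (.pkl). This step needs only the CPU. Specify both bait.fasta and candidates.fasta.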

create_individual_features.py\
 --fasta_paths=bait.fasta,candidates.fasta\
 --data_dir=/mnt/AlphaFoldDB\
 --output_dir=outdir\
 --max_template_date=2024-11-01\
 --use_mmseqs2=True

# Alternatively, use the "alphapulldown" Docker image
# Mount the AF2 DB as well with -v when running
docker run -itv $PWD:/data --rm -v /path/to/AF2_DB_dir:/AF2_DB  gallardoalba/alphapulldown:0.30.7
create_individual_features.py --help

create_individual_features.py --fasta_paths=bait.fasta,candidates.fasta --data_dir=/AF2_DB --output_dir=/data/outdir --max_template_date=2024-10-30 --use_mmseqs2=True

# Running the FASTA files one after another with the same output directory also works
create_individual_features.py --fasta_paths=bait.fasta --data_dir=/AF2_DB --output_dir=/data/outdir --max_template_date=2024-10-30 --use_mmseqs2=True

create_individual_features.py --fasta_paths=candidates.fasta --data_dir=/AF2_DB --output_dir=/data/outdir --max_template_date=2024-10-30 --use_mmseqs2=True

--fasta_paths single fasta file containing all the sequences to include in the analysis or several fasta files separated by comma (e.g. --fasta_paths=protein_A.fasta, protein_B.fasta). FASTA file should not contain any special characters (such as |, :, ;, #) or spaces.
--use_mmseqs2 when set to "True," mmseqs is executed remotely, which is a quick option and typically takes a few minutes per protein. Alternatively, you can set it to "False" to use HHblits locally.

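Example output (--use_mmseqs2=False)

The bait sequence is named MPN394; the candidate sequences are MPN263, MPN066, and Q0VC48.

Example output (--use_mmseqs2=True)

2. AlphaFastPPi - Run in pulldown mode. This requires the pickle files (.pkl) generated in the first step, plus lists of the protein combinations to predict. Concretely, pass the list of bait sequence names and the list of candidate sequence names through their respective options, and specify the AlphaPulldown output directory (outdir here) and the output directory for the predictions.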

python3 AlphaFastPPi.py --mode pulldown -l candidates.txt -b bait.txt -d AF2_DB/ -m outdir -o prediction

--mode Choose the mode of running multimer jobs [REQUIRED] (default: 'pulldown')
-b File containing a list of the names of the bait proteins (should have the same names used for the msa) [REQUIRED in pulldown mode]
-d Path to Alphafold databases directory
-m A list of directories where monomer objects are stored [REQUIRED] (a comma separated list)
-o Folder where the data will be stored [REQUIRED]
-l File containing a list of the names of the proteins [REQUIRED] (a comma separated list)
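
The bait.txt and candidates.txt lists passed to -b and -l can be derived from the FASTA files used in step 1. The following is a minimal sketch, assuming that each list file holds one sequence name per line and that the name is the header text after ">" up to the first whitespace. Both points are assumptions, as are the function names; check them against the names of the .pkl files in outdir.

```python
from pathlib import Path


def fasta_names(fasta_path):
    """Collect sequence names from a FASTA file.

    Assumption: the name is the header text after '>' up to the
    first whitespace character.
    """
    names = []
    for line in Path(fasta_path).read_text().splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.append(header.split()[0])
    return names


def write_name_list(fasta_path, list_path):
    """Write one sequence name per line (assumed list format)."""
    names = fasta_names(fasta_path)
    Path(list_path).write_text("\n".join(names) + "\n")
    return names
```

For the test run here, this would be called as write_name_list("bait.fasta", "bait.txt") and write_name_list("candidates.fasta", "candidates.txt").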

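For the three pairs, the run took about 10 minutes with the GPU enabled (RTX 3090) and 1-2 hours on the CPU (5995WX).

Output

For each protein-protein combination, a subdirectory named proteinA_and_proteinB is created, containing the following files.

A model in .pdb format
The corresponding .pkl file
A timings.json file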
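
After a large screen it can be handy to check which pairs actually produced a model. The sketch below walks the output folder and relies only on the proteinA_and_proteinB naming described above. The function name is hypothetical, and it assumes that no protein name itself contains the string "_and_".

```python
from pathlib import Path


def summarize_predictions(output_path):
    """Map each (proteinA, proteinB) pair to whether a .pdb model exists."""
    summary = {}
    for sub in sorted(Path(output_path).iterdir()):
        # Only look at subdirectories that follow the A_and_B naming.
        if not sub.is_dir() or "_and_" not in sub.name:
            continue
        first, second = sub.name.split("_and_", 1)
        summary[(first, second)] = any(sub.glob("*.pdb"))
    return summary
```

Calling summarize_predictions("prediction") returns a dictionary in which pairs mapped to False have no .pdb file and may need to be rerun.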

> ls -alth --color=auto prediction/

> ls -alth --color=auto prediction/MPN394_and_Q0VC48/

In addition, a table named output_name.tsv is saved in the current path.

The table contains the following metrics.

pDockQ
ipTM
ipTM+pTM
Average pLDDT
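
The table can be filtered by score with the Python standard library. This is a rough sketch: the column names "Name" and "pDockQ" and the sample values are assumptions made up for illustration, so compare them with the header line of the real .tsv before use.

```python
import csv
import io


def rows_at_or_above(tsv_text, column, threshold):
    """Return the rows whose value in `column` is >= threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[column]) >= threshold]


# Invented example table; the column names and values are NOT real output.
example = "Name\tpDockQ\nA_and_B\t0.61\nA_and_C\t0.12\n"

for row in rows_at_or_above(example, column="pDockQ", threshold=0.5):
    print(row["Name"], row["pDockQ"])
```

For a real run, read the file contents first, for example with pathlib.Path("output_name.tsv").read_text(), and pass the text to the function.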

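Note that sequence names containing special characters or periods (e.g. WP_xxxxxxx.1) appear to cause errors when running AlphaFastPPi.py. The protein FASTA does not have to be one-line; line breaks in the middle of a sequence are fine.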
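
Problematic names can be caught before the run with a small check. The sketch below uses a deliberately strict whitelist (ASCII letters, digits, and underscore); that whitelist is an assumption and not a rule documented by AlphaFastPPi, and the function names are hypothetical.

```python
import re

# Assumed-safe characters: ASCII letters, digits, and underscore.
UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9_]")


def problematic_names(names):
    """Return the names that contain at least one character outside the whitelist."""
    return [name for name in names if UNSAFE_CHARS.search(name)]


def sanitize(name):
    """Replace every character outside the whitelist with an underscore."""
    return UNSAFE_CHARS.sub("_", name)


print(problematic_names(["MPN394", "WP_xxxxxxx.1"]))
print(sanitize("WP_xxxxxxx.1"))
```

If names are renamed this way, the same renamed headers have to be used in the FASTA files given to create_individual_features.py, so that the .pkl file names and the list files stay consistent.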

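From the AlphaFastPPi and AlphaPulldown papers

Consistent with previous findings (Bryant et al. 2022), pDockQ outperformed ipTM. Raising the pDockQ threshold to 0.5 brought the specificity above 90%, although the sensitivity fell below 10%.
DeLong's test was run to compare five-model and one-model sampling for both ipTM and pDockQ, and showed no statistically significant difference between the AUCs.
Massive sampling (Wallner 2023), which has been shown to effectively improve model quality, would likely also improve the method's ability to distinguish interacting from non-interacting protein pairs, but its computational cost makes it inapplicable to large-scale studies.
The pDockQ score (predicted DockQ score) of a dimer is a confidence score for the interaction between the two proteins. pDockQ can distinguish acceptable models (DockQ≥0.23) from incorrect ones with high confidence (Bryant et al. 2022).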
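
For reference, pDockQ is a sigmoid fitted to the average interface pLDDT multiplied by the logarithm of the number of interface contacts. The sketch below uses the functional form and constants as reported in Bryant et al. (2022), written from memory, so please verify them against the paper. It is not taken from the AlphaFastPPi source, returning 0 when there are no contacts is an assumed convention, and how the interface contacts are counted is left to the caller.

```python
import math

# Sigmoid parameters as reported in Bryant et al. (2022); verify before use.
L_MAX, X0, K, B = 0.724, 152.611, 0.052, 0.018


def pdockq(avg_interface_plddt, n_interface_contacts):
    """pDockQ = L / (1 + exp(-k * (x - x0))) + b,
    with x = average interface pLDDT * log10(number of interface contacts).
    """
    if n_interface_contacts <= 0:
        return 0.0  # assumed convention: no interface, no score
    x = avg_interface_plddt * math.log10(n_interface_contacts)
    return L_MAX / (1.0 + math.exp(-K * (x - X0))) + B
```

With these constants the score is bounded between roughly 0.018 and 0.742, which is why a pDockQ threshold of 0.5 is already a fairly strict cutoff.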

References

Accelerating Protein-Protein Interaction screens with reduced AlphaFold-Multimer sampling

G. Bellinzona, D. Sassera, A.M.J.J Bonvin

bioRxiv, Posted July 05, 2024.

Accelerating protein–protein interaction screens with reduced AlphaFold-Multimer sampling
Greta Bellinzona, Davide Sassera, Alexandre M J J Bonvin
Bioinformatics Advances, Volume 4, Issue 1, 2024, vbae153

AlphaPulldown-a python package for protein-protein interaction screens using AlphaFold-Multimer

Dingquan Yu, Grzegorz Chojnowski, Maria Rosenthal, Jan Kosinski

Bioinformatics. 2023 Jan 1;39(1):btac749.