ppiBTEP

A Siamese (twin-branch) protein-protein interaction classifier inspired by the ESM-1b transformer architecture (Rives et al., 2021), but substantially modified and trained from scratch rather than fine-tuned from the released ESM-1b checkpoint. Also designated SiameseBTPE (BERT-Twin Protein Encoder).

Overview

ppiBTEP processes each protein independently through a shared ESM-1b-style transformer encoder -- no cross-sequence attention is used between the two proteins. Each branch extracts the [CLS] token embedding from the final transformer layer, the two embeddings are concatenated, and a dropout + linear classification head produces binary interaction predictions with softmax probabilities.

Unlike the cross-encoding approach (see ppiDCE), ppiBTEP must capture interaction-predictive features entirely from each protein's own sequence context. This makes it faster per pair and allows protein representations to be precomputed and reused, at the cost of not modeling direct inter-protein residue dependencies.
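
The reuse benefit can be made concrete with a small sketch. It assumes an all-vs-all screen over a handful of proteins and uses a stand-in encoder; `toy_encode`, `EmbeddingCache`, and `pair_features` are hypothetical names for illustration and do not come from the repository.

```python
from itertools import combinations


def toy_encode(seq: str) -> tuple:
    """Stand-in for one transformer branch (hypothetical): returns a fake 'embedding'."""
    return (len(seq), seq.count("K"))


class EmbeddingCache:
    """Memoizes per-protein embeddings so each sequence is encoded only once."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.store = {}
        self.encoder_calls = 0

    def get(self, seq: str) -> tuple:
        if seq not in self.store:
            self.encoder_calls += 1
            self.store[seq] = self.encoder(seq)
        return self.store[seq]


def pair_features(cache: EmbeddingCache, seq_a: str, seq_b: str) -> tuple:
    """Concatenate the two cached embeddings, mirroring Concat [CLS_A, CLS_B]."""
    return cache.get(seq_a) + cache.get(seq_b)


proteins = ["MKLR", "MSEDF", "MQAG", "MTRRL"]
cache = EmbeddingCache(toy_encode)
pairs = list(combinations(proteins, 2))
features = [pair_features(cache, a, b) for a, b in pairs]

# Six pairs are scored, but the encoder runs only once per protein.
print(len(pairs), cache.encoder_calls)
```

With a cross-encoder, every one of the six pairs would need its own joint forward pass; here only the lightweight concatenation step runs per pair.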

The model was developed for the Prochlorococcus marinus MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside ppiGPLM and ppiDCE) for computational PPI screening.

Architecture

Parameter	Value
Foundation	ESM-1b-inspired transformer (Rives et al., 2021) -- substantially modified, trained from scratch
Strategy	Siamese / twin-branch
Layers	12 default; 6, 8, 12, 16, or 18 selectable via --num_layers
Classification	Concat [CLS_A, CLS_B] -> Dropout(0.1) -> Linear -> 2
Max sequence length	1,024 tokens
Optimizer	AdamW (lr = 1 x 10^-5)
Loss	Cross-Entropy

Siamese vs Cross-Encoder

	ppiDCE (Cross-Encoder)	ppiBTEP (Siamese)
Input	`[CLS] Seq_A [SEP] Seq_B` (joint)	`[CLS] Seq_A` and `[CLS] Seq_B` (separate)
Cross-attention	Full bidirectional at every layer	None
Classification	Single [CLS] -> Linear	Concat [CLS_A, CLS_B] -> Linear
Complexity	O((n+m)^2)	O(n^2) + O(m^2)
Speed	Slower (joint encoding)	Faster (independent, reusable)
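
As a worked example of the Complexity row, the sketch below counts query-key score pairs per self-attention layer for two sequences of lengths n and m. It ignores special tokens, constant factors, and the feed-forward cost, and the lengths are made-up values.

```python
def cross_encoder_cost(n: int, m: int) -> int:
    """Score pairs when both sequences share one attention window: (n + m)^2."""
    return (n + m) ** 2


def siamese_cost(n: int, m: int) -> int:
    """Score pairs when each sequence attends only to itself: n^2 + m^2."""
    return n ** 2 + m ** 2


n, m = 300, 500  # illustrative residue counts
joint = cross_encoder_cost(n, m)
twin = siamese_cost(n, m)

# The gap is exactly the 2*n*m cross terms that the Siamese design never computes.
print(joint, twin, joint - twin)
```

For equal-length proteins the joint encoding costs twice as many attention scores as the two independent branches combined.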

Installation

Prerequisites

Python 3.10+
CUDA-capable GPU (recommended)
conda (recommended) or pip

Setup

# Clone the repository
git clone https://github.com/kouroshSA/ppiBTEP.git
cd ppiBTEP

# Create a conda environment
conda create -n esm python=3.10
conda activate esm
pip install -r requirements.txt

Repository Structure

ppiBTEP/
|-- train_ppiBTPE3b.py                   # Training script
|-- inference_ppiBTPE_2GPU.py            # Batch inference script (multi-GPU)
|-- roc_analysis_color_threshold_F1e.py  # ROC curve analysis with F1 optimization
|-- assets/
|   +-- ppiBTEP.png                      # ASCII workflow diagram
|-- requirements.txt
|-- LICENSE
+-- README.md

Usage

Data Format

Training and inference use CSV files with the columns seq1, seq2, and label:

seq1, seq2: Amino acid sequences
label: 0 or enemies (non-interacting); 1 or friends (interacting)

Inference input needs only the first two columns.
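
A minimal sketch of reading a labeled file and mapping both label spellings to integers is shown below. The helper name `read_labeled_pairs` and the case-insensitive matching are assumptions made for illustration; the training script's own parsing may differ.

```python
import csv
import io

# Both spellings described above map to the same integer class.
LABEL_MAP = {"0": 0, "enemies": 0, "1": 1, "friends": 1}


def read_labeled_pairs(handle):
    """Return (seq1, seq2, label) tuples with labels normalized to 0 or 1."""
    rows = []
    for row in csv.DictReader(handle):
        raw = row["label"].strip().lower()
        if raw not in LABEL_MAP:
            raise ValueError(f"unrecognized label: {row['label']!r}")
        rows.append((row["seq1"].strip(), row["seq2"].strip(), LABEL_MAP[raw]))
    return rows


# In-memory stand-in for train.csv, with shortened made-up sequences.
example = io.StringIO("seq1,seq2,label\nMKLR,MSEDF,1\nMQAG,MTRRL,enemies\n")
labeled = read_labeled_pairs(example)
print(labeled)
```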

Training

# Train from scratch with 12 layers
python train_ppiBTPE3b.py \
    --train_file train.csv \
    --val_file val.csv \
    --model_config facebook/esm1b_t33_650M_UR50S \
    --num_layers 12 \
    --freeze_layers 0 \
    --epochs 20 \
    --batch_size 2 \
    --learning_rate 1e-5 \
    --max_length 1024 \
    --output_dir ./out \
    --device cuda

Key training options

--num_layers N: Total transformer layers (6, 8, 12, 16, or 18)
--freeze_layers N: Freeze bottom N layers (use 0 for training from scratch)
--checkpoint path.pth: Resume from a saved checkpoint
--model_config: ESM model config (default: facebook/esm1b_t33_650M_UR50S)

Important: When training from scratch, use --freeze_layers 0 to ensure all layers (including embeddings) remain trainable. The default is 20, which would freeze most layers.

Quick start: fetch the checkpoint from Hugging Face

The released MED4 checkpoint (checkpoints/ppiBTPE_epoch_4.pth, 12-layer) lives on this Hugging Face repo. Pull it without cloning the GitHub mirror:

from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="kouroshSA/ppiBTEP",
    filename="checkpoints/ppiBTPE_epoch_4.pth",
)
print(ckpt_path)   # pass this string to --model_path

inference_ppiBTPE_2GPU.py takes the checkpoint path directly through its --model_path argument, so no rename or specific directory layout is required; point it at the file you just downloaded. Pass --num_layers 12 to match the architecture this checkpoint was trained with.

Input file format

The inference script expects a CSV with two columns of plain amino-acid sequences, one protein pair per row, with no delimiter tokens, length markers, or chevrons:

seq1,seq2
MKLR...QSH,MSEDF...VKN
MQAG...PIA,MTRRL...EEP

A ready-made example is shipped with the repo: MED4-PPIs-low-confidence_ppiTEPM_prompts.csv. The labeled PRS/RRS reference sets (MED4_PRS_100.csv, MED4_RRS_100.csv) include a third label column, which the inference script ignores because it reads only the first two columns.
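
Before a long inference run, it can help to confirm that a file really contains plain sequences. The sketch below flags unexpected characters in the first two columns. The helper `find_bad_rows` is hypothetical, and the allowed alphabet (the 20 standard residues plus X, B, U, Z, O) is an assumption rather than something the inference script is known to enforce.

```python
import csv
import io

# Assumed alphabet: 20 standard amino acids plus common ambiguity/rare codes.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" + "XBUZO")


def find_bad_rows(handle):
    """Return (row_number, column, offending_chars) for suspicious sequences."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the seq1,seq2 header line
    problems = []
    for row_number, row in enumerate(reader, start=2):
        # Only the first two columns matter; any label column is skipped.
        for column, seq in zip(("seq1", "seq2"), row[:2]):
            bad = sorted(set(seq.strip().upper()) - ALLOWED)
            if bad:
                problems.append((row_number, column, "".join(bad)))
    return problems


# Made-up input: the second data row contains chevrons and an embedded space.
sample = io.StringIO("seq1,seq2\nMKLRQSH,MSEDFVKN\n<MQAG>,MTRRL EEP\n")
problems = find_bad_rows(sample)
print(problems)
```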

Inference

python inference_ppiBTPE_2GPU.py \
    --model_path checkpoints/ppiBTPE_epoch_4.pth \
    --model_config facebook/esm1b_t33_650M_UR50S \
    --num_layers 12 \
    --input_file MED4-PPIs-low-confidence_ppiTEPM_prompts.csv \
    --output_file predictions.csv \
    --batch_size 4 \
    --max_length 1024 \
    --device cuda

Multi-GPU inference:

python inference_ppiBTPE_2GPU.py \
    --model_path checkpoints/ppiBTPE_epoch_4.pth \
    --model_config facebook/esm1b_t33_650M_UR50S \
    --num_layers 12 \
    --input_file MED4-PPIs-low-confidence_ppiTEPM_prompts.csv \
    --output_file predictions.csv \
    --device cuda:0,1

Output CSV columns: seq1, seq2, Prediction, Probability_Friends, Probability_Enemies
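
One way to post-process the output is to keep only high-confidence candidate interactions. The sketch below assumes that Probability_Friends parses as a float between 0 and 1; the helper `top_candidates`, the 0.9 cutoff, and the sample rows (including the values shown in the Prediction column) are made up for illustration.

```python
import csv
import io


def top_candidates(handle, min_prob=0.9):
    """Return (seq1, seq2, probability) tuples at or above min_prob, best first."""
    hits = []
    for row in csv.DictReader(handle):
        prob = float(row["Probability_Friends"])
        if prob >= min_prob:
            hits.append((row["seq1"], row["seq2"], prob))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# In-memory stand-in for predictions.csv with placeholder values.
demo = io.StringIO(
    "seq1,seq2,Prediction,Probability_Friends,Probability_Enemies\n"
    "MKLR,MSEDF,1,0.95,0.05\n"
    "MQAG,MTRRL,0,0.20,0.80\n"
    "MKLR,MTRRL,1,0.99,0.01\n"
)
hits = top_candidates(demo)
print(hits)
```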

ROC Analysis

Evaluate model predictions using ROC curve analysis with threshold-colored visualization and F1 optimization:

python roc_analysis_color_threshold_F1e.py \
    --input_csv probabilities.csv \
    --output_file roc_curve.png

The input CSV should have two columns of predicted probabilities: PRS (positive reference set) and RRS (random reference set, treated as negatives).
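
To illustrate what F1-based threshold selection means for such data, the sketch below scans candidate thresholds over PRS and RRS scores and keeps the one with the highest F1. It is a simplified illustration under the assumption that PRS entries are positives and RRS entries are negatives, not the implementation in roc_analysis_color_threshold_F1e.py; the function name and scores are made up.

```python
def f1_optimal_threshold(prs_scores, rrs_scores):
    """Return (best_f1, threshold), calling a pair positive if score >= threshold."""
    best_f1, best_threshold = 0.0, None
    for threshold in sorted(set(prs_scores) | set(rrs_scores)):
        tp = sum(score >= threshold for score in prs_scores)  # true positives
        fp = sum(score >= threshold for score in rrs_scores)  # false positives
        fn = len(prs_scores) - tp                             # missed positives
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold


prs = [0.9, 0.8, 0.7, 0.4]  # made-up probabilities for known interactors
rrs = [0.6, 0.3, 0.2, 0.1]  # made-up probabilities for random pairs
best_f1, best_threshold = f1_optimal_threshold(prs, rrs)
print(best_threshold, round(best_f1, 3))
```

In this toy data the best threshold is 0.4: it recovers all four PRS pairs while admitting a single RRS pair.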

Architecture Diagram

The ASCII workflow diagram (assets/ppiBTEP.png) covers:

A. Siamese input strategy (independent per-protein encoding)
B. Model architecture (twin ESM-1b-style branches + concat classification head)
C. Training pipeline
D. Inference pipeline (multi-GPU)

Note: the diagram shows Softmax in the classification head for clarity, but the implementation returns raw logits. Softmax is applied implicitly by CrossEntropyLoss during training and explicitly during inference.
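
The explicit inference-time step amounts to the standard softmax over the two logits, sketched here in plain Python. The logit values are made up, and the class order [enemies, friends] is an assumption based on the 0/1 label convention.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [-1.2, 2.3]  # hypothetical raw head output for one protein pair
prob_enemies, prob_friends = softmax(logits)

# The two probabilities sum to 1; the larger logit gets the larger share.
print(prob_enemies, prob_friends)
```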

Citation

If you use this software, please cite:

Daakour, S. et al. (2026).

License

This project is licensed under the MIT License. See LICENSE for details.
