ppiDCE
A dual cross-encoder for binary protein-protein interaction (PPI) classification, inspired by the ESM-1b transformer architecture (Rives et al., 2021) but substantially modified and trained from scratch rather than fine-tuned from the released ESM-1b checkpoint.
Overview
ppiDCE adapts the ESM-1b transformer architecture -- a single-sequence masked language model with no native PPI capability -- for protein-protein interaction prediction by exploiting the tokenizer's sentence-pair encoding mode and training the modified model from scratch on PPI data. Both protein sequences are concatenated into a single input as [CLS] Seq_A [SEP] Seq_B [EOS], enabling full bidirectional cross-attention between the two sequences at every transformer layer. The [CLS] token representation from the final layer captures joint inter-protein features and is passed through a dropout + linear classification head to produce binary interaction predictions with softmax probabilities.
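The pair layout described above can be pictured with a small sketch. It is an illustration only: the function name `build_pair_input` is hypothetical, and it assumes one token per residue. In practice the tokenizer's sentence-pair mode builds this sequence and maps tokens to IDs internally.

```python
# Hypothetical illustration of the pair layout; not the repository's code.
CLS, SEP, EOS = "[CLS]", "[SEP]", "[EOS]"

def build_pair_input(seq_a: str, seq_b: str) -> list[str]:
    """Return one token list laid out as [CLS] Seq_A [SEP] Seq_B [EOS]."""
    # Assumption: each amino-acid residue becomes exactly one token.
    return [CLS, *seq_a, SEP, *seq_b, EOS]

tokens = build_pair_input("MKLR", "MSE")
print(tokens)
print(len(tokens))  # 4 + 3 residues plus 3 special tokens
```

Because both proteins sit in one sequence, every residue of Seq_A can attend to every residue of Seq_B, and the first position ([CLS]) is the one read out by the classification head.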
The model was developed for the Prochlorococcus marinus MED4 interactome, where it serves as one component of a tri-model consensus framework (alongside ppiGPLM and ppiBTEP) for computational PPI screening.
Architecture
| Parameter | Value |
|---|---|
| Foundation | ESM-1b-inspired transformer (Rives et al., 2021) -- substantially modified, trained from scratch |
| Strategy | Cross-encoding (sentence-pair) |
| Layers | 12 (configurable) |
| Classification | [CLS] -> Dropout(0.1) -> Linear -> 2 |
| Max sequence length | 1,024 tokens |
| Optimizer | AdamW (lr = 2 x 10^-5) |
| Loss | Cross-Entropy |
Cross-Encoding vs Single-Sequence
Unlike the original ESM-1b, which processes one protein at a time, ppiDCE feeds both proteins as a single concatenated input. This enables inter-protein residue-residue attention at every transformer layer -- the most expressive strategy for modeling pairwise interactions, at the cost of O((n+m)^2) attention complexity, where n and m are the lengths of the two sequences.
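To make the cost concrete, the sketch below counts attention score entries per head for a joint input versus two separately encoded sequences. The function name is hypothetical and special tokens are ignored for simplicity, so treat the numbers as rough orders of magnitude rather than measurements of ppiDCE.

```python
def attention_pairs(n: int, m: int) -> dict[str, int]:
    """Count query-key pairs for joint versus separate encoding."""
    joint = (n + m) ** 2          # one concatenated input of n + m residues
    separate = n ** 2 + m ** 2    # each protein encoded on its own
    # The difference, 2 * n * m, is the inter-protein attention that
    # cross-encoding adds.
    return {"joint": joint, "separate": separate, "cross_terms": joint - separate}

# Example lengths are arbitrary.
print(attention_pairs(300, 400))
# {'joint': 490000, 'separate': 250000, 'cross_terms': 240000}
```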
Installation
Prerequisites
- Python 3.10+
- CUDA-capable GPU (recommended)
- conda (recommended) or pip
Setup
# Clone the repository
git clone https://github.com/kouroshSA/ppiDCE.git
cd ppiDCE
# Create a conda environment
conda create -n esm python=3.10
conda activate esm
pip install -r requirements.txt
Repository Structure
ppiDCE/
|-- train_ppiDCE.py # Training script
|-- inference_ppiDCE.py # Batch inference script
|-- roc_analysis_color_threshold_F1e.py # ROC curve analysis with F1 optimization
|-- assets/
| +-- ppiDCE.png # ASCII workflow diagram
|-- requirements.txt
|-- LICENSE
+-- README.md
Usage
Data Format
Training and inference use CSV files with columns: protein1_seq, protein2_seq, label
- protein1_seq, protein2_seq: Amino acid sequences
- label: 0 (non-interacting) or 1 (interacting)
For inference-only input, only the first two columns are required.
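A quick pre-flight check of a CSV against this format might look like the sketch below. The helper `load_pairs` is hypothetical and is not part of the repository; the actual scripts may read and validate columns differently.

```python
import csv
import io

REQUIRED = ("protein1_seq", "protein2_seq")

def load_pairs(handle, require_label=True):
    """Read PPI rows, checking the columns and label values described above."""
    reader = csv.DictReader(handle)
    needed = REQUIRED + (("label",) if require_label else ())
    missing = [c for c in needed if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not all(row[c] for c in REQUIRED):
            raise ValueError(f"line {line_no}: empty sequence")
        if require_label and row["label"] not in ("0", "1"):
            raise ValueError(f"line {line_no}: label must be 0 or 1")
        rows.append(row)
    return rows

# Toy sequences for illustration only.
sample = io.StringIO("protein1_seq,protein2_seq,label\nMKLR,MSED,1\nMQAG,MTRR,0\n")
rows = load_pairs(sample)
print(len(rows))
```

For inference-only files, the same helper can be called with `require_label=False`.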
Training
# Train from scratch with 12 layers
python train_ppiDCE.py \
--train_file train.csv \
--val_file val.csv \
--model_config facebook/esm1b_t33_650M_UR50S \
--from_scratch \
--num_layers 12 \
--epochs 10 \
--batch_size 2 \
--learning_rate 2e-5 \
--max_length 1024 \
--output_dir ./out \
--device cuda
Key training options
- --from_scratch: Initialize the ESM backbone with random weights instead of loading pretrained ESM-1b. Useful when you suspect single-sequence pretraining priors are inappropriate for your task.
- --num_layers N: Set total transformer layers when training from scratch
- --freeze_layers N: Freeze bottom N layers during fine-tuning
- --add_layers N: Append extra transformer layers on top
- --checkpoint path.pth: Resume from a saved checkpoint
- --suppress_warnings: Suppress tokenizer truncation warnings
Quick start: fetch the checkpoint from Hugging Face
The released MED4 checkpoint (checkpoints/ppiDCE_epoch8.pth, 12-layer)
lives on this Hugging Face repo. Pull it without cloning the GitHub mirror:
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="kouroshSA/ppiDCE",
    filename="checkpoints/ppiDCE_epoch8.pth",
)
print(ckpt_path) # pass this string to --model_path
inference_ppiDCE.py takes the checkpoint path as a direct --model_path
argument, so no rename or specific directory layout is required: point
it straight at the file you just downloaded.
Input file format
The inference script expects a CSV with two columns of plain amino-acid sequences (one protein pair per row; no delimiter tokens, no length markers, no chevrons):
seq1,seq2
MKLR...QSH,MSEDF...VKN
MQAG...PIA,MTRRL...EEP
A ready-made example is shipped with the repo:
MED4-PPIs-low-confidence_ppiTEPM_prompts.csv.
The labeled PRS/RRS reference sets (MED4_PRS_100.csv, MED4_RRS_100.csv)
include a third label column, which the inference script ignores; only
the first two columns are read.
Inference
python inference_ppiDCE.py \
--model_path checkpoints/ppiDCE_epoch8.pth \
--model_config facebook/esm1b_t33_650M_UR50S \
--input_file MED4-PPIs-low-confidence_ppiTEPM_prompts.csv \
--output_file predictions.csv \
--batch_size 4 \
--max_length 1024 \
--device cuda
Output CSV columns: seq1, seq2, pred_label, prob_0, prob_1
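As a sketch of downstream use, the output can be filtered for confidently predicted interactions with the standard library alone. The helper name and the 0.9 cut-off are assumptions chosen for illustration, not recommendations from the authors.

```python
import csv
import io

def high_confidence_pairs(handle, threshold=0.9):
    """Keep rows whose interaction probability (prob_1) meets the threshold."""
    reader = csv.DictReader(handle)
    return [row for row in reader if float(row["prob_1"]) >= threshold]

# Stand-in for open("predictions.csv"); the values are made up.
sample = io.StringIO(
    "seq1,seq2,pred_label,prob_0,prob_1\n"
    "MKLR,MSED,1,0.05,0.95\n"
    "MQAG,MTRR,0,0.80,0.20\n"
)
hits = high_confidence_pairs(sample)
print([(row["seq1"], row["seq2"]) for row in hits])
```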
ROC Analysis
Evaluate model predictions using ROC curve analysis with threshold-colored visualization and F1 optimization:
python roc_analysis_color_threshold_F1e.py \
--input_csv probabilities.csv \
--output_file roc_curve.png
The input CSV should have two columns: PRS (positive) and RRS (random/negative) probability values.
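The idea behind F1-based threshold selection can be sketched in a few lines. This is a generic illustration under the assumption that PRS values are positives and RRS values are negatives; the function name is hypothetical, and roc_analysis_color_threshold_F1e.py may compute or tie-break differently.

```python
def best_f1_threshold(prs, rrs):
    """Return (threshold, f1) maximizing F1 over the observed probabilities."""
    best_t, best_f1 = None, 0.0
    for t in sorted(set(prs) | set(rrs)):
        tp = sum(p >= t for p in prs)   # positives called interacting
        fn = len(prs) - tp              # positives missed
        fp = sum(r >= t for r in rrs)   # negatives called interacting
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:                # first maximum wins on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Made-up probabilities for illustration.
prs = [0.9, 0.8, 0.4]
rrs = [0.6, 0.3, 0.1]
threshold, f1 = best_f1_threshold(prs, rrs)
print(threshold, round(f1, 3))  # 0.4 0.857
```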
Architecture Diagram
The ASCII workflow diagram (assets/ppiDCE.png) covers:
- A. Cross-encoding input strategy
- B. Model architecture (ESM-1b-style backbone + classification head)
- C. Training pipeline
- D. Inference pipeline
Note: the diagram shows Softmax in the classification head for clarity, but the implementation returns raw logits; softmax is applied implicitly by CrossEntropyLoss during training and explicitly during inference.
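The explicit inference-time step amounts to the following, shown here in plain Python rather than with the framework the scripts use. The logit values are invented, and taking pred_label as the argmax of the two probabilities is an assumption about the inference script.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]                 # made-up [non-interacting, interacting] scores
probs = softmax(logits)              # corresponds to prob_0, prob_1
pred_label = probs.index(max(probs))
print(probs, pred_label)
```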
Citation
If you use this software, please cite:
Daakour, S. et al. (2026).
License
This project is licensed under the MIT License. See LICENSE for details.
