PDeepPP model

PDeepPP is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from ESM and incorporating both transformer and convolutional neural network (CNN) architectures, PDeepPP provides a robust framework for analyzing protein sequences in various contexts.

Model description

PDeepPP is a flexible model architecture that integrates transformer-based self-attention mechanisms with convolutional operations to capture both local and global sequence features. The model consists of three modules, sketched in code after the list below:

  1. A Self-Attention Global Features module for capturing long-range dependencies.
  2. A TransConv1d module, combining transformers and convolutional layers.
  3. A PosCNN module, which applies position-aware convolutional operations for feature extraction.
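
The following is a minimal, illustrative sketch of how these three modules might compose, assuming mean-pooled branch outputs fused by a linear classifier. The class names mirror the list above, but all layer sizes, kernel widths, and the fusion strategy are assumptions, not the published implementation.

import torch
import torch.nn as nn

class SelfAttentionGlobalFeatures(nn.Module):
    # Global-feature branch: multi-head self-attention over residue
    # embeddings to capture long-range dependencies (sizes are assumptions).
    def __init__(self, embed_dim=1280, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)

class TransConv1d(nn.Module):
    # Transformer encoder followed by a 1D convolution over the sequence axis.
    def __init__(self, embed_dim=1280, num_heads=8, conv_channels=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.encoder(x)  # (batch, seq_len, embed_dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class PosCNN(nn.Module):
    # Position-aware CNN: a learned positional bias is added before the
    # convolution so filters are sensitive to where they sit in the window.
    def __init__(self, embed_dim=1280, conv_channels=256, target_length=33):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, target_length, embed_dim))
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=5, padding=2)

    def forward(self, x):
        x = x + self.pos[:, : x.size(1)]
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class PDeepPPSketch(nn.Module):
    # Hypothetical fusion: mean-pool each branch, classify the concatenation.
    def __init__(self, embed_dim=1280, conv_channels=256, num_classes=2):
        super().__init__()
        self.global_attn = SelfAttentionGlobalFeatures(embed_dim)
        self.transconv = TransConv1d(embed_dim, conv_channels=conv_channels)
        self.poscnn = PosCNN(embed_dim, conv_channels=conv_channels)
        self.classifier = nn.Linear(embed_dim + 2 * conv_channels, num_classes)

    def forward(self, input_embeds):  # (batch, seq_len, embed_dim)
        pooled = torch.cat([
            self.global_attn(input_embeds).mean(dim=1),
            self.transconv(input_embeds).mean(dim=1),
            self.poscnn(input_embeds).mean(dim=1),
        ], dim=-1)
        return {"logits": self.classifier(pooled)}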

The model is trained with a loss function that combines classification loss and additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's transformers library, allowing seamless integration with other tools and workflows.
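
The card does not spell out the regularization terms. As a hedged sketch, the combined objective below pairs standard cross-entropy with an L2 weight penalty; both the penalty choice and the reg_weight value are assumptions.

import torch.nn.functional as F

def combined_loss(logits, labels, model, reg_weight=1e-4):
    # Classification term: cross-entropy against the PTM/BPS labels.
    cls_loss = F.cross_entropy(logits, labels)
    # Illustrative regularization term: L2 penalty over all parameters.
    # The actual extra terms used in training may differ.
    l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
    return cls_loss + reg_weight * l2_penalty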

Intended uses

PDeepPP was developed and validated using PTM and BPS datasets, but its applications are not limited to these specific tasks. Leveraging its flexible architecture and robust feature extraction capabilities, PDeepPP can be applied to a wide range of protein sequence-related analysis tasks. Specifically, the model has been validated on the following datasets:

  1. PTM datasets: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
  2. BPS datasets: Used for analyzing biologically active protein sequences (BPS) to extract biologically meaningful regions for downstream analyses.

Although the model was trained and validated on PTM and BPS datasets, PDeepPP's architecture enables users to generalize and extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses.

Key features

  • Dataset support: PDeepPP is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
  • Task flexibility: The model is not limited to PTM and BPS tasks. Users can adapt PDeepPP to other protein sequence-based tasks by customizing input data and task objectives.
  • PTM mode: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity.
  • BPS mode: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features (both modes are illustrated in the sketch after this list).
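
As a concrete illustration of the two modes, the helpers below extract fixed-length windows centered on S/T/Y residues (PTM mode) and sliding subsequences (BPS mode). The real PDeepPPProcessor may pad and slice differently; the window size, stride, and padding behavior here are assumptions.

def ptm_windows(sequence, target_length=33, pad_char="X", residues="STY"):
    # PTM mode: one fixed-length window centered on each S/T/Y residue,
    # padded with 'X' at the edges so every window has the same length.
    half = target_length // 2
    padded = pad_char * half + sequence + pad_char * half
    return [padded[i:i + target_length]
            for i, aa in enumerate(sequence) if aa in residues]

def bps_windows(sequence, target_length=33, stride=1, pad_char="X"):
    # BPS mode: sliding subsequences over the whole protein; sequences
    # shorter than the window are right-padded to target_length.
    if len(sequence) <= target_length:
        return [sequence.ljust(target_length, pad_char)]
    return [sequence[i:i + target_length]
            for i in range(0, len(sequence) - target_length + 1, stride)]

print(ptm_windows("MKSTY", target_length=5))                 # windows around S, T, Y
print(bps_windows("ESHINQKWVCK", target_length=5, stride=3)) # sliding subsequences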

How to use

To use PDeepPP, install the required dependencies: torch, transformers, and fair-esm (which provides the esm package imported below):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers fair-esm

Before proceeding, make sure the DataProcessor_pdeeppp.py and Pretraining_pdeeppp.py files are in the same directory as your script. Here is an example of how to use PDeepPP to process protein sequences and obtain predictions:

import torch
import esm
from DataProcessor_pdeeppp import PDeepPPProcessor
from Pretraining_pdeeppp import PretrainingPDeepPP
from transformers import AutoModel

# Global parameter settings
device = torch.device("cpu")
pad_char = "X"  # Padding character
target_length = 33  # Target length for sequence padding
mode = "BPS"  # Mode setting (only configured in example.py)
esm_ratio = 0.95  # Ratio for ESM embeddings

# Load the PDeepPP model
model_name = "fondress/PDeepPP_umami"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)  # Directly load the model

# Initialize the PDeepPPProcessor
processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)

# Example protein sequences (test sequences)
protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]

# Preprocess the sequences
inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt")  # Dynamic mode parameter
processed_sequences = inputs["raw_sequences"]

# Load the ESM model
esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
esm_model = esm_model.to(device)
esm_model.eval()

# Initialize the PretrainingPDeepPP module
pretrainer = PretrainingPDeepPP(
    embedding_dim=1280, 
    target_length=target_length, 
    esm_ratio=esm_ratio, 
    device=device
)

# Extract the vocabulary and ensure the padding character 'X' is included
vocab = set("".join(protein_sequences))
vocab.add(pad_char)  # Add the padding character

# Generate pretrained features using the PretrainingPDeepPP module
pretrained_features = pretrainer.create_embeddings(
    processed_sequences, vocab, esm_model, esm_alphabet
)

# Ensure pretrained features are on the same device
inputs["input_embeds"] = pretrained_features.to(device)

# Perform prediction (no gradients are needed at inference time)
model.eval()
with torch.no_grad():
    outputs = model(input_embeds=inputs["input_embeds"])  # Use pretrained features as model input
logits = outputs["logits"]

# Compute per-sequence probabilities and generate predictions. The head is
# assumed to emit two logits per sequence; take the positive-class column.
# For a single-logit head, use torch.sigmoid(logits.squeeze(-1)) instead.
probabilities = torch.softmax(logits, dim=-1)[..., -1]
predicted_labels = (probabilities >= 0.5).long()

# Print the prediction results for each sequence
print("\nPrediction Results:")
for i, seq in enumerate(processed_sequences):
    print(f"Sequence: {seq}")
    print(f"Probability: {probabilities[i].item():.4f}")
    print(f"Predicted Label: {predicted_labels[i].item()}")
    print("-" * 50)

Training and customization

PDeepPP supports fine-tuning on custom datasets. The model uses a configuration class (PDeepPPConfig) to specify hyperparameters such as:

  • Number of transformer layers
  • Hidden layer size
  • Dropout rate
  • PTM type and other task-specific parameters

Refer to PDeepPPConfig for details.
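
Since this card does not list PDeepPPConfig's field names, the following is a hedged sketch of a typical transformers-style customization and a single fine-tuning step. The dropout field name, the two-class head, and the random tensors standing in for real data are all assumptions.

import torch
import torch.nn.functional as F
from transformers import AutoConfig, AutoModel

# Load the configuration and override a hyperparameter before fine-tuning.
# The field name `dropout` is assumed from the bullet list above.
config = AutoConfig.from_pretrained("fondress/PDeepPP_umami", trust_remote_code=True)
config.dropout = 0.2

model = AutoModel.from_pretrained(
    "fondress/PDeepPP_umami", config=config, trust_remote_code=True
)

# One illustrative training step on precomputed embeddings
# (target_length=33, embedding_dim=1280, as in the example above).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
embeds = torch.randn(8, 33, 1280)        # placeholder for real ESM features
labels = torch.randint(0, 2, (8,))       # placeholder binary labels

outputs = model(input_embeds=embeds)
loss = F.cross_entropy(outputs["logits"], labels)  # assumes a two-class head
loss.backward()
optimizer.step()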

Citation

If you use PDeepPP in your research, please cite the associated paper or repository:

@article{your_reference,
  title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}