license: mit

ESM-2 QLoRA for Binding Sites Prediction

In this model, QLoRA adapters are applied to all of the weight matrices rather than to a subset of them. The gap between the train and test metrics is, again, smaller for this model than for the model with fewer adapter layers (adapters on the query, key, and value matrices only). This suggests that adapting more of the weight matrices in this larger ESM-2 model reduces overfitting and acts as a better regularizer. For comparison, see this model, which only has QLoRA adapters on the query, key, and value matrices.

This model was trained on this dataset. Note that this dataset is too small for this model, so some overfitting is expected, but it is clearly reduced by including more adapter layers in the QLoRA configuration.
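To give a rough sense of what "more adapter layers" means in terms of trainable parameters, the sketch below counts the LoRA parameters added per configuration. It rests on several assumptions that are not stated in this card: the LoRA rank (`RANK = 8` is a placeholder), the base model dimensions (hidden size 480, feed-forward size 1920, 12 layers), and the reading of "all of the weight matrices" as the six linear layers in each transformer block, excluding the classification head. The quantized base weights stay frozen, so only these adapter parameters are trained.

```python
def lora_params(d_in, d_out, rank):
    """Parameters LoRA adds to one d_in x d_out weight: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_params(matrix_shapes, rank, num_layers):
    """Total LoRA parameters when every listed matrix is adapted in every layer."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in matrix_shapes)
    return num_layers * per_layer


# Assumed dimensions for the 35M-parameter ESM-2 base model (not stated in this card).
HIDDEN, INTERMEDIATE, LAYERS = 480, 1920, 12
RANK = 8  # hypothetical rank, chosen only for illustration

# Query, key, and value projections only.
qkv_only = [(HIDDEN, HIDDEN)] * 3
# Assumed meaning of "all weight matrices": add the attention output projection
# and the two feed-forward projections.
all_linear = qkv_only + [
    (HIDDEN, HIDDEN),
    (HIDDEN, INTERMEDIATE),
    (INTERMEDIATE, HIDDEN),
]

qkv_count = adapter_params(qkv_only, RANK, LAYERS)
all_count = adapter_params(all_linear, RANK, LAYERS)
print(f"query/key/value only: {qkv_count:,} trainable adapter parameters")
print(f"all linear layers:    {all_count:,} trainable adapter parameters")
print(f"ratio: {all_count / qkv_count:.1f}x")
```

Under these assumptions the ratio between the two configurations does not depend on the rank, since every term scales linearly with it.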

Testing for Overfitting

Train metrics:
{'eval_loss': 0.17861589789390564,
'eval_accuracy': 0.9336392007583741,
'eval_precision': 0.24007189695313816,
'eval_recall': 0.9234520216135872,
'eval_f1': 0.38107489676203077,
'eval_auc': 0.9286608447868842,
'eval_mcc': 0.4519203165484902}

Test metrics:
{'eval_loss': 0.2265990674495697,
'eval_accuracy': 0.913988661430497,
'eval_precision': 0.1725452162312655,
'eval_recall': 0.8272126203209694,
'eval_f1': 0.28553230637278637,
'eval_auc': 0.8715212375759034,
'eval_mcc': 0.3539008454498742}
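The train/test comparison can be made explicit by subtracting the two sets of metrics. The sketch below copies the values reported above and prints the gap for each metric; the helper names `metric_gaps` and `f1_from` are illustrative and not part of the original training code. It also checks that the reported F1 scores are consistent with the reported precision and recall.

```python
# Metrics copied from the train and test results reported in this card.
train_metrics = {
    "eval_loss": 0.17861589789390564,
    "eval_accuracy": 0.9336392007583741,
    "eval_precision": 0.24007189695313816,
    "eval_recall": 0.9234520216135872,
    "eval_f1": 0.38107489676203077,
    "eval_auc": 0.9286608447868842,
    "eval_mcc": 0.4519203165484902,
}
test_metrics = {
    "eval_loss": 0.2265990674495697,
    "eval_accuracy": 0.913988661430497,
    "eval_precision": 0.1725452162312655,
    "eval_recall": 0.8272126203209694,
    "eval_f1": 0.28553230637278637,
    "eval_auc": 0.8715212375759034,
    "eval_mcc": 0.3539008454498742,
}


def metric_gaps(train, test):
    """Return train minus test for every metric present in both dicts."""
    return {name: train[name] - test[name] for name in train if name in test}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


gaps = metric_gaps(train_metrics, test_metrics)
for name, gap in gaps.items():
    print(f"{name:>15}: train={train_metrics[name]:.4f} "
          f"test={test_metrics[name]:.4f} gap={gap:+.4f}")

# Sanity check: recompute F1 from the reported precision and recall.
print("train F1 recomputed:", round(f1_from(train_metrics["eval_precision"],
                                            train_metrics["eval_recall"]), 4))
print("test F1 recomputed: ", round(f1_from(test_metrics["eval_precision"],
                                           test_metrics["eval_recall"]), 4))
```

A positive gap means the model scores higher on the training split for that metric; for the loss, a negative gap carries the same meaning. On both splits, recall is high and precision is low, so many of the residues predicted as binding sites are false positives.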

To use this model, run the following:

!pip install transformers -q
!pip install peft -q

Then run:

from transformers import AutoModelForTokenClassification, AutoTokenizer
from peft import PeftModel
import torch

# Path to the saved LoRA model
model_path = "AmelieSchreiber/esm2_t12_35M_qlora_binding_sites_v1"
# ESM2 base model
base_model_path = "facebook/esm2_t12_35M_UR50D"

# Load the model
base_model = AutoModelForTokenClassification.from_pretrained(base_model_path)
loaded_model = PeftModel.from_pretrained(base_model, model_path)

# Ensure the model is in evaluation mode
loaded_model.eval()

# Load the tokenizer
loaded_tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Protein sequence for inference
protein_sequence = "MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT"  # Replace with your actual sequence

# Tokenize the sequence (a single sequence needs no padding, so it is not padded to 1024 tokens)
inputs = loaded_tokenizer(protein_sequence, return_tensors="pt", truncation=True, max_length=1024)

# Run the model
with torch.no_grad():
    logits = loaded_model(**inputs).logits

# Get predictions
tokens = loaded_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])  # Convert input ids back to tokens
predictions = torch.argmax(logits, dim=2)

# Define labels
id2label = {
    0: "No binding site",
    1: "Binding site"
}

# Print the predicted label for each residue, skipping special tokens
for token, prediction in zip(tokens, predictions[0].tolist()):
    if token not in ['<pad>', '<cls>', '<eos>']:
        print((token, id2label[prediction]))
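For downstream use, a list of predicted binding residue positions is often more convenient than per-token printouts. The following is a minimal sketch of such a post-processing step; `binding_site_positions` is a hypothetical helper, not part of `transformers` or `peft`, and it assumes that label ID 1 means "Binding site", as in `id2label` above. It runs on toy inputs here so that it does not depend on the model being loaded.

```python
SPECIAL_TOKENS = {"<pad>", "<cls>", "<eos>"}


def binding_site_positions(tokens, predicted_ids, positive_id=1):
    """Return 1-based residue positions whose predicted label equals positive_id.

    Special tokens are skipped and do not advance the residue counter, so the
    positions index into the original protein sequence.
    """
    positions = []
    residue_index = 0
    for token, label_id in zip(tokens, predicted_ids):
        if token in SPECIAL_TOKENS:
            continue
        residue_index += 1
        if int(label_id) == positive_id:
            positions.append(residue_index)
    return positions


# Toy inputs standing in for the tokenizer output and the argmax predictions.
example_tokens = ["<cls>", "M", "A", "V", "P", "<eos>", "<pad>"]
example_ids = [0, 0, 1, 1, 0, 0, 0]
example_positions = binding_site_positions(example_tokens, example_ids)
print(example_positions)  # [2, 3]
```

With the inference code above, the equivalent call would be `binding_site_positions(tokens, predictions[0].tolist())`.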