README.md · AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp1 at 61870326226de898fbdcc2a8a21afabf4dfe6347

metadata

library_name: peft
license: mit
datasets:
  - AmelieSchreiber/binding_sites_random_split_by_family_550K
metrics:
  - accuracy
  - f1
  - roc_auc
  - precision
  - recall
  - matthews_correlation

Training procedure

This model was finetuned on ~549K protein sequences from the UniProt database. The dataset can be found here. The model obtains the following test metrics:

Epoch  Training Loss  Validation Loss Accuracy  Precision  Recall	 F1	      Auc	   Mcc
1	   0.037400	      0.301413        0.939431	0.366282   0.833003	 0.508826 0.888300 0.528311

The dataset size increase from ~209K protein sequences to ~549K clearly improved performance in terms of test metric. We used Hugging Face's parameter efficient finetuning (PEFT) library to finetune with Low Rank Adaptation (LoRA). We decided to use a rank of 2 for the LoRA, as this was shown to slightly improve the test metrics compared to rank 8 and rank 16 on the same model trained on the smaller dataset.

Framework versions

PEFT 0.5.0

Using the model

To use the model on one of your protein sequences try running the following:

from transformers import AutoModelForTokenClassification, AutoTokenizer
from peft import PeftModel
import torch

# Path to the saved LoRA model
model_path = "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp1"
# ESM2 base model
base_model_path = "facebook/esm2_t12_35M_UR50D"

# Load the model
base_model = AutoModelForTokenClassification.from_pretrained(base_model_path)
loaded_model = PeftModel.from_pretrained(base_model, model_path)

# Ensure the model is in evaluation mode
loaded_model.eval()

# Load the tokenizer
loaded_tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Protein sequence for inference
protein_sequence = "MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT"  # Replace with your actual sequence

# Tokenize the sequence
inputs = loaded_tokenizer(protein_sequence, return_tensors="pt", truncation=True, max_length=1024, padding='max_length')

# Run the model
with torch.no_grad():
    logits = loaded_model(**inputs).logits

# Get predictions
tokens = loaded_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])  # Convert input ids back to tokens
predictions = torch.argmax(logits, dim=2)

# Define labels
id2label = {
    0: "No binding site",
    1: "Binding site"
}

# Print the predicted labels for each token
for token, prediction in zip(tokens, predictions[0].numpy()):
    if token not in ['<pad>', '<cls>', '<eos>']:
        print((token, id2label[prediction]))