---
language: en
tags:
- transformers
- protein
- peptide-receptor
license: apache-2.0
datasets:
- custom
---

## Model Description

This model predicts receptor classes, identified by their PDB IDs, from peptide sequences using the [ESM2](https://huggingface.co/docs/transformers/model_doc/esm) (Evolutionary Scale Modeling) protein language model with `esm2_t6_8M_UR50D` pre-trained weights. The model is fine-tuned for receptor prediction on datasets from [PROPEDIA](http://bioinfo.dcc.ufmg.br/propedia2/) and [PepNN](https://www.nature.com/articles/s42003-022-03445-2), together with novel peptides experimentally validated to bind their target proteins, whose binding conformations were determined with ClusPro, a protein-protein docking tool. The name `pep2rec_cppp` reflects the model's purpose, predicting peptide-to-receptor relationships, and its training data from ClusPro, PROPEDIA, and PepNN. It is intended for researchers and practitioners in bioinformatics, drug discovery, and related fields who want to understand or predict peptide-receptor interactions.

## How to Use

Here is how to predict the receptor class for a peptide sequence using this model:

```python
import torch
from huggingface_hub import hf_hub_download
from joblib import load
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "littleworth/esm2_t6_8M_UR50D_pep2rec_cppp"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# The fitted label encoder maps class indices to receptor PDB IDs.
# Download it from the model repository and load it with joblib.
label_encoder = load(hf_hub_download(MODEL_PATH, "label_encoder.joblib"))

# Tokenize the peptide sequence and run a forward pass.
input_sequence = "GNLIVVGRVIMS"
inputs = tokenizer(input_sequence, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely receptor.
probabilities = torch.softmax(outputs.logits, dim=1)
predicted_class_idx = probabilities.argmax(dim=1).item()
predicted_class = label_encoder.inverse_transform([predicted_class_idx])[0]

# Rank all receptor classes by predicted probability.
class_probabilities = probabilities.squeeze().tolist()
class_labels = label_encoder.inverse_transform(range(len(class_probabilities)))
sorted_indices = torch.argsort(probabilities, descending=True).squeeze()
sorted_class_labels = [class_labels[i] for i in sorted_indices.tolist()]
sorted_class_probabilities = probabilities.squeeze()[sorted_indices].tolist()

print(f"Predicted Receptor Class: {predicted_class}")
print("Top 10 Class Probabilities:")
for label, prob in zip(sorted_class_labels[:10], sorted_class_probabilities[:10]):
    print(f"{label}: {prob:.4f}")
```

Which gives this output:

```
Predicted Receptor Class: 1JXP
Top 10 Class Probabilities:
1JXP: 0.7793
2OIN: 0.0058
1A1R: 0.0026
2QV1: 0.0025
3KEE: 0.0022
3KF2: 0.0016
5LAS: 0.0016
1QD6: 0.0014
6ME1: 0.0013
2XCF: 0.0013
```

Two additional sketches, one for inspecting the receptor label space and one for batched prediction, are included at the end of this card.

## Evaluation Results

The model was evaluated on a held-out test set, reaching an accuracy of roughly 0.776 (`eval/accuracy`) with an evaluation loss of about 0.772 (`eval/loss`). The full final run summary, including training statistics, is reproduced below:

```
{
  "train/loss": 0.7338,
  "train/grad_norm": 4.333151340484619,
  "train/learning_rate": 2.3235385792411667e-8,
  "train/epoch": 10,
  "train/global_step": 352910,
  "_timestamp": 1711654529.5562913,
  "_runtime": 204515.04906344414,
  "_step": 715,
  "eval/loss": 0.7718502879142761,
  "eval/accuracy": 0.7761048124023759,
  "eval/runtime": 2734.4878,
  "eval/samples_per_second": 34.416,
  "eval/steps_per_second": 34.416,
  "train/train_runtime": 204505.5285,
  "train/train_samples_per_second": 13.806,
  "train/train_steps_per_second": 1.726,
  "train/total_flos": 143220103846625280,
  "train/train_loss": 1.0842229404661865,
  "_wandb": {
    "runtime": 204514
  }
}
```
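
## Inspecting the Receptor Label Space

The receptor classes are plain PDB IDs stored in the bundled label encoder, so the full label space can be listed directly. The following is a minimal sketch, assuming `label_encoder.joblib` is available in the model repository as in the usage example above:

```python
from huggingface_hub import hf_hub_download
from joblib import load

MODEL_PATH = "littleworth/esm2_t6_8M_UR50D_pep2rec_cppp"

# Download the fitted label encoder from the model repository and load it.
label_encoder = load(hf_hub_download(MODEL_PATH, "label_encoder.joblib"))

# classes_ lists the receptor PDB IDs in the same order as the model's output logits.
receptor_ids = list(label_encoder.classes_)
print(f"Number of receptor classes: {len(receptor_ids)}")
print(receptor_ids[:10])
```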
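
## Batched Prediction

To score many peptides at once, the model can be run on a padded batch instead of one sequence at a time. The following is a minimal sketch; the peptide list is purely illustrative:

```python
import torch
from huggingface_hub import hf_hub_download
from joblib import load
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "littleworth/esm2_t6_8M_UR50D_pep2rec_cppp"
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
label_encoder = load(hf_hub_download(MODEL_PATH, "label_encoder.joblib"))

# Illustrative peptide sequences; replace with your own.
peptides = ["GNLIVVGRVIMS", "AAGIGILTV", "SIINFEKL"]

# Tokenize all sequences as one padded batch and run a single forward pass.
batch = tokenizer(peptides, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits

# Convert logits to probabilities and take the top receptor for each peptide.
probabilities = torch.softmax(logits, dim=1)
predicted_indices = probabilities.argmax(dim=1).tolist()
predicted_receptors = label_encoder.inverse_transform(predicted_indices)

for peptide, receptor, probs in zip(peptides, predicted_receptors, probabilities):
    print(f"{peptide} -> {receptor} (p={probs.max().item():.4f})")
```

Padding lets sequences of different lengths share one forward pass; the attention mask returned by the tokenizer and passed along with the batch accounts for the padded positions.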