---
library_name: transformers
license: apache-2.0
metrics:
- perplexity
base_model:
- facebook/esm1b_t33_650M_UR50S
---
## **Fine-Tuning ESM-1b with Multiple Sequence Alignment (MSA) for Phosphosites**
This repository provides a version of ESM-1b fine-tuned with the Masked Language Modeling (MLM) objective, incorporating evolutionary information by leveraging long phosphosite sequences from the [DARKIN dataset](https://openreview.net/pdf?id=a4x5tbYRYV) together with Multiple Sequence Alignments (MSAs) of those phosphosites. The goal is to enhance the model's understanding of phosphorylation by integrating sequence conservation patterns.
### Developed by:
Zeynep Işık (MSc, Sabanci University)
### **Dataset & Preprocessing**
To construct a robust dataset, we extracted 256 MSA sequences per phosphosite from publicly available sequence databases. This resulted in a dataset of approximately 2 million sequences. Due to the large data size, the following preprocessing steps were applied:
1. Selection of MSA Sequences for Labeled Data
- Up to 10 MSA sequences were selected per human phosphosite.
- This resulted in a final dataset of 98,000 samples.
2. Dataset Splitting
- 10% of the data was reserved for validation.
- The remaining 90% was used for fine-tuning with the Masked Language Modeling (MLM) objective.
3. Sequence Processing
- Special care was taken to preserve the phosphorylated residue within each sequence.
- To optimize memory efficiency, sequences were truncated to 128 amino acids (see the sketch below).
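The preprocessing and training scripts are not included in this repository, so the following is only a minimal sketch of how phosphosite-preserving truncation and MLM fine-tuning could be set up with the Hugging Face `Trainer`. The `center_window` helper, the example `records`, the 0.15 masking probability, and the training hyperparameters are illustrative assumptions, not the exact values used for this model.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Hypothetical helper: keep a 128-residue window centered on the phosphosite,
# so the phosphorylated residue is never lost to truncation (an assumption
# about how the phosphosite residues were preserved).
def center_window(sequence: str, site_index: int, max_len: int = 128) -> str:
    half = max_len // 2
    start = max(0, min(site_index - half, len(sequence) - max_len))
    return sequence[start:start + max_len]

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm1b_t33_650M_UR50S")

# `records` stands in for the phosphosite and MSA sequences described above.
records = [{"sequence": "MKTLLLTLVVVTIVCLDLGYTGV", "site_index": 11}]
dataset = Dataset.from_list(records)

def tokenize(example):
    windowed = center_window(example["sequence"], example["site_index"])
    return tokenizer(windowed, truncation=True, max_length=128)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)
splits = tokenized.train_test_split(test_size=0.1)  # 90/10 split as described

# Random masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="esm1b_phosphosite_mlm", per_device_train_batch_size=8),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=collator,
)
trainer.train()
```

Centering the 128-residue window on the phosphosite is one straightforward way to guarantee that the modified residue survives truncation.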
### Evaluation
Perplexity: 2.69 on the validation set (down from 7.05 before fine-tuning)
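
Perplexity here is the exponential of the mean masked-token cross-entropy loss on the validation split. Assuming the `trainer` and validation split from the sketch above, a comparable number can be computed as follows:

```python
import math

# Evaluate on the held-out split; `eval_loss` is the mean MLM cross-entropy loss
eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Validation perplexity: {perplexity:.2f}")
```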
### Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the model and tokenizer
model_name = "isikz/phosphosite_msa_finetuned_esm1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Example sequence with a masked residue (ESM tokenizers use "<mask>" as the mask token)
sequence = f"MKTLLLTLVVV{tokenizer.mask_token}VCLDLGYTGV"

# Tokenize input
inputs = tokenizer(sequence, return_tensors="pt")

# Locate the masked position instead of hard-coding its index
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

predicted_token_id = torch.argmax(logits[0, mask_index]).item()
predicted_token = tokenizer.decode([predicted_token_id])
print(f"Predicted Residue: {predicted_token}")
```