---
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm2
- protein language model
- pLM
- biology
- multilabel sequence classification
metrics:
- f1
- precision
- recall
---
# ESM-2 Fine-tuned CAFA-5
## ESM-2 for Protein Function Prediction
This is an experimental model fine-tuned from the
[esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D) model
for multi-label classification of protein function. The model is fine-tuned on the CAFA-5 protein sequence dataset available
[here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5). The `train_sequences.fasta` file contains the
protein sequences used for training, and the
`train_terms.tsv` file contains the Gene Ontology (GO) protein function labels for each sequence. For more details on using
ESM-2 models for multi-label sequence classification, [see here](https://huggingface.co/docs/transformers/model_doc/esm).
Because the hierarchical ontology likely calls for nontrivial class weighting, further fine-tuning will be necessary.
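As a rough illustration of what class weighting could look like, the sketch below computes one weight per GO term as the ratio of negative to positive examples. This is a common heuristic and not necessarily the weighting this model needs; the helper name `positive_class_weights` and the toy annotations are hypothetical. A vector like this could be passed as the `pos_weight` argument of `torch.nn.functional.binary_cross_entropy_with_logits`.

```python
from collections import Counter

def positive_class_weights(label_sets, terms):
    """Return one weight per term: (#negatives / #positives).

    Hypothetical helper for illustration only; rare terms get large
    weights, frequent terms get small ones.
    """
    n = len(label_sets)
    counts = Counter(t for labels in label_sets for t in set(labels))
    weights = []
    for term in terms:
        pos = counts.get(term, 0)
        # Terms that never occur get weight 1.0 to avoid division by zero.
        weights.append((n - pos) / pos if pos else 1.0)
    return weights

# Toy annotations for four hypothetical proteins
label_sets = [
    ["GO:0005515", "GO:0008150"],
    ["GO:0008150"],
    ["GO:0008150", "GO:0003674"],
    ["GO:0003674"],
]
terms = sorted({t for labels in label_sets for t in labels})
weights = positive_class_weights(label_sets, terms)
for term, weight in zip(terms, weights):
    print(term, round(weight, 3))
```

This heuristic ignores the ontology's parent/child structure, which is one reason the weighting problem is harder than it looks here.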
## Training
The training/validation split of the data for this model is available [here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5_train_val_split_1).
Macro-averaged metrics after the final epoch:
```
Epoch 5/5
Training loss: 0.06925179701577704
Validation Precision: 0.9821931289359406
Validation Recall: 0.999934039607066
Validation MultilabelF1Score: 0.9907671213150024
Validation AUROC: 0.5831210653861931
```
Micro-averaged metrics:
```
Validation Precision: 0.9822020821532512
Validation Recall: 0.9999363677941498
```
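To make the difference between the two averaging modes concrete, here is a minimal pure-Python sketch on toy data (the function names and the toy labels are illustrative assumptions, not the evaluation code used for this model). Macro averaging computes precision and recall per label and then averages them, so rare labels count as much as common ones; micro averaging pools all true/false positives first.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def macro_micro(y_true, y_pred):
    """Compute macro- and micro-averaged (precision, recall) for multi-hot rows."""
    n_labels = len(y_true[0])
    per_label = []
    total_tp = total_fp = total_fn = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append(precision_recall(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = (
        sum(p for p, _ in per_label) / n_labels,
        sum(r for _, r in per_label) / n_labels,
    )
    micro = precision_recall(total_tp, total_fp, total_fn)
    return {"macro": macro, "micro": micro}

# Two labels: a frequent one predicted perfectly, a rare one predicted badly
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 1], [0, 0]]
result = macro_micro(y_true, y_pred)
print(result)  # macro: (0.5, 0.5), micro: (0.75, 0.75)
```

In the toy data the frequent label dominates the micro average, while the macro average is pulled down by the poorly predicted rare label.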
## Using the model
First, download the `train_sequences.fasta` file and the `train_terms.tsv` file, and provide the local paths in the code below:
```python
import os
import numpy as np
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification
from torch.optim import AdamW  # the AdamW implementation in transformers is deprecated
from torch.nn.functional import binary_cross_entropy_with_logits
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score
# from accelerate import Accelerator
from Bio import SeqIO

# Step 1: Data Preprocessing (Replace with your local paths)
fasta_file = "data/train_sequences.fasta"
tsv_file = "data/train_terms.tsv"

fasta_data = {}
tsv_data = {}

# Map each protein ID to its amino acid sequence
for record in SeqIO.parse(fasta_file, "fasta"):
    fasta_data[record.id] = str(record.seq)

# Map each protein ID to the remaining columns of its row.
# Note: if an ID appears on several rows, only the last row is kept.
with open(tsv_file, 'r') as f:
    for line in f:
        parts = line.strip().split("\t")
        tsv_data[parts[0]] = parts[1:]

# tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
seq_length = 1022
# tokenized_data = tokenizer(list(fasta_data.values()), padding=True, truncation=True, return_tensors="pt", max_length=seq_length)

# Label vocabulary; set ordering is not guaranteed to be stable across Python runs
unique_terms = list(set(term for terms in tsv_data.values() for term in terms))
```
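For readers who want a reproducible label ordering, the following is a minimal sketch of an alternative preprocessing step. It is not the preprocessing used to train this checkpoint, so its label indices should not be assumed to match the model's output positions. It assumes the TSV has a header row with `EntryID`, `term`, and `aspect` columns and one row per annotation; the helper names `load_terms` and `multi_hot` and the toy rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

def load_terms(handle):
    """Group GO terms by protein ID (assumes EntryID/term/aspect columns)."""
    terms_by_protein = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        terms_by_protein[row["EntryID"]].append(row["term"])
    return dict(terms_by_protein)

def multi_hot(labels, vocabulary):
    """Encode a list of terms as a 0/1 vector over a fixed vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for term in labels:
        vector[index[term]] = 1
    return vector

# Toy file contents standing in for train_terms.tsv
toy_tsv = io.StringIO(
    "EntryID\tterm\taspect\n"
    "P1\tGO:0008150\tBPO\n"
    "P1\tGO:0005515\tMFO\n"
    "P2\tGO:0008150\tBPO\n"
)
terms_by_protein = load_terms(toy_tsv)

# Sorting gives the same index for each term on every run
vocabulary = sorted({t for ts in terms_by_protein.values() for t in ts})
encoded = {pid: multi_hot(ts, vocabulary) for pid, ts in terms_by_protein.items()}
print(vocabulary)
print(encoded)
```

Saving the sorted vocabulary next to the model weights is one way to keep the index-to-term mapping available at inference time.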
Second, download the file `go-basic.obo` [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5)
and store it locally, then provide the local path in the code below:
```python
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification
from sklearn.metrics import precision_recall_fscore_support

# 1. Parsing the go-basic.obo file
def parse_obo_file(file_path):
    with open(file_path, 'r') as f:
        data = f.read().split("[Term]")

    terms = []
    for entry in data[1:]:
        lines = entry.strip().split("\n")
        term = {}
        for line in lines:
            if line.startswith("id:"):
                term["id"] = line.split("id:")[1].strip()
            elif line.startswith("name:"):
                term["name"] = line.split("name:")[1].strip()
            elif line.startswith("namespace:"):
                term["namespace"] = line.split("namespace:")[1].strip()
            elif line.startswith("def:"):
                term["definition"] = line.split("def:")[1].split('"')[1]
        terms.append(term)
    return terms

parsed_terms = parse_obo_file("go-basic.obo")  # Replace `go-basic.obo` with your path

# 2. Load the saved model and tokenizer
model_path = "AmelieSchreiber/esm2_t6_8M_finetuned_cafa5"
loaded_model = EsmForSequenceClassification.from_pretrained(model_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(model_path)

# 3. The predict_protein_function function
def predict_protein_function(sequence, model, tokenizer, go_terms):
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=1022)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)
        predicted_indices = torch.where(predictions > 0.05)[1].tolist()

    functions = []
    for idx in predicted_indices:
        term_id = unique_terms[idx]  # Use the unique_terms list from your training script
        for term in go_terms:
            if term["id"] == term_id:
                functions.append(term["name"])
                break
    return functions

# 4. Predicting protein function for an example sequence
example_sequence = "MAYLGSLVQRRLELASGDRLEASLGVGSELDVRGDRVKAVGSLDLEEGRLEQAGVSMA"  # Replace with your protein sequence
predicted_functions = predict_protein_function(example_sequence, loaded_model, loaded_tokenizer, parsed_terms)
print(predicted_functions)
```
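The decision step in `predict_protein_function` (sigmoid, then a 0.05 threshold, then a GO ID to name lookup) can be illustrated without loading the model. The sketch below uses made-up logits and a three-term vocabulary; `select_terms` is a hypothetical helper, and it swaps the linear search over parsed terms for a dictionary lookup, which is a design choice of this sketch rather than part of the code above.

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def select_terms(logits, term_ids, id_to_name, threshold=0.05):
    """Return (name, probability) pairs above the threshold, most likely first."""
    selected = []
    for logit, term_id in zip(logits, term_ids):
        prob = sigmoid(logit)
        if prob > threshold:
            # Fall back to the raw ID when the term has no known name
            selected.append((id_to_name.get(term_id, term_id), prob))
    return sorted(selected, key=lambda pair: pair[1], reverse=True)

# Made-up logits for a three-term vocabulary (illustration only)
term_ids = ["GO:0005515", "GO:0003674", "GO:0008150"]
id_to_name = {
    "GO:0005515": "protein binding",
    "GO:0003674": "molecular_function",
    "GO:0008150": "biological_process",
}
logits = [2.0, -4.0, -1.0]

picked = select_terms(logits, term_ids, id_to_name)
for name, prob in picked:
    print(f"{name}: {prob:.3f}")
```

With a threshold as low as 0.05, many terms can pass the cut, so ranking by probability and inspecting only the top few is often more practical than reading the full list.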