---
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm2
- biology
- protein
- protein language model
- cafa 5
- protein function prediction
---
# ESM-2 for Protein Function Prediction

This model is not intended to be used directly for protein function prediction, but rather as a checkpoint for further 
fine-tuning, especially with Low Rank Adaptation (LoRA). It is an experimental model fine-tuned from the 
[esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D) model 
for multi-label classification. In particular, the model is fine-tuned on the CAFA-5 protein sequence dataset available 
[here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5). More precisely, the `train_sequences.fasta` file contains the 
protein sequences the model was trained on, and the `train_terms.tsv` file contains the gene ontology (GO) protein function 
labels for each protein sequence. For more details on using ESM-2 models for multi-label sequence classification, 
[see here](https://huggingface.co/docs/transformers/model_doc/esm). 
Due to the potentially complicated class weighting required by the hierarchical ontology, further fine-tuning will be 
necessary (see the LoRA sketch below). 
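
Since this checkpoint is meant as a starting point for LoRA fine-tuning, the following is a minimal sketch of attaching 
LoRA adapters with the [PEFT](https://huggingface.co/docs/peft) library. The rank, alpha, dropout, and target module 
names are illustrative assumptions and are not the configuration used to produce this model.

```python
from transformers import EsmForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Load this checkpoint as the base model for further fine-tuning.
base_model = EsmForSequenceClassification.from_pretrained(
    "AmelieSchreiber/cafa_5_protein_function_prediction"
)

# Illustrative LoRA configuration (assumptions, not the settings used for this checkpoint).
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # ESM-2 attention projections
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapter weights are trainable
```

The wrapped `model` can then be trained as usual, for example with the Hugging Face `Trainer`, on the tokenized 
CAFA-5 sequences and their multi-hot GO term labels.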

## Fine-Tuning

The model was fine-tuned for 7 epochs at a learning rate of `5e-5`, and achieves the following metrics:
```
Validation Loss: 0.0027,
Validation Micro F1: 0.3672,
Validation Macro F1: 0.9967,
Validation Micro Precision: 0.6052,
Validation Macro Precision: 0.9996,
Validation Micro Recall: 0.2626,
Validation Macro Recall: 0.9966
```
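
For reference, the micro and macro metrics above can be computed from multi-label predictions with scikit-learn; the 
sketch below assumes you have a `logits` array and a multi-hot `labels` array from your own validation set (neither is 
shipped with this repository), and the `0.5` threshold is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def multilabel_metrics(logits, labels, threshold=0.5):
    # Convert logits to probabilities and then to hard 0/1 predictions.
    probs = 1 / (1 + np.exp(-logits))
    preds = (probs > threshold).astype(int)
    micro = precision_recall_fscore_support(labels, preds, average="micro", zero_division=0)
    macro = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {
        "micro_precision": micro[0], "micro_recall": micro[1], "micro_f1": micro[2],
        "macro_precision": macro[0], "macro_recall": macro[1], "macro_f1": macro[2],
    }
```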

## Using the model
First, download the file `go-basic.obo` [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5)
and store it locally, then provide the local path in the code below:

```python
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification

# 1. Parsing the go-basic.obo file
def parse_obo_file(file_path):
    with open(file_path, 'r') as f:
        data = f.read().split("[Term]")
        
    terms = []
    for entry in data[1:]:
        lines = entry.strip().split("\n")
        term = {}
        for line in lines:
            if line.startswith("id:"):
                term["id"] = line.split("id:")[1].strip()
            elif line.startswith("name:"):
                term["name"] = line.split("name:")[1].strip()
            elif line.startswith("namespace:"):
                term["namespace"] = line.split("namespace:")[1].strip()
            elif line.startswith("def:"):
                term["definition"] = line.split("def:")[1].split('"')[1]
        terms.append(term)
    return terms

parsed_terms = parse_obo_file("go-basic.obo")  # Replace `go-basic.obo` with your path

# 2. Load the saved model and tokenizer
model_path = "AmelieSchreiber/cafa_5_protein_function_prediction"
loaded_model = EsmForSequenceClassification.from_pretrained(model_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(model_path)

# `unique_terms` must list the GO term IDs in the same order used as labels during fine-tuning;
# the line below is only a placeholder assumption, so load the actual list from your training script.
unique_terms = [term["id"] for term in parsed_terms]

# 3. The predict_protein_function function
def predict_protein_function(sequence, model, tokenizer, go_terms):
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=1022)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)
        # 0.05 is a deliberately low decision threshold; raise it to trade recall for precision.
        predicted_indices = torch.where(predictions > 0.05)[1].tolist()
    
    functions = []
    for idx in predicted_indices:
        term_id = unique_terms[idx]  # Use the unique_terms list from your training script
        for term in go_terms:
            if term["id"] == term_id:
                functions.append(term["name"])
                break
                
    return functions

# 4. Predicting protein function for an example sequence
example_sequence = "MAYLGSLVQRRLELASGDRLEASLGVGSELDVRGDRVKAVGSLDLEEGRLEQAGVSMA"  # Replace with your protein sequence
predicted_functions = predict_protein_function(example_sequence, loaded_model, loaded_tokenizer, parsed_terms)
print(predicted_functions)
```