---
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm2
- protein language model
- biology
- protein token classification
- secondary structure prediction
---
# ESM-2 (`esm2_t6_8M_UR50D`) for Token Classification
This is a fine-tuned version of [esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D), trained for token classification:
it assigns each amino acid in a protein sequence to one of three secondary structure classes, `0: other`, `1: alpha helix`, or `2: beta strand`. It was trained with
[this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) and achieves
78.14% accuracy.
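The accuracy above is a per-residue figure. As a rough sketch of how such a metric can be computed, the snippet below assumes the common Hugging Face convention that positions labelled `-100` (special tokens, padding) are excluded from scoring. The function name `token_accuracy` and the toy data are illustrative and are not taken from the training notebook.

```python
def token_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of scored positions where the predicted class equals the label.

    Positions whose label equals `ignore_index` are skipped (assumed convention
    for special tokens and padding).
    """
    correct = 0
    scored = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue
            scored += 1
            correct += int(pred == label)
    return correct / scored if scored else 0.0


# Toy example: two sequences of three residues each, wrapped in ignored positions
preds = [[0, 1, 1, 2, 0], [0, 2, 2, 0, 0]]
labels = [[-100, 1, 1, 0, -100], [-100, 2, 1, 0, -100]]
print(f"{token_accuracy(preds, labels):.4f}")
```

Here four of the six scored residues match, so the toy accuracy is 4/6.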
## Using the Model
To run inference on a protein sequence:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import numpy as np
import torch
# 1. Prepare the Model and Tokenizer
# Replace with the path where your trained model is saved if you're training a new model
model_dir = "AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-secondary-structure"
model = AutoModelForTokenClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# Define a mapping from label IDs to their string representations
label_map = {0: "Other", 1: "Helix", 2: "Strand"}
# 2. Tokenize the New Protein Sequence
new_protein_sequence = "MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT" # Replace with your protein sequence
tokens = tokenizer.tokenize(new_protein_sequence)
inputs = tokenizer.encode(new_protein_sequence, return_tensors="pt")
# 3. Predict with the Model
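with torch.no_grad():
    outputs = model(inputs).logits
# The tokenizer wraps the sequence in <cls> and <eos>; drop their predictions
# so the labels line up with the residue tokens
predictions = np.argmax(outputs[0].numpy(), axis=1)[1:-1]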
# 4. Decode the Predictions
predicted_labels = [label_map[label_id] for label_id in predictions]
# Print the tokens along with their predicted labels
for token, label in zip(tokens, predicted_labels):
    print(f"{token}: {label}")
```
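Per-residue output gets long for real proteins, so it can help to collapse consecutive identical labels into segments. The following is a minimal sketch, assuming a list of per-residue label strings such as `predicted_labels`. The helper `collapse_segments` is hypothetical and not part of `transformers`.

```python
from itertools import groupby


def collapse_segments(labels):
    """Group consecutive identical labels into (label, start, end) tuples.

    Positions are 1-based and inclusive, following the usual residue numbering.
    """
    segments = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments


# Toy input standing in for the model's per-residue labels
example = ["Other", "Helix", "Helix", "Helix", "Other", "Strand", "Strand"]
for label, start, end in collapse_segments(example):
    print(f"{label}: residues {start}-{end}")
```

The same call can be applied to the per-residue labels produced by the model once they are aligned with the residues.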