---
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm2
- protein language model
- biology
- protein token classification
- secondary structure prediction
---

# ESM-2 (`esm2_t6_8M_UR50D`) for Token Classification

This is a fine-tuned version of [esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D), trained on a token classification task that assigns each amino acid in a protein sequence to one of three secondary-structure classes: `0: other`, `1: alpha helix`, `2: beta strand`. It was trained with [this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) and achieves 78.14% accuracy.
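
For a token classification model, accuracy is usually measured per residue rather than per sequence. The sketch below shows one way such a score could be computed. The `token_accuracy` helper and the toy data are hypothetical and are not taken from the training notebook; it assumes masked positions (special or padding tokens) carry the label `-100`, a common convention in Hugging Face training pipelines.

```python
# Hypothetical helper for illustration; not the notebook's metric code.
IGNORE_INDEX = -100  # assumed label for positions excluded from scoring


def token_accuracy(predictions, labels, ignore_index=IGNORE_INDEX):
    """Return the fraction of non-masked positions where prediction == label."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(predictions, labels):
        for pred, label in zip(pred_seq, label_seq):
            if label == ignore_index:
                continue  # skip special/padding positions
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Toy data only (0: other, 1: alpha helix, 2: beta strand)
toy_predictions = [[0, 1, 1, 2, 0], [2, 2, 0, 0]]
toy_labels = [[-100, 1, 1, 0, -100], [2, 2, 0, 1]]

print(f"{token_accuracy(toy_predictions, toy_labels):.2%}")
```

Here 5 of the 7 scored positions match, so the toy example prints `71.43%`.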

## Using the Model

To run inference on a new protein sequence:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# 1. Prepare the model and tokenizer
# Replace with the path to your own checkpoint if you fine-tuned a new model
model_dir = "AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-secondary-structure"

model = AutoModelForTokenClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Map label IDs to their string representations
label_map = {0: "Other", 1: "Helix", 2: "Strand"}

# 2. Tokenize the new protein sequence
new_protein_sequence = "MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT"  # Replace with your protein sequence
inputs = tokenizer(new_protein_sequence, return_tensors="pt")
# Recover the token strings, including the special tokens the tokenizer adds
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

# 3. Predict with the model
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits[0].argmax(dim=-1).tolist()

# 4. Decode the predictions
# Skip special tokens (such as <cls> and <eos>) so each residue lines up with its own label
residue_predictions = [
    (token, label_map[label_id])
    for token, label_id in zip(tokens, predictions)
    if token not in tokenizer.all_special_tokens
]

# Print each residue along with its predicted label
for token, label in residue_predictions:
    print(f"{token}: {label}")
```
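
Per-residue labels like the ones printed above are often easier to read as a compact string or as contiguous segments. The following is a minimal sketch of that post-processing step. The helper names and the one-letter codes (`H` for helix, `E` for strand, `C` for other) are assumptions chosen for illustration, not part of the model, and the example labels are toy data rather than real model output.

```python
from itertools import groupby

# Assumed one-letter codes for the three classes; adjust to your own convention.
CODE_MAP = {"Other": "C", "Helix": "H", "Strand": "E"}


def to_structure_string(labels):
    """Collapse a list of label names into a one-letter-per-residue string."""
    return "".join(CODE_MAP[label] for label in labels)


def to_segments(labels):
    """Group consecutive identical labels into (label, start, end) tuples.

    Positions are 1-based and inclusive.
    """
    segments = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments


# Toy labels for illustration only
example_labels = ["Other", "Other", "Helix", "Helix", "Helix", "Strand", "Other"]

print(to_structure_string(example_labels))
for label, start, end in to_segments(example_labels):
    print(f"{label}: residues {start}-{end}")
```

To apply this to real predictions, pass in the list of label strings produced for your sequence, with special tokens already removed.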