---
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm2
- protein language model
- biology
- protein token classification
- secondary structure prediction
---

# ESM-2 (`esm2_t6_8M_UR50D`) for Token Classification

This is a fine-tuned version of [esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D), trained on a token classification task that assigns each amino acid in a protein sequence to one of three categories: `0: other`, `1: alpha helix`, `2: beta strand`. It was trained with [this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) and achieves about 78.14% accuracy.

## Using the Model

To use the model, try running:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import numpy as np
import torch

# 1. Prepare the Model and Tokenizer
# Replace with the path where your trained model is saved if you're training a new model
model_dir = "AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-secondary-structure"
model = AutoModelForTokenClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Define a mapping from label IDs to their string representations
label_map = {0: "Other", 1: "Helix", 2: "Strand"}

# 2. Tokenize the New Protein Sequence
# Replace with your protein sequence
new_protein_sequence = "MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT"
tokens = tokenizer.tokenize(new_protein_sequence)  # one token per residue, no special tokens
inputs = tokenizer(new_protein_sequence, return_tensors="pt")  # adds <cls> and <eos>

# 3. Predict with the Model
with torch.no_grad():
    logits = model(**inputs).logits
predictions = np.argmax(logits[0].numpy(), axis=1)

# Drop the predictions for the <cls> and <eos> special tokens
# so that the remaining labels line up with the residues
predictions = predictions[1:-1]

# 4. Decode the Predictions
predicted_labels = [label_map[int(label_id)] for label_id in predictions]

# Print the tokens along with their predicted labels
for token, label in zip(tokens, predicted_labels):
    print(f"{token}: {label}")
```
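
Printing one label per residue gets long for real proteins. As a minimal sketch, the per-residue labels can be collapsed into a compact string and a list of contiguous segments. The helper names `to_structure_string` and `to_segments` and the one-letter codes (`H`, `E`, `-`) are assumptions made for illustration; they are not part of the model or of `transformers`. The example uses a short hand-written label list in place of real model output.

```python
from itertools import groupby

# Hypothetical one-letter codes; the model itself only defines the three class names.
LABEL_TO_CODE = {"Other": "-", "Helix": "H", "Strand": "E"}


def to_structure_string(labels):
    """Collapse per-residue labels into a compact string such as '-HHH-EE'."""
    return "".join(LABEL_TO_CODE[label] for label in labels)


def to_segments(labels):
    """Group consecutive identical labels into (label, start, end) tuples.

    Positions are 1-based and inclusive, matching usual residue numbering.
    """
    segments = []
    position = 1
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments


# Hand-written stand-in for a list of predicted labels (not real model output)
example_labels = ["Other", "Helix", "Helix", "Helix", "Other", "Strand", "Strand"]

print(to_structure_string(example_labels))
for label, start, end in to_segments(example_labels):
    print(f"{label}: residues {start}-{end}")
```

The same functions can be applied to a list of per-residue labels produced by the model, as long as it contains exactly one label per residue and no entries for special tokens.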