AmelieSchreiber
/

esm2_t6_8M_UR50D-finetuned-secondary-structure

AmelieSchreiber

Update README.md

24b37e9 11 months ago

	---
	license: mit
	language:
	- en
	library_name: transformers
	tags:
	- esm
	- esm2
	- protein language model
	- biology
	- protein token classification
	- secondary structure prediction
	---

	# ESM-2 (`esm2_t6_8M_UR50D`) for Token Classification

	This is a version of [esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D) fine-tuned for token classification: it assigns each
	amino acid in a protein sequence to one of three classes, `0: other`, `1: alpha helix`, or `2: beta strand`. It was trained with
	[this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) and achieves
	approximately 78.14% accuracy.
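
The card does not say how the accuracy figure was computed. As a rough sketch, assuming it is a per-residue accuracy (the fraction of scored positions whose predicted class matches the reference class), it could be calculated as below. The function name `per_residue_accuracy` and the toy data are hypothetical, and the `-100` marker for ignored positions is assumed from the usual Hugging Face token classification convention.

```python
def per_residue_accuracy(predicted, reference, ignore_index=-100):
    """Fraction of scored positions where the predicted class equals the reference class.

    Positions whose reference label equals ignore_index (assumed here to mark
    special tokens or padding) are skipped.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    scored = [(p, r) for p, r in zip(predicted, reference) if r != ignore_index]
    if not scored:
        raise ValueError("no positions left to score")
    correct = sum(1 for p, r in scored if p == r)
    return correct / len(scored)


# Toy data: six positions, of which the first and last are ignored.
predicted = [0, 1, 1, 2, 0, 0]
reference = [-100, 1, 1, 2, 2, -100]
accuracy = per_residue_accuracy(predicted, reference)
print(f"{accuracy:.2%}")
```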

	## Using the Model

	To predict the secondary structure of a protein sequence, run:
	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import numpy as np
	import torch

	# 1. Prepare the Model and Tokenizer
	# Replace with the path to your own checkpoint if you have trained a new model
	model_dir = "AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-secondary-structure"

	model = AutoModelForTokenClassification.from_pretrained(model_dir)
	tokenizer = AutoTokenizer.from_pretrained(model_dir)

	# Define a mapping from label IDs to their string representations
	label_map = {0: "Other", 1: "Helix", 2: "Strand"}

	# 2. Tokenize the New Protein Sequence
	new_protein_sequence = "MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT"  # Replace with your protein sequence
	tokens = tokenizer.tokenize(new_protein_sequence)
	inputs = tokenizer.encode(new_protein_sequence, return_tensors="pt")

	# 3. Predict with the Model
	with torch.no_grad():
	    outputs = model(inputs).logits
	predictions = np.argmax(outputs[0].numpy(), axis=1)

	# encode() wraps the sequence in special tokens (<cls> at the start, <eos> at the end)
	# that tokenize() does not return; drop them so predictions line up with tokens
	predictions = predictions[1:-1]

	# 4. Decode the Predictions
	predicted_labels = [label_map[label_id] for label_id in predictions]

	# Print the tokens along with their predicted labels
	for token, label in zip(tokens, predicted_labels):
	    print(f"{token}: {label}")
	```
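
Per-residue output like the above can be hard to scan for long sequences. A minimal sketch of one way to summarize it follows, assuming the predictions cover residues only (positions for any special tokens the tokenizer adds have already been dropped). The one-letter codes `C`, `H`, and `E` are an assumed convention rather than something this model card defines, and `to_ss_string` and `find_segments` are hypothetical helper names.

```python
# Assumed one-letter codes for the card's three classes (not defined by the model card).
SS_CODES = {0: "C", 1: "H", 2: "E"}


def to_ss_string(predictions, codes=SS_CODES):
    """Map per-residue class IDs to a compact string, one letter per residue."""
    return "".join(codes[int(label_id)] for label_id in predictions)


def find_segments(ss_string):
    """Return (code, start, end) runs with 1-based, inclusive residue positions."""
    segments = []
    start = 0
    for i in range(1, len(ss_string) + 1):
        # Close the current run at the end of the string or when the code changes.
        if i == len(ss_string) or ss_string[i] != ss_string[start]:
            segments.append((ss_string[start], start + 1, i))
            start = i
    return segments


# Toy per-residue predictions (special-token positions already removed).
example_predictions = [0, 0, 1, 1, 1, 1, 0, 2, 2, 2, 0]
ss = to_ss_string(example_predictions)
segments = find_segments(ss)
print(ss)
for code, start, end in segments:
    print(f"{code}: residues {start}-{end}")
```

Passing the `predictions` array from the snippet above to `to_ss_string` should work the same way, since each class ID is converted with `int()` before the lookup.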