---
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm-2
- sequence classifier
- proteins
- protein language model
---
# ESM-2 Sequence Classifier
This is a small sequence classifier, trained on synthetic data generated by GPT-4, which classifies protein sequences into three categories: `enzymes`, `transport_proteins`, and `structural_proteins`.

To use the model, try running:
```python
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification

# Load the trained model and tokenizer
model = EsmForSequenceClassification.from_pretrained("./esm2_t6_8M_UR50D_sequence_classifier_v1")
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
# Suppose these are your new sequences that you want to classify
# Additional Family 0: Enzymes
new_sequences_0 = [
"ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK",
"GVALDECKALDYLPGKPLPMDGKVCQCGSKTPLRP",
"VLPGYTCGELDCKPGKPLPKCGADKTQVATPFLRG",
"TCGALVQYPSCADPPVLRGSDSSVKACKKLDPQDK",
"GALCEECKLCPGADYKPMDGDRLPAAATSKTRPVG",
"PAVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYG",
"VLGYTCGALDCKPGKPLPKCGADKTQVATPFLRGA",
"CGALVQYPSCADPPVLRGSDSSVKACKKLDPQDKT",
"ALCEECKLCPGADYKPMDGDRLPAAATSKTRPVGK",
"AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR",
]
# Additional Family 1: Transport Proteins
new_sequences_1 = [
"VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD",
"KDQVLTVPTYACRCCPKMDSKGRVPSTLRVKSARS",
"PLAGVACGRGLDYRCPRKMVPGDLQVTPATQRPYG",
"CGVRLGYPGCADVPLRGRSSFAPRACMKKDPRVTR",
"RKGVAYLYECRKLRCRADYKPRGMDGRRLPKASTT",
"RPTGAVNCKQAKVYRGLPLPMMGKVPRVCRSRRPY",
"RLDGGYTCGQALDCKPGRKPPKMGCADLKSTVATP",
"LGTCRKLVRYPQCADPPVMGRSSFRPKACCRQDPV",
"RVGYAMCSPKLCSCRADYKPPMGDGDRLPKAATSK",
"QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY",
]
# Additional Family 2: Structural Proteins
new_sequences_2 = [
"VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT",
"KDPTVMTVGTYSCQCPKQDSRGSVQPTSRVKTSRSK",
"PLVGKACGRSSDYKCPGQMVSGGSKQTPASQRPSYD",
"CGKKLVGYPSSKADVPLQGRSSFSPKACKKDPQMTS",
"RKGVASLYCSSKLSCKAQYSKGMSDGRSPKASSTTS",
"RPKSAASCEQAKSYRSLSLPSMKGKVPSKCSRSKRP",
"RSDVSYTSCSQSKDCKPSKPPKMSGSKDSSTVATPS",
"LSTCSKKVAYPSSKADPPSSGRSSFSMKACKKQDPPV",
"RVGSASSEPKSSCSVQSYSKPSMSGDSSPKASSTSK",
"QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ",
]
# Merge all sequences
new_sequences = new_sequences_0 + new_sequences_1 + new_sequences_2

# Tokenize the sequences and convert to tensors
inputs = tokenizer(new_sequences, return_tensors="pt", padding=True, truncation=True)
# Use the model to get the logits
with torch.no_grad():
    logits = model(**inputs).logits
# Get the predicted class for each sequence
predicted_class_ids = torch.argmax(logits, dim=-1)
# Print the predicted class for each sequence
for sequence, predicted_class in zip(new_sequences, predicted_class_ids):
    print(f"Sequence: {sequence}, Predicted class: {predicted_class.item()}")
```
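To turn the integer class IDs into readable labels, you can map them back to the three categories. The sketch below continues from the snippet above and *assumes* that class IDs 0, 1, and 2 correspond to the families in the order listed; verify this against `model.config.id2label` before relying on it:

```python
# Assumed label order, matching the family numbering above; check
# model.config.id2label to confirm the mapping for this checkpoint.
id2label = {0: "enzymes", 1: "transport_proteins", 2: "structural_proteins"}

# Softmax the logits to get per-class probabilities
probabilities = torch.softmax(logits, dim=-1)
for sequence, class_id, probs in zip(new_sequences, predicted_class_ids, probabilities):
    print(f"{sequence} -> {id2label[class_id.item()]} (confidence: {probs.max().item():.2f})")
```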
The classifier was trained from [facebook/esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D), one of
Meta AI's [ESM-2 models](https://huggingface.co/docs/transformers/model_doc/esm).
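
For reference, a classifier of this kind can be built by fine-tuning the base checkpoint with a sequence-classification head. The sketch below is a minimal assumption about the training setup, not the actual training script used for this checkpoint; the dataset, labels, and hyperparameters are placeholders:

```python
# A minimal fine-tuning sketch, assuming a labeled dataset of (sequence, label)
# pairs with integer labels 0-2. The real training data and hyperparameters
# for this checkpoint are not documented here.
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, EsmForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D", num_labels=3)

class SequenceDataset(Dataset):
    def __init__(self, sequences, labels):
        self.encodings = tokenizer(sequences, padding=True, truncation=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Hypothetical toy example; replace with your own sequences and labels.
train_dataset = SequenceDataset(["ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK"], [0])

args = TrainingArguments(
    output_dir="esm2_t6_8M_UR50D_sequence_classifier_v1",
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```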