---
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm-2
- sequence classifier
- proteins
- protein language model
pipeline_tag: zero-shot-classification
---

# ESM-2 Sequence Classifier
This is a small sequence classifier trained on synthetic data generated by GPT-4. 
It classifies protein sequences into three categories: `enzymes` (class `0`), `receptor_proteins` (class `1`), and `structural_proteins` (class `2`). 
It was trained using [facebook/esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D), one of the [ESM-2 models](https://huggingface.co/docs/transformers/model_doc/esm). 

This model is not well tested and is intended for experimental and educational purposes only. Use with caution. 

## Using the Model
To use the model, try running:

```python
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification

# Load the trained model and tokenizer
model = EsmForSequenceClassification.from_pretrained("./esm2_t6_8M_UR50D_sequence_classifier_v1")
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Suppose these are your new sequences that you want to classify
# Additional Family 0: Enzymes
new_sequences_0 = [
    "ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK",
    "GVALDECKALDYLPGKPLPMDGKVCQCGSKTPLRP",
    "VLPGYTCGELDCKPGKPLPKCGADKTQVATPFLRG",
    "TCGALVQYPSCADPPVLRGSDSSVKACKKLDPQDK",
    "GALCEECKLCPGADYKPMDGDRLPAAATSKTRPVG",
    "PAVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYG",
    "VLGYTCGALDCKPGKPLPKCGADKTQVATPFLRGA",
    "CGALVQYPSCADPPVLRGSDSSVKACKKLDPQDKT",
    "ALCEECKLCPGADYKPMDGDRLPAAATSKTRPVGK",
    "AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR",
]

# Additional Family 1: Receptor Proteins
new_sequences_1 = [
    "VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD",
    "KDQVLTVPTYACRCCPKMDSKGRVPSTLRVKSARS",
    "PLAGVACGRGLDYRCPRKMVPGDLQVTPATQRPYG",
    "CGVRLGYPGCADVPLRGRSSFAPRACMKKDPRVTR",
    "RKGVAYLYECRKLRCRADYKPRGMDGRRLPKASTT",
    "RPTGAVNCKQAKVYRGLPLPMMGKVPRVCRSRRPY",
    "RLDGGYTCGQALDCKPGRKPPKMGCADLKSTVATP",
    "LGTCRKLVRYPQCADPPVMGRSSFRPKACCRQDPV",
    "RVGYAMCSPKLCSCRADYKPPMGDGDRLPKAATSK",
    "QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY",
]

# Additional Family 2: Structural Proteins
new_sequences_2 = [
    "VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT",
    "KDPTVMTVGTYSCQCPKQDSRGSVQPTSRVKTSRSK",
    "PLVGKACGRSSDYKCPGQMVSGGSKQTPASQRPSYD",
    "CGKKLVGYPSSKADVPLQGRSSFSPKACKKDPQMTS",
    "RKGVASLYCSSKLSCKAQYSKGMSDGRSPKASSTTS",
    "RPKSAASCEQAKSYRSLSLPSMKGKVPSKCSRSKRP",
    "RSDVSYTSCSQSKDCKPSKPPKMSGSKDSSTVATPS",
    "LSTCSKKVAYPSSKADPPSSGRSSFSMKACKKQDPPV",
    "RVGSASSEPKSSCSVQSYSKPSMSGDSSPKASSTSK",
    "QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ",
]

# Merge all sequences into one batch
new_sequences = new_sequences_0 + new_sequences_1 + new_sequences_2

# Tokenize the sequences and convert to padded PyTorch tensors
inputs = tokenizer(new_sequences, return_tensors="pt", padding=True, truncation=True)

# Use the model to get the logits
with torch.no_grad():
    logits = model(**inputs).logits

# Get the predicted class for each sequence
predicted_class_ids = torch.argmax(logits, dim=-1)

# Print the predicted class for each sequence
for sequence, predicted_class in zip(new_sequences, predicted_class_ids):
    print(f"Sequence: {sequence}, Predicted class: {predicted_class.item()}")
```
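
The snippet above prints raw class ids. As a minimal sketch of how those ids could be turned into readable labels with a confidence score, the following uses only the Python standard library. The `ID2LABEL` mapping follows the class numbering stated in this model card; the helper names `softmax` and `describe_prediction` and the example logits are hypothetical and for illustration only, not values produced by the model.

```python
import math

# Class ids and names as stated in this model card.
ID2LABEL = {0: "enzymes", 1: "receptor_proteins", 2: "structural_proteins"}


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits):
    """Return (label, confidence) for one row of logits."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[class_id], probs[class_id]


# Hypothetical logits for a single sequence (made-up numbers, not model output).
example_logits = [0.2, 2.1, -0.5]
label, confidence = describe_prediction(example_logits)
print(f"Predicted label: {label} (confidence {confidence:.2f})")
```

With the tensors from the earlier snippet, one row could be passed in as `describe_prediction(logits[i].tolist())`. Given that the model is described as not well tested, the confidence value is best read as a rough indicator rather than a calibrated probability.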