AmelieSchreiber committed
Commit fb2ce95
1 Parent(s): 4e2b988

Update README.md

Files changed (1): README.md (+79, -0)
README.md CHANGED
@@ -1,3 +1,82 @@
---
license: mit
+ language:
+ - en
+ library_name: transformers
+ tags:
+ - esm
+ - esm-2
+ - sequence classifier
+ - proteins
+ - protein language model
---
+
+ # ESM-2 Sequence Classifier
+ This is a small sequence classifier trained on synthetic data generated by GPT-4.
+ It classifies protein sequences into three categories: `enzymes`, `transport_proteins`, and `structural_proteins`.
+ To use the model, try running:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, EsmForSequenceClassification
+
+ # Load the trained model and tokenizer
+ model = EsmForSequenceClassification.from_pretrained("./esm2_t6_8M_UR50D_sequence_classifier_v1")
+ tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
+
+ # Suppose these are your new sequences that you want to classify
+ # Additional Family 0: Enzymes
+ new_sequences_0 = [
+     "ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK",
+     "GVALDECKALDYLPGKPLPMDGKVCQCGSKTPLRP",
+     "VLPGYTCGELDCKPGKPLPKCGADKTQVATPFLRG",
+     "TCGALVQYPSCADPPVLRGSDSSVKACKKLDPQDK",
+     "GALCEECKLCPGADYKPMDGDRLPAAATSKTRPVG",
+     "PAVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYG",
+     "VLGYTCGALDCKPGKPLPKCGADKTQVATPFLRGA",
+     "CGALVQYPSCADPPVLRGSDSSVKACKKLDPQDKT",
+     "ALCEECKLCPGADYKPMDGDRLPAAATSKTRPVGK",
+     "AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR",
+ ]
+
+ # Additional Family 1: Receptor Proteins
+ new_sequences_1 = [
+     "VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD",
+     "KDQVLTVPTYACRCCPKMDSKGRVPSTLRVKSARS",
+     "PLAGVACGRGLDYRCPRKMVPGDLQVTPATQRPYG",
+     "CGVRLGYPGCADVPLRGRSSFAPRACMKKDPRVTR",
+     "RKGVAYLYECRKLRCRADYKPRGMDGRRLPKASTT",
+     "RPTGAVNCKQAKVYRGLPLPMMGKVPRVCRSRRPY",
+     "RLDGGYTCGQALDCKPGRKPPKMGCADLKSTVATP",
+     "LGTCRKLVRYPQCADPPVMGRSSFRPKACCRQDPV",
+     "RVGYAMCSPKLCSCRADYKPPMGDGDRLPKAATSK",
+     "QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY",
+ ]
+
+ # Additional Family 2: Structural Proteins
+ new_sequences_2 = [
+     "VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT",
+     "KDPTVMTVGTYSCQCPKQDSRGSVQPTSRVKTSRSK",
+     "PLVGKACGRSSDYKCPGQMVSGGSKQTPASQRPSYD",
+     "CGKKLVGYPSSKADVPLQGRSSFSPKACKKDPQMTS",
+     "RKGVASLYCSSKLSCKAQYSKGMSDGRSPKASSTTS",
+     "RPKSAASCEQAKSYRSLSLPSMKGKVPSKCSRSKRP",
+     "RSDVSYTSCSQSKDCKPSKPPKMSGSKDSSTVATPS",
+     "LSTCSKKVAYPSSKADPPSSGRSSFSMKACKKQDPPV",
+     "RVGSASSEPKSSCSVQSYSKPSMSGDSSPKASSTSK",
+     "QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ",
+ ]
+
+ # Merge all sequences, then tokenize them and convert to tensors
+ new_sequences = new_sequences_0 + new_sequences_1 + new_sequences_2
+ inputs = tokenizer(new_sequences, return_tensors="pt", padding=True, truncation=True)
+
+ # Use the model to get the logits
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ # Get the predicted class for each sequence
+ predicted_class_ids = torch.argmax(logits, dim=-1)
+
+ # Print the predicted class for each sequence
+ for sequence, predicted_class in zip(new_sequences, predicted_class_ids):
+     print(f"Sequence: {sequence}, Predicted class: {predicted_class.item()}")
+ ```
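The loop above prints integer class ids. For a more readable report, you can map ids to names and convert the logits into probabilities with a softmax. The sketch below reuses `logits` and `new_sequences` from the example above; the `id2label` mapping is an assumption based on the three categories listed in this card rather than something read from the model config, so verify it against the training label order before relying on it.

```python
import torch

# Assumed id -> label mapping, based on the categories named in this card;
# the fine-tuned config may only expose generic LABEL_0/1/2 names.
id2label = {0: "enzymes", 1: "transport_proteins", 2: "structural_proteins"}

# Convert the logits from the example above into per-class probabilities
probs = torch.softmax(logits, dim=-1)
predicted_class_ids = probs.argmax(dim=-1)

for sequence, class_id, p in zip(new_sequences, predicted_class_ids, probs):
    print(f"Sequence: {sequence}, "
          f"Predicted: {id2label[class_id.item()]} (p = {p[class_id].item():.3f})")
```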