Update README.md
README.md (changed)

Removed (previous version):
@@ -2,45 +2,23 @@
license: mit
library_name: transformers
datasets:
-
-
pipeline_tag:
tags:
- protein language model
- biology
widget:
- text: >-
    M G L A Y [SEP] M I N L P S L F V P L V G L L F P A V A M A S L F L H V E K
    R L L F S T K K I N
  example_title: Non-interacting proteins
- text: >-
    M S I N I C R D N H D P F Y R Y K M P P I Q A K V E G R G N G I K T A V L N
    V A D I S H A L N R P A P Y I V K Y F G F E L G A Q T S I S V D K D R Y L V
    N G V H E P A K L Q D V L D G F I N K F V L C G S C K N P E T E I I I T K D
    N D L V R D C K A C G K R T P M D L R H K L S S F I L K N P P D S V S G S K
    K K K K A A T A S A N V R G G G L S I S D I A Q G K S Q N A P S D G T G S S
    T P Q H H D E D E D E L S R Q I K A A A S T L E D I E V K D D E W A V D M S
    E E A I R A R A K E L E V N S E L T Q L D E Y G E W I L E Q A G E D K E N L
    P S D V E L Y K K A A E L D V L N D P K I G C V L A Q C L F D E D I V N E I
    A E H N A F F T K I L V T P E Y E K N F M G G I E R F L G L E H K D L I P L
    L P K I L V Q L Y N N D I I S E E E I M R F G T K S S K K F V P K E V S K K
    V R R A A K P F I T W L E T A E S D D D E E D D E [SEP] M S I E N L K S F D
    P F A D T G D D E T A T S N Y I H I R I Q Q R N G R K T L T T V Q G V P E E
    Y D L K R I L K V L K K D F A C N G N I V K D P E M G E I I Q L Q G D Q R A
    K V C E F M I S Q L G L Q K K N I K I H G F
  example_title: Interacting proteins
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/Ro4uhQDurP-x7IHJj11xa.png" width="350">

## Model description

SYNTERACT achieved unprecedented performance across a vast phylogeny, with 92-96% accuracy on real unseen examples, and is already being used to accelerate drug target screening and peptide therapeutic design.

## How to use
@@ -50,41 +28,29 @@ SYNTERACT achieved unprecedented performance over vast phylogeny with 92-96% acc
```python
import re
import torch
import torch.nn.functional as F
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained('lhallee/...') # load model; a sequence-classification head is assumed, checkpoint path elided
tokenizer = BertTokenizer.from_pretrained('lhallee/...') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence_a = '...' # first protein sequence
sequence_b = '...' # second protein sequence
sequence_a = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence_a))) # need spaces in between amino acids
sequence_b = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence_b))) # replace rare amino acids with X
example = sequence_a + ' [SEP] ' + sequence_b # add SEP token

example = tokenizer(example, return_tensors='pt', padding=False).to(device) # tokenize example
with torch.no_grad():
    logits = model(**example).logits.cpu() # get classification logits
probability = F.softmax(logits, dim=-1) # softmax over the two classes
prediction = probability.argmax(dim=-1) # 0 for no interaction, 1 for interaction
```
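
For screening many candidate pairs at once, the same pattern extends to batches. The helper below is a minimal sketch rather than part of the card: `predict_interactions` is a hypothetical name, and it assumes the `model`, `tokenizer`, and `device` objects set up above.

```python
def predict_interactions(pairs):
    """Score a batch of (sequence_a, sequence_b) pairs; returns a 0/1 prediction per pair."""
    texts = []
    for seq_a, seq_b in pairs:
        seq_a = ' '.join(list(re.sub(r'[UZOB]', 'X', seq_a)))  # space out residues, mask rare amino acids
        seq_b = ' '.join(list(re.sub(r'[UZOB]', 'X', seq_b)))
        texts.append(seq_a + ' [SEP] ' + seq_b)  # pair the two proteins with a SEP token
    batch = tokenizer(texts, return_tensors='pt', padding=True).to(device)  # pad to the longest pair
    with torch.no_grad():
        logits = model(**batch).logits.cpu()
    return F.softmax(logits, dim=-1).argmax(dim=-1)  # 0 for no interaction, 1 for interaction
```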

## Intended use and limitations

## Our lab
The [Gleghorn lab](https://www.gleghornlab.com/) is an …

## Please cite
@article{Hallee2023.06.07.544109,
    author = {Logan Hallee and Jason P. Gleghorn},
    title = {Protein-Protein Interaction Prediction is Achievable with Large Language Models},
    elocation-id = {2023.06.07.544109},
    year = {2023},
    doi = {10.1101/2023.06.07.544109},
    publisher = {Cold Spring Harbor Laboratory},
    journal = {bioRxiv}
}

Added (updated version):

license: mit
library_name: transformers
datasets:
- CCDS
- Ensembl
pipeline_tag: feature-extraction
tags:
- protein language model
- biology
widget:
- text: >-
    ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C # L Q D T N N F F G A g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p ^ v ]
  example_title: Example CCDS embedding extraction
---

# cdsBERT
<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/yA-f7tnvNNV52DK2QYNq_.png" width="350">

## Model description

## How to use
```python
import re
import torch
import torch.nn.functional as F
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
sequence = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence))) # need spaces in between amino acids, replace rare amino acids with X

example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example
with torch.no_grad():
    matrix_embedding = model.bert(**example).last_hidden_state.squeeze(0).cpu() # per-residue embeddings from the base encoder, shape (length, hidden_size)

vector_embedding = matrix_embedding.mean(dim=0) # mean-pool into a single sequence embedding, shape (hidden_size,)
```
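
A common next step with these embeddings is comparing coding sequences to one another. The snippet below is an illustrative sketch rather than part of the card: the `embed` helper and `seq_b` are hypothetical, and it reuses the `model`, `tokenizer`, and `device` objects set up above, mirroring the mean-pooling shown there.

```python
def embed(seq):
    """Mean-pool the last hidden states into one vector per coding sequence."""
    seq = ' '.join(list(re.sub(r'[UZOB]', 'X', seq)))  # space out residues, mask rare amino acids
    tokens = tokenizer(seq, return_tensors='pt', padding=False).to(device)
    with torch.no_grad():
        hidden = model.bert(**tokens).last_hidden_state.squeeze(0).cpu()  # (length, hidden_size)
    return hidden.mean(dim=0)  # (hidden_size,)

seq_a = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]'  # CCDS207.1, as above
seq_b = '...'  # placeholder for a second CCDS-style sequence
similarity = F.cosine_similarity(embed(seq_a), embed(seq_b), dim=0)  # cosine similarity in [-1, 1]
```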

## Intended use and limitations

## Our lab
The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware focused on translational problems in biomedicine. Recently, we have begun exploring protein language models and are passionate about excellent protein design and annotation.

## Please cite
Coming soon!