Update README.md
README.md (changed)

Removed (previous version):
@@ -2,45 +2,23 @@
license: mit
library_name: transformers
datasets:
-
-
pipeline_tag:
tags:
- protein language model
- biology
widget:
- text: >-
    M G L A Y [SEP] M I N L P S L F V P L V G L L F P A V A M A S L F L H V E K
    R L L F S T K K I N
  example_title: Non-interacting proteins
- text: >-
    M S I N I C R D N H D P F Y R Y K M P P I Q A K V E G R G N G I K T A V L N
    V A D I S H A L N R P A P Y I V K Y F G F E L G A Q T S I S V D K D R Y L V
    N G V H E P A K L Q D V L D G F I N K F V L C G S C K N P E T E I I I T K D
    N D L V R D C K A C G K R T P M D L R H K L S S F I L K N P P D S V S G S K
    K K K K A A T A S A N V R G G G L S I S D I A Q G K S Q N A P S D G T G S S
    T P Q H H D E D E D E L S R Q I K A A A S T L E D I E V K D D E W A V D M S
    E E A I R A R A K E L E V N S E L T Q L D E Y G E W I L E Q A G E D K E N L
    P S D V E L Y K K A A E L D V L N D P K I G C V L A Q C L F D E D I V N E I
    A E H N A F F T K I L V T P E Y E K N F M G G I E R F L G L E H K D L I P L
    L P K I L V Q L Y N N D I I S E E E I M R F G T K S S K K F V P K E V S K K
    V R R A A K P F I T W L E T A E S D D D E E D D E [SEP] M S I E N L K S F D
    P F A D T G D D E T A T S N Y I H I R I Q Q R N G R K T L T T V Q G V P E E
    Y D L K R I L K V L K K D F A C N G N I V K D P E M G E I I Q L Q G D Q R A
    K V C E F M I S Q L G L Q K K N I K I H G F
  example_title: Interacting proteins
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/Ro4uhQDurP-x7IHJj11xa.png" width="350">

## Model description

SYNTERACT achieved unprecedented performance across a vast phylogeny, with 92-96% accuracy on real unseen examples, and is already being used to accelerate drug target screening and peptide therapeutic design.

## How to use
@@ -50,41 +28,29 @@ SYNTERACT achieved unprecedented performance over vast phylogeny with 92-96% acc
```python
import re
import torch
import torch.nn.functional as F
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained('lhallee/...') # load model; a sequence-classification head is assumed, checkpoint path elided
tokenizer = BertTokenizer.from_pretrained('lhallee/...') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence_a = '...' # first protein sequence
sequence_b = '...' # second protein sequence
sequence_a = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence_a))) # need spaces in between amino acids
sequence_b = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence_b))) # replace rare amino acids with X
example = sequence_a + ' [SEP] ' + sequence_b # add SEP token

example = tokenizer(example, return_tensors='pt', padding=False).to(device) # tokenize example
with torch.no_grad():
    logits = model(**example).logits.cpu() # get classification logits
probability = F.softmax(logits, dim=-1) # softmax over the two classes
prediction = probability.argmax(dim=-1) # 0 for no interaction, 1 for interaction
```
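
For screening many candidate pairs at once, the same pattern extends to batches. The helper below is a minimal sketch rather than part of the card: `predict_interactions` is a hypothetical name, and it assumes the `model`, `tokenizer`, and `device` objects set up above.

```python
def predict_interactions(pairs):
    """Score a batch of (sequence_a, sequence_b) pairs; returns a 0/1 prediction per pair."""
    texts = []
    for seq_a, seq_b in pairs:
        seq_a = ' '.join(list(re.sub(r'[UZOB]', 'X', seq_a)))  # space out residues, mask rare amino acids
        seq_b = ' '.join(list(re.sub(r'[UZOB]', 'X', seq_b)))
        texts.append(seq_a + ' [SEP] ' + seq_b)  # pair the two proteins with a SEP token
    batch = tokenizer(texts, return_tensors='pt', padding=True).to(device)  # pad to the longest pair
    with torch.no_grad():
        logits = model(**batch).logits.cpu()
    return F.softmax(logits, dim=-1).argmax(dim=-1)  # 0 for no interaction, 1 for interaction
```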

## Intended use and limitations

## Our lab
The [Gleghorn lab](https://www.gleghornlab.com/) is an …

## Please cite
@article{Hallee2023.06.07.544109,
    author = {Logan Hallee and Jason P. Gleghorn},
    title = {Protein-Protein Interaction Prediction is Achievable with Large Language Models},
    elocation-id = {2023.06.07.544109},
    year = {2023},
    doi = {10.1101/2023.06.07.544109},
    publisher = {Cold Spring Harbor Laboratory},
    journal = {bioRxiv}
}

Added (updated version):

license: mit
library_name: transformers
datasets:
- CCDS
- Ensembl
pipeline_tag: feature-extraction
tags:
- protein language model
- biology
widget:
- text: >-
    ( Z E V L P Y G D E K L S P Y G D G G D V G Q I F s C # L Q D T N N F F G A g Q N K % O P K L G Q I G % S K % u u i e d d R i d D V L k n ( T D K @ p p ^ v ]
  example_title: Example CCDS embedding extraction
---

# cdsBERT
<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/yA-f7tnvNNV52DK2QYNq_.png" width="350">

## Model description

## How to use
```python
import re
import torch
import torch.nn.functional as F
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained('lhallee/cdsBERT') # load model
tokenizer = BertTokenizer.from_pretrained('lhallee/cdsBERT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]' # CCDS207.1|Hs110|chr1
sequence = ' '.join(list(re.sub(r'[UZOB]', 'X', sequence))) # need spaces in between amino acids, replace rare amino acids with X

example = tokenizer(sequence, return_tensors='pt', padding=False).to(device) # tokenize example
with torch.no_grad():
    matrix_embedding = model.bert(**example).last_hidden_state.squeeze(0).cpu() # per-residue embeddings from the base encoder, shape (length, hidden_size)

vector_embedding = matrix_embedding.mean(dim=0) # mean-pool into a single sequence embedding, shape (hidden_size,)
```
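
A common next step with these embeddings is comparing coding sequences to one another. The snippet below is an illustrative sketch rather than part of the card: the `embed` helper and `seq_b` are hypothetical, and it reuses the `model`, `tokenizer`, and `device` objects set up above, mirroring the mean-pooling shown there.

```python
def embed(seq):
    """Mean-pool the last hidden states into one vector per coding sequence."""
    seq = ' '.join(list(re.sub(r'[UZOB]', 'X', seq)))  # space out residues, mask rare amino acids
    tokens = tokenizer(seq, return_tensors='pt', padding=False).to(device)
    with torch.no_grad():
        hidden = model.bert(**tokens).last_hidden_state.squeeze(0).cpu()  # (length, hidden_size)
    return hidden.mean(dim=0)  # (hidden_size,)

seq_a = '(ZEVLPYGDEKLSPYGDGGDVGQIFsC#LQDTNNFFGAgQNK%OPKLGQIG%SK%uuieddRidDVLkn(TDK@pp^v]'  # CCDS207.1, as above
seq_b = '...'  # placeholder for a second CCDS-style sequence
similarity = F.cosine_similarity(embed(seq_a), embed(seq_b), dim=0)  # cosine similarity in [-1, 1]
```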

## Intended use and limitations

## Our lab
The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware focused on translational problems in biomedicine. Recently, we have begun exploring protein language models and are passionate about excellent protein design and annotation.

## Please cite
Coming soon!