lhallee committed on
Commit
b194781
1 Parent(s): 8d492cf

Update README.md

  1. README.md +22 -2
README.md CHANGED
@@ -5,10 +5,31 @@ pipeline_tag: text-classification
5
  tags:
6
  - protein language model
7
  ---
 
8
 
 
9
 
 
10
 
11
- Bibtex citation:
12
  @article {Hallee2023.06.07.544109,
13
  author = {Logan Hallee and Jason P. Gleghorn},
14
  title = {Protein-Protein Interaction Prediction is Achievable with Large Language Models},
@@ -16,7 +37,6 @@ Bibtex citation:
16
  year = {2023},
17
  doi = {10.1101/2023.06.07.544109},
18
  publisher = {Cold Spring Harbor Laboratory},
19
- abstract = {Predicting protein-protein interactions (PPIs) is vital for elucidating fundamental biology, designing peptide therapeutics, and for high-throughput protein annotation. This is particularly relevant in the current biotechnology landscape characterized by the proliferation of protein generative models, which necessitate a high-throughput and generalized PPI predictor for proteins regardless of conventional motifs or known biological functions. Our work addresses this need and provides strong evidence of the utility and reliability of protein language models (pLMs) in learning the PPI objective. We demonstrated that with the use of a sizable balanced dataset, pLMs achieve state-of-the-art performance metrics in PPI prediction on diverse proteins. To generate a dataset that allows for the approximation of these conditions, we implemented a novel synthetic data generation scheme to augment BIOGRID and Negatome datasets. The enhancement of these datasets was then used to fine-tune ProtBERT for PPI prediction to develop a model that we call SYNTERACT (SYNThetic data-driven protein-protein intERACtion Transformer). Our results are compelling, demonstrating 92\% accuracy on validated positive and negative interacting pairs derived from 50 different organisms, all of which were excluded from the training phase. In addition to the high metrics, secondary analysis revealed that our synthetic negative data was able to successfully mimic actual negative samples, further reinforcing the integrity of synthetic data additions to PPI datasets. Another notable discovery was the ease in which previously existing PPI datasets could be predicted with simplistic features, calling into question if they can actually inform PPI prediction. 
We find that the subcellular compartment bias inherent to the compilation of these datasets is learnable with deep learning methods and demonstrate that our approach is not burdened by this disadvantage. Competing Interest Statement: The authors have declared no competing interest.},
20
  URL = {https://www.biorxiv.org/content/early/2023/06/09/2023.06.07.544109},
21
  eprint = {https://www.biorxiv.org/content/early/2023/06/09/2023.06.07.544109.full.pdf},
22
  journal = {bioRxiv}
 
5
  tags:
6
  - protein language model
7
  ---
8
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/Ro4uhQDurP-x7IHJj11xa.png)
9
 
10
+ ## Model description
11
 
12
+ SYNTERACT (SYNThetic data-driven protein-protein intERACtion Transformer) is a fine-tuned version of [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) that attends to two amino acid sequences separated by [SEP] to determine whether they plausibly interact in a biological context.
13
 
14
+ We trained our model on the multivalidated physical interaction dataset from BIOGRID, on Negatome, and on synthetic negative samples. See our [preprint](https://www.biorxiv.org/content/10.1101/2023.06.07.544109v1.full) for more details.
15
+
16
+ SYNTERACT achieved 92-96% accuracy on real, unseen examples from phylogenetically diverse organisms, and it is already being used to accelerate drug target screening and peptide therapeutic design.
17
+
18
+
19
+ ## How to use
20
+
21
+ ```python
22
+
23
+
24
+
25
+ ```
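As a minimal sketch of how a model of this kind might be queried, the example below formats two sequences in the `[SEP]`-separated form given in the model description and passes them to a BERT sequence classifier. Several details are assumptions and are not taken from this README: the repository ID `your-namespace/SYNTERACT` is a placeholder, the helper names `format_pair` and `predict_interaction` are hypothetical, the residue cleaning follows the usual ProtBERT input convention, and the order of the output labels is not confirmed.

```python
import re

# Placeholder: replace with the actual Hub repository ID of the model.
MODEL_ID = "your-namespace/SYNTERACT"


def format_pair(seq_a: str, seq_b: str) -> str:
    """Space-separate residues and join two sequences with [SEP]."""

    def prepare(seq: str) -> str:
        # Assumed ProtBERT-style preprocessing: uppercase, map the rare
        # residues U, Z, O, B to X, and put a space between residues.
        cleaned = re.sub(r"[UZOB]", "X", seq.strip().upper())
        return " ".join(cleaned)

    return f"{prepare(seq_a)} [SEP] {prepare(seq_b)}"


def predict_interaction(seq_a: str, seq_b: str, model_id: str = MODEL_ID) -> list:
    """Return class probabilities for a pair (label order is an assumption)."""
    # Imported lazily so format_pair works without torch/transformers installed.
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id, do_lower_case=False)
    model = BertForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(format_pair(seq_a, seq_b), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze(0).tolist()


if __name__ == "__main__":
    # Formatting only; no model download is needed for this line.
    print(format_pair("MKTAYIAK", "MSUQB"))  # M K T A Y I A K [SEP] M S X Q X
```

Calling `predict_interaction(seq_a, seq_b)` with a valid repository ID would download the weights and return one probability per class. Consult the model's configuration to learn which index corresponds to "interacting" before interpreting the result.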
26
+
27
+
28
+
29
+ ## Intended use and limitations
30
+ We define a protein-protein interaction as physical contact that mediates chemical or conformational change, especially with non-generic function. However, because SYNTERACT is prone to predicting false positives, we believe that it identifies plausible conformational changes caused by interactions that have no relevance to function. Predictions by SYNTERACT should therefore be treated with caution and used for hypothesis generation or secondary validation.
31
+
32
+ ## Please cite
33
  @article {Hallee2023.06.07.544109,
34
  author = {Logan Hallee and Jason P. Gleghorn},
35
  title = {Protein-Protein Interaction Prediction is Achievable with Large Language Models},
 
37
  year = {2023},
38
  doi = {10.1101/2023.06.07.544109},
39
  publisher = {Cold Spring Harbor Laboratory},
 
40
  URL = {https://www.biorxiv.org/content/early/2023/06/09/2023.06.07.544109},
41
  eprint = {https://www.biorxiv.org/content/early/2023/06/09/2023.06.07.544109.full.pdf},
42
  journal = {bioRxiv}