---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
- protein language model
license: cc
base_model: Rostlab/prot_bert_bfd
---
# Protein Matryoshka Embeddings
This model generates an embedding for each input protein sequence. It was trained using [Matryoshka loss](https://huggingface.co/blog/matryoshka),
so shortened embeddings can be used for faster search and other tasks.
Inputs use [IUPAC-IUB codes](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation), where letters A-Z map to amino acids, with a space between each letter. For example:
"M A R N W S F R V"
The base model was [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd).
A [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model was trained to reproduce the cosine similarity between pairs of embeddings
from [UniProt](https://www.uniprot.org/help/downloads#embeddings).
For train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
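
Sequences copied from a FASTA file usually have no spaces between residues. A minimal sketch of converting one into the space-separated form shown in the example above; the helper name `format_sequence` is hypothetical and not part of the model's code:

```python
def format_sequence(raw: str) -> str:
    """Return an uppercase residue string with one space between letters."""
    # Drop any whitespace or line breaks, then uppercase the letters.
    residues = "".join(raw.split()).upper()
    # Put a single space between consecutive residues.
    return " ".join(residues)

print(format_sequence("marnwsfrv"))  # M A R N W S F R V
```
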
## Usage
Install these dependencies:
```
pip install -U sentence-transformers datasets
```
Generating embeddings:
```python
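from sentence_transformers import SentenceTransformer

# Each input is a protein sequence with a space between amino acid letters
sequences = ["M S L E Q K...", "M A R N W S F R V..."]
model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
# encode() returns one embedding vector per input sequence
embeddings = model.encode(sequences)
print(embeddings)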
```
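
Because the model was trained with Matryoshka loss, the leading dimensions of each embedding can be used on their own. A minimal sketch of shortening embeddings to 128 dimensions (the size compared in the Validation section) and re-normalizing them before computing cosine similarity. The `shorten` helper is hypothetical, and random vectors stand in for the output of `model.encode` so the snippet runs without downloading the model:

```python
import numpy as np

def shorten(embeddings: np.ndarray, dim: int = 128) -> np.ndarray:
    """Keep the first `dim` values of each row and rescale rows to unit length."""
    truncated = np.asarray(embeddings, dtype=float)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for model.encode(sequences); the full width of 1024 is an assumption.
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1024))

short = shorten(full, dim=128)
# Rows are unit length, so the dot product equals the cosine similarity.
similarity = float(short[0] @ short[1])
print(short.shape, similarity)
```

Shorter vectors reduce storage and speed up nearest-neighbor search, at some cost in accuracy.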
## Training + Code
CoLab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
Results on 1,000 protein pairs from the validation dataset, during training:
|steps|cosine_pearson|cosine_spearman|
|-----|--------------|---------------|
|3000|0.8598688660086558|0.8666855900999677|
|6000|0.8692703523988448|0.8615673651584274|
|9000|0.8779733537629968|0.8754158959780602|
|12000|0.8877422045031667|0.8881492475969834|
|15000|0.9027359688395733|0.899106724739699|
|18000|0.9046675789738002|0.9044183600191271|
|21000|0.9165801536390973|0.9061381997421003|
|24000|0.9128046401341833|0.9076748537082228|
|27000|0.918547416546341|0.9127677526055185|
|30000|0.9239429677657788|0.9187051589781693|
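
The column names suggest the usual similarity-evaluation statistics: the Pearson and Spearman correlation between the cosine similarity of the model's embedding pairs and the target similarity for the same pairs. A rough sketch of both statistics using NumPy; the function names and toy values are illustrative only, and the rank step assumes no tied values:

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation: linear agreement between two lists of scores."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y) -> float:
    """Spearman correlation: Pearson correlation of the ranks (no ties assumed)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

# Toy values: cosine similarities from the model vs. target similarities.
predicted = [0.10, 0.40, 0.35, 0.80]
target = [0.20, 0.50, 0.30, 0.90]

print(pearson(predicted, target), spearman(predicted, target))
```
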
## Validation
Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing
## Finetuning / Tasks
One of the more popular evaluations is [Tasks Assessing Protein Embeddings (TAPE)](https://github.com/songlab-cal/tape).
Example using scikit-learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing
Example using scikit-learn to train on a classification task from [greenbeing-binary](https://huggingface.co/datasets/monsoon-nlp/greenbeing-binary): https://colab.research.google.com/drive/1MCTn8f3oeIKpB6n_8mPumet3ukm7GD8a?usp=sharing
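
One common pattern for downstream tasks like these is to embed each sequence once and then fit a small scikit-learn model on the resulting vectors. A minimal sketch of that pattern for a regression target; random arrays stand in for the embeddings and labels, and `Ridge` is an arbitrary choice here, not necessarily the estimator used in the notebooks:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-ins: in practice X would come from model.encode(sequences)
# and y from the task's labels (e.g. one fluorescence value per protein).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = Ridge(alpha=1.0)
regressor.fit(X_train, y_train)
r2 = regressor.score(X_test, y_test)  # R^2 on the held-out rows
print(f"held-out R^2: {r2:.3f}")
```

For a classification task, the same structure applies with a classifier such as `LogisticRegression` in place of `Ridge`.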
## Future
This page will be updated when I have examples using this model on protein classification tasks.
I'm interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be even more efficient.
If you want to collaborate on future projects or have resources to train longer on more embeddings, please get in touch.
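
As a rough illustration of the quantization idea mentioned above, binary quantization keeps only the sign of each embedding value and packs the bits into bytes, so vectors can be compared by Hamming distance. This sketch uses random stand-in vectors and hypothetical helper names; it has not been evaluated with this model:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Map each value to one bit (positive -> 1) and pack 8 bits per byte."""
    return np.packbits(np.asarray(embeddings) > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Count the differing bits between two packed vectors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
full = rng.normal(size=(2, 128)).astype(np.float32)  # stand-in embeddings
packed = binarize(full)

print(full.nbytes, packed.nbytes)  # 1024 bytes vs. 32 bytes
print(hamming(packed[0], packed[1]))
```
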