---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
license: cc
base_model: Rostlab/prot_bert_bfd
---

# Protein Matryoshka Embeddings

This model generates embeddings for input proteins. It was trained with [Matryoshka loss](https://huggingface.co/blog/matryoshka), so shortened embeddings can be used for faster search and other tasks (a truncation sketch appears at the end of this page).

Inputs use [IUPAC-IUB codes](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation), i.e. single-letter amino acid codes (A-Z) separated by spaces. For example: `M A R N W S F R V`

The base model was [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd). A [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model was trained on the cosine similarity of protein-pair embeddings from [UniProt](https://www.uniprot.org/help/downloads#embeddings). For the train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot

## Usage

Install these dependencies:

```
pip install -U sentence-transformers datasets
```

Generating embeddings:

```python
from sentence_transformers import SentenceTransformer

# protein sequences as space-separated single-letter amino acid codes
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
embeddings = model.encode(sequences)
print(embeddings)
```

## Training + Code

Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing

Results during training, evaluated on 1,000 protein pairs from the validation dataset:

|steps|cosine_pearson|cosine_spearman|
|-----|--------------|---------------|
|3000|0.8598688660086558|0.8666855900999677|
|6000|0.8692703523988448|0.8615673651584274|
|9000|0.8779733537629968|0.8754158959780602|
|12000|0.8877422045031667|0.8881492475969834|
|15000|0.9027359688395733|0.899106724739699|
|18000|0.9046675789738002|0.9044183600191271|
|21000|0.9165801536390973|0.9061381997421003|
|24000|0.9128046401341833|0.9076748537082228|
|27000|0.918547416546341|0.9127677526055185|
|30000|0.9239429677657788|0.9187051589781693|

## Validation

Scatter plots comparing the full and 128-dimensional embeddings to the original UniProt embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing

## Future

This page will be updated when I have examples of using the model on protein classification tasks. I'm interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be even more efficient (a rough, untested sketch appears below). If you want to collaborate on future projects or have resources to train longer on more embeddings, please get in touch.
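
## Sketches: shortened and quantized embeddings

As a rough illustration of the Matryoshka property mentioned above, the snippet below truncates the full embeddings to their first 128 dimensions (the size compared in the Validation section) and re-normalizes them before computing cosine similarity. This is a minimal sketch, not an official recipe: the sequences are placeholders, and recent sentence-transformers releases may also accept a `truncate_dim` argument to `SentenceTransformer` that does the slicing for you.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
sequences = ["M S L E Q K...", "M A R N W S F R V..."]  # placeholder sequences

# full-size embeddings
embeddings = model.encode(sequences, convert_to_numpy=True)

# keep only the first 128 dimensions (Matryoshka-style truncation) ...
short = embeddings[:, :128]
# ... and re-normalize so cosine similarity behaves as expected
short = short / np.linalg.norm(short, axis=1, keepdims=True)

# cosine similarity between the two truncated embeddings
print(float(short[0] @ short[1]))
```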
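
For the quantization idea in the Future section, here is a minimal, untested sketch using the `quantize_embeddings` helper described in the embedding-quantization blog post linked above (assuming a recent sentence-transformers release; binary precision is shown because it needs no calibration data):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
sequences = ["M S L E Q K...", "M A R N W S F R V..."]  # placeholder sequences

embeddings = model.encode(sequences, convert_to_numpy=True)

# pack each float32 embedding into binary codes for Hamming-distance search;
# retrieval quality on protein data has not been evaluated here
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
print(binary_embeddings.shape, binary_embeddings.dtype)
```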