---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
- protein language model
license: cc
base_model: Rostlab/prot_bert_bfd
---

# Protein Matryoshka Embeddings

This model generates an embedding for an input protein sequence. It was trained using [Matryoshka loss](https://huggingface.co/blog/matryoshka),
so shortened (truncated) embeddings can be used for faster search and other tasks.

Inputs use [IUPAC-IUB codes](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation), where letters A-Z map to amino acids, with a space between each letter. For example:

"M A R N W S F R V"

The base model was [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd).
A [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model was trained on cosine-similarity of embeddings
from [UniProt](https://www.uniprot.org/help/downloads#embeddings).
For train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
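
Sequences are often stored without spaces, while the example input above is space-separated. A minimal sketch of converting between the two follows; the helper name `format_sequence` is hypothetical and not part of this model or of sentence-transformers.

```python
def format_sequence(raw: str) -> str:
    """Upper-case a raw amino acid string and put one space between letters.

    Hypothetical helper: non-letter characters (whitespace, line breaks)
    are dropped, and no validation of the residue codes is attempted.
    """
    letters = [ch for ch in raw.upper() if ch.isalpha()]
    return " ".join(letters)


print(format_sequence("MARNWSFRV"))  # M A R N W S F R V
```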


## Usage

Install these dependencies:

```
pip install -U sentence-transformers datasets
```

Generating embeddings:

```python
from sentence_transformers import SentenceTransformer

# Amino acid letters are separated by spaces
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
# encode() returns one embedding per input sequence
embeddings = model.encode(sequences)
print(embeddings)
```
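
Because the model was trained with Matryoshka loss, an embedding can be shortened by keeping only its leading values and re-normalizing. The sketch below uses random NumPy data as a stand-in for the array returned by `model.encode`, so it runs without downloading the model. The full width of 1024 is an assumption, the target of 128 matches the size used in the Validation section, and `shorten_embeddings` is a hypothetical helper name.

```python
import numpy as np


def shorten_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` values of each row, then L2-normalize each row."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    return truncated / np.clip(norms, 1e-12, None)


# Stand-in for real model output (assumed shape: 2 sequences x 1024 values)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2, 1024))

short = shorten_embeddings(embeddings, 128)
# Rows are unit length, so the dot product is the cosine similarity
similarity = float(short[0] @ short[1])
print(short.shape, similarity)
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may make the manual slicing unnecessary.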


## Training + Code

Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing

Results on 1,000 protein pairs from the validation dataset, during training:

|steps|cosine_pearson|cosine_spearman|
|-----|--------------|---------------|
|3000|0.8598688660086558|0.8666855900999677|
|6000|0.8692703523988448|0.8615673651584274|
|9000|0.8779733537629968|0.8754158959780602|
|12000|0.8877422045031667|0.8881492475969834|
|15000|0.9027359688395733|0.899106724739699|
|18000|0.9046675789738002|0.9044183600191271|
|21000|0.9165801536390973|0.9061381997421003|
|24000|0.9128046401341833|0.9076748537082228|
|27000|0.918547416546341|0.9127677526055185|
|30000|0.9239429677657788|0.9187051589781693|
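
The column names suggest Pearson and Spearman correlations between the model's cosine similarity for each pair and the reference similarity in the dataset. A rough sketch of both metrics follows, assuming that interpretation. The five score pairs are made up for illustration, and the rank helper assumes there are no tied values.

```python
import numpy as np


def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def ranks(values: np.ndarray) -> np.ndarray:
    """Rank positions 0..n-1 (simplification: assumes no tied values)."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=np.float64)
    result[order] = np.arange(len(values))
    return result


def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores: model cosine similarity vs. reference similarity per pair
predicted = np.array([0.91, 0.15, 0.52, 0.78, 0.30])
reference = np.array([0.88, 0.20, 0.47, 0.60, 0.35])

print(pearson(predicted, reference), spearman(predicted, reference))
```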

## Validation

Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing

## Finetuning / Tasks

One of the more popular evaluations is [Tasks Assessing Protein Embeddings (TAPE)](https://github.com/songlab-cal/tape).

Example using scikit-learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing

Example using scikit-learn to train on a classification task from [greenbeing-binary](https://huggingface.co/datasets/monsoon-nlp/greenbeing-binary): https://colab.research.google.com/drive/1MCTn8f3oeIKpB6n_8mPumet3ukm7GD8a?usp=sharing

## Future

This page will be updated when I have examples using it on protein classification tasks.

I'm interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be even more efficient.
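
As a rough illustration of the idea (not something evaluated with this model), binary quantization keeps only the sign of each value and packs the bits into bytes. The sketch below uses random stand-in data, and the helper names `binarize` and `hamming_distance` are hypothetical.

```python
import numpy as np


def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Threshold each value at 0 and pack every 8 bits into one uint8."""
    bits = (np.asarray(embeddings) > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)


def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed vectors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())


# Stand-in for three 128-dim float32 embeddings (512 bytes each)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 128)).astype(np.float32)

packed = binarize(embeddings)  # 16 bytes per embedding
print(packed.shape, hamming_distance(packed[0], packed[1]))
```

Whether similarity rankings survive this compression for protein embeddings would need to be measured.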

If you want to collaborate on future projects or have resources to train longer on more embeddings, please get in touch.