---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
license: cc
base_model: Rostlab/prot_bert_bfd

---

# Protein Matryoshka Embeddings

The model generates embeddings for input protein sequences. It was trained using [Matryoshka loss](https://huggingface.co/blog/matryoshka),
so shortened (truncated) embeddings can be used for faster search and other tasks.

Inputs use [IUPAC-IUB codes](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation), where the letters A-Z map to amino acids and residues are separated by spaces. For example:

`M A R N W S F R V`

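If you start from a raw, unspaced sequence string, one way to produce this format is shown below (a minimal sketch; the sequence is only an illustration):

```python
# Insert a space between residues so each amino acid is tokenized separately
raw_sequence = "MARNWSFRV"  # illustrative placeholder, not a real protein record
spaced = " ".join(raw_sequence)
print(spaced)  # -> "M A R N W S F R V"
```
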
The base model was [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd).
A [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model was trained to predict the cosine similarity of per-protein embeddings
from [UniProt](https://www.uniprot.org/help/downloads#embeddings).
For train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
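
The pairs dataset can be loaded directly with the `datasets` library; printing the resulting `DatasetDict` shows the available splits and column names (a minimal sketch):

```python
from datasets import load_dataset

# Protein pairs with precomputed embedding distances, used for training and evaluation
pairs = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot")
print(pairs)  # inspect splits and columns before building training examples
```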


## Usage

Install these dependencies:

```
pip install -U sentence-transformers datasets
```

Generating embeddings:

```python
from sentence_transformers import SentenceTransformer

# Space-separated amino acid sequences (IUPAC one-letter codes)
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
embeddings = model.encode(sequences)
print(embeddings)
```
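
Because the model was trained with Matryoshka loss, the leading dimensions of each vector carry most of the information, so embeddings can be truncated for cheaper storage and faster search. Continuing from the snippet above, a minimal sketch (the 128-dimension cut-off is just an example):

```python
import numpy as np

# Keep only the first 128 dimensions of each embedding
short_embeddings = embeddings[:, :128]

# Renormalize so cosine similarity / dot-product search still behaves as expected
short_embeddings = short_embeddings / np.linalg.norm(short_embeddings, axis=1, keepdims=True)
print(short_embeddings.shape)
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when loading the model, which performs the truncation for you.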


## Training + Code

Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
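
The notebook has the complete training code; the snippet below is only a rough sketch of a Matryoshka setup with sentence-transformers. The column names (`protein1`, `protein2`, `similarity`), the dimension list, and the hyperparameters are illustrative assumptions rather than the exact values used:

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses

train = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot", split="train")

# Column names are placeholders; check the dataset card for the real ones.
# If the dataset stores cosine *distances*, convert them to similarities first.
examples = [
    InputExample(texts=[row["protein1"], row["protein2"]], label=float(row["similarity"]))
    for row in train
]
loader = DataLoader(examples, shuffle=True, batch_size=16)

# prot_bert_bfd ships without a pooling head, so sentence-transformers adds mean pooling
model = SentenceTransformer("Rostlab/prot_bert_bfd")

# MatryoshkaLoss trains each prefix of the embedding (1024, 512, ... dims) to work on its own
base_loss = losses.CosineSimilarityLoss(model)
loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[1024, 512, 256, 128, 64])

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```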

Results on 1,000 protein pairs from the validation dataset, during training:

|Steps|Cosine Pearson|Cosine Spearman|
|-----|--------------|---------------|
|3,000|0.8599|0.8667|
|6,000|0.8693|0.8616|
|9,000|0.8780|0.8754|
|12,000|0.8877|0.8881|
|15,000|0.9027|0.8991|
|18,000|0.9047|0.9044|
|21,000|0.9166|0.9061|
|24,000|0.9128|0.9077|
|27,000|0.9185|0.9128|
|30,000|0.9239|0.9187|

## Validation

Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing
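
For a quick check along the same lines (not the notebook's exact procedure), you can compare pair similarities computed from full and truncated embeddings against the reference scores in the test split; the column names below are again placeholder assumptions:

```python
import numpy as np
from scipy.stats import spearmanr
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")

# Column names ("protein1", "protein2", "similarity") are assumptions; check the dataset card
test = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot", split="test").select(range(1000))

emb_a = model.encode(test["protein1"])
emb_b = model.encode(test["protein2"])

def pair_cosine(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

full_sims = pair_cosine(emb_a, emb_b)
short_sims = pair_cosine(emb_a[:, :128], emb_b[:, :128])

# How closely do the full and 128-dim similarities track the reference scores and each other?
print(spearmanr(full_sims, test["similarity"]))
print(spearmanr(short_sims, test["similarity"]))
print(spearmanr(full_sims, short_sims))
```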

## Future

This page will be updated when I have examples of using the model on protein classification tasks.

I'm interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be even more efficient.

If you want to collaborate on future projects, or have the resources to train longer on more embeddings, please get in touch.