metadata
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
  - monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
  - sentence-transformers
  - sentence-similarity
  - transformers
  - biology
  - protein language model
license: cc
base_model: Rostlab/prot_bert_bfd

Protein Matryoshka Embeddings

The model generates an embedding for input proteins. It was trained using Matryoshka loss, so shortened embeddings can be used for faster search and other tasks.
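As a minimal sketch of what "shortened" means in practice: with Matryoshka-style training, the usual approach is to keep only the first values of each embedding and rescale the result to unit length before comparing vectors. The helper name `truncate_embedding` and the toy vector are illustrative assumptions, not part of this model's code. Recent sentence-transformers releases also appear to accept a `truncate_dim` argument on `SentenceTransformer`; check your installed version before relying on it.

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` values of an embedding and rescale to unit length.

    `embedding` can be a list of floats or a 1-D NumPy array.
    """
    if dim <= 0 or dim > len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = [float(x) for x in embedding[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector unchanged rather than dividing by zero.
    if norm == 0.0:
        return head
    return [x / norm for x in head]

# Toy 8-dimensional vector standing in for a real model output (assumption).
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.2, 0.3, 0.0]
short = truncate_embedding(full, 4)
print(short)  # [0.5, -0.5, 0.5, 0.5]
```

With real embeddings, `dim=128` would match the shortened size compared in the Validation section.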

Inputs use IUPAC-IUB one-letter codes, where letters A-Z map to amino acids, with a space between letters. For example:

"M A R N W S F R V"
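Sequences from FASTA files or databases usually come without spaces, so a small conversion step may be needed. The following is a minimal sketch; `format_sequence` is a hypothetical helper name, and the validation rule (ASCII letters only) is an assumption based on the A-Z description above.

```python
def format_sequence(raw):
    """Turn a raw one-letter sequence into the space-separated form shown above."""
    # Drop whitespace, including line breaks from FASTA-style wrapping.
    letters = [ch for ch in raw if not ch.isspace()]
    # Reject anything that is not a plain ASCII letter (digits, '*', '-', ...).
    bad = [ch for ch in letters if not (ch.isascii() and ch.isalpha())]
    if bad:
        raise ValueError(f"unexpected characters: {sorted(set(bad))}")
    return " ".join(ch.upper() for ch in letters)

print(format_sequence("MARNWSFRV"))  # M A R N W S F R V
```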

The base model is Rostlab/prot_bert_bfd. A sentence-transformers model was then trained so that the cosine similarity between its embeddings of two proteins matches the cosine similarity of those proteins' embeddings from UniProt. For the train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot

Usage

Install these dependencies:

pip install -U sentence-transformers datasets

Generating embeddings:

from sentence_transformers import SentenceTransformer
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
embeddings = model.encode(sequences)
print(embeddings)
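Once embeddings are available, a typical next step is to compare them. Below is a minimal sketch of cosine similarity and a brute-force ranking, written in plain Python so it runs without the model; the function names and toy vectors are assumptions for illustration. The same functions accept the NumPy rows returned by `model.encode`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    # Treat an all-zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return (index, score) pairs for the candidates, most similar first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for protein embeddings (assumption).
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print([index for index, _ in ranking])  # [1, 2, 0]
```

For large collections, a vector index would replace the brute-force loop; the scoring itself stays the same.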

Training + Code

CoLab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing

Results on 1,000 protein pairs from the validation dataset, during training:

| steps | cosine_pearson | cosine_spearman |
|---|---|---|
| 3000 | 0.8598688660086558 | 0.8666855900999677 |
| 6000 | 0.8692703523988448 | 0.8615673651584274 |
| 9000 | 0.8779733537629968 | 0.8754158959780602 |
| 12000 | 0.8877422045031667 | 0.8881492475969834 |
| 15000 | 0.9027359688395733 | 0.899106724739699 |
| 18000 | 0.9046675789738002 | 0.9044183600191271 |
| 21000 | 0.9165801536390973 | 0.9061381997421003 |
| 24000 | 0.9128046401341833 | 0.9076748537082228 |
| 27000 | 0.918547416546341 | 0.9127677526055185 |
| 30000 | 0.9239429677657788 | 0.9187051589781693 |
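For readers unfamiliar with the two columns: they presumably report the Pearson and Spearman correlation between the model's cosine similarities and the target similarities for each pair. The sketch below shows how those two statistics are defined, using plain Python and made-up numbers; it is not the evaluation code used during training.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length inputs with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up similarities for four protein pairs (illustrative only).
predicted = [0.1, 0.4, 0.35, 0.8]
target = [0.2, 0.5, 0.3, 0.9]
print(round(pearson(predicted, target), 4))
print(round(spearman(predicted, target), 4))
```

Pearson rewards a linear match between predicted and target values, while Spearman only checks that pairs are ordered the same way, which is why the two columns can move in slightly different directions between checkpoints.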

Validation

Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing

Finetuning / Tasks

One of the more popular evaluations is Tasks Assessing Protein Embeddings (TAPE).

Example using scikit-learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing

Example using scikit-learn to train on a classification task from greenbeing-binary: https://colab.research.google.com/drive/1MCTn8f3oeIKpB6n_8mPumet3ukm7GD8a?usp=sharing

Future

This page will be updated when I have examples using it on protein classification tasks.

I'm interested in whether embedding quantization could be even more efficient.
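To make the idea concrete, here is a minimal sketch of one common form of embedding quantization, binary quantization: each value is reduced to a single bit (positive or not) and vectors are compared by Hamming distance. The zero threshold and the function names are assumptions for illustration; this has not been evaluated on this model.

```python
def binarize(embedding):
    """Pack the sign of each value into an int: bit i is 1 when value i > 0."""
    bits = 0
    for i, value in enumerate(embedding):
        if value > 0:
            bits |= 1 << i
    return bits

def hamming_distance(a, b):
    """Number of bit positions where two packed vectors differ."""
    return bin(a ^ b).count("1")

# Toy 4-dimensional vectors standing in for real embeddings (assumption).
a = binarize([0.3, -0.1, 0.7, -0.9])  # bits 0 and 2 set -> 5
b = binarize([0.2, 0.4, 0.6, -0.5])   # bits 0, 1 and 2 set -> 7
print(hamming_distance(a, b))  # 1
```

Each dimension then costs one bit instead of a 32-bit float, at the price of some ranking accuracy that would need to be measured.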

If you want to collaborate on future projects / have resources to train longer on more embeddings, please get in touch.