---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
license: cc
base_model: Rostlab/prot_bert_bfd
---
# Protein Matryoshka Embeddings
The model generates embeddings for input proteins. It was trained using [Matryoshka loss](https://huggingface.co/blog/matryoshka),
so shortened (truncated) embeddings can be used for faster search and other tasks.
Inputs use [IUPAC-IUB codes](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation), i.e. single letters A-Z for amino acids, separated by spaces. For example:
"M A R N W S F R V"
The base model was [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd).
A [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model was trained on the cosine similarity between protein embeddings
published by [UniProt](https://www.uniprot.org/help/downloads#embeddings).
For train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
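A minimal sketch for loading the pairs dataset with `datasets` (split names follow the description above; inspect a record to confirm the actual column names):
```python
from datasets import load_dataset

# Load the protein-pair dataset; print one record to see the actual fields.
pairs = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot")
print(pairs)              # available splits and sizes
print(pairs["train"][0])  # one pair with its similarity/distance value
```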
## Usage
Install these dependencies:
```
pip install -U sentence-transformers datasets
```
Generating embeddings:
```python
from sentence_transformers import SentenceTransformer

# Space-separated amino acid sequences (truncated here for brevity)
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

# Load the model and encode the sequences into embeddings
model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
embeddings = model.encode(sequences)
print(embeddings)
```
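Because of the Matryoshka training objective, you can keep only a prefix of each embedding and re-normalize it for faster search. A minimal sketch, assuming a 128-dimension prefix (the size used in the validation notebook below):
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

full = model.encode(sequences)                                # full-size embeddings
short = full[:, :128]                                         # keep the first 128 dimensions
short = short / np.linalg.norm(short, axis=1, keepdims=True)  # re-normalize after truncation

# Cosine similarity between the two shortened embeddings
print(float(short[0] @ short[1]))
```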
## Training + Code
CoLab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
Results on 1,000 protein pairs from the validation dataset, recorded during training (a sketch for reproducing these metrics follows the table):
|steps|cosine_pearson|cosine_spearman|
|-----|--------------|---------------|
|3000|0.8598688660086558|0.8666855900999677|
|6000|0.8692703523988448|0.8615673651584274|
|9000|0.8779733537629968|0.8754158959780602|
|12000|0.8877422045031667|0.8881492475969834|
|15000|0.9027359688395733|0.899106724739699|
|18000|0.9046675789738002|0.9044183600191271|
|21000|0.9165801536390973|0.9061381997421003|
|24000|0.9128046401341833|0.9076748537082228|
|27000|0.918547416546341|0.9127677526055185|
|30000|0.9239429677657788|0.9187051589781693|
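To compute these correlation metrics on your own pairs, one option is sentence-transformers' `EmbeddingSimilarityEvaluator`. A hedged sketch (the column names `protein1`, `protein2`, and `similarity` are hypothetical placeholders; check the dataset for the actual fields):
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')

# First 1,000 validation pairs; column names below are hypothetical placeholders.
pairs = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot", split="validation[:1000]")
evaluator = EmbeddingSimilarityEvaluator(
    pairs["protein1"], pairs["protein2"], pairs["similarity"], name="protein-pairs"
)
print(evaluator(model))  # reports cosine Pearson / Spearman correlations
```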
## Validation
Scatter plots comparing the full-size and 128-dimension embeddings against the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing
## Future
This page will be updated when I have examples of using the model on protein classification tasks.
I'm interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be even more efficient.
If you want to collaborate on future projects / have resources to train longer on more embeddings, please get in touch.