Token or Sentence Embeddings

#49
by rufimelo - opened

In the past few weeks, I've been messing around with SBERT models.

Is it possible to create embeddings using BLOOM to perform semantic search?

BigScience Workshop org

If you download the model and run it yourself, yes, but it's not provided through the model widget API.

BigScience Workshop org

Hey! I created an embedding model of BLOOM 1B3 here that may be of interest to you: https://huggingface.co/bigscience-catalogue-lm-data/sgpt-nli-bloom-1b3

If it is of interest, I can create a similar fine-tuned embedding model for this 176B model, but do note that the embeddings would be very expensive to compute. Further, storing them would require a lot of space if we don't add a linear layer to reduce their dimensionality.

Otherwise, you can of course load the model in HF with AutoModel and produce embeddings, but they will not perform well without fine-tuning like the one done for the model above.
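
For illustration, a minimal sketch of that AutoModel route, assuming a smaller BLOOM checkpoint and plain mean pooling (both are just choices for the example; as noted, quality will be limited without fine-tuning):

import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; the same pattern applies to the 176B model, just far more expensive.
checkpoint = "bigscience/bloom-1b3"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

sentences = ["How do I search documents by meaning?", "Semantic search retrieves texts by meaning."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state             # (batch, seq_len, dim)

# Mean pooling over non-padding tokens
mask = batch["attention_mask"].unsqueeze(-1).float()       # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)

# Cosine similarity between the two sentence embeddings
print(torch.nn.functional.cosine_similarity(embeddings[0:1], embeddings[1:2]))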

Thank you for the feedback!

rufimelo changed discussion status to closed
rufimelo changed discussion status to open

Versions:

- transformers: 4.20.1
- sentence-transformers: 2.2.2

I could not load the model using:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("bigscience-catalogue-lm-data/sgpt-nli-bloom-1b3")

but it ended up working when I created the SentenceTransformer model from the word embedding model:

from sentence_transformers import SentenceTransformer, models, evaluation
word_embedding_model = models.Transformer('bigscience-catalogue-lm-data/sgpt-nli-bloom-1b3', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model2 = SentenceTransformer(modules=[word_embedding_model, pooling_model])

I was trying to fine-tune it like I did previously with sentence-transformers, and it did not work.

# Load the dataset
from datasets import load_dataset
dataset = load_dataset("assin")

from sentence_transformers import SentenceTransformer, InputExample, losses, models, evaluation
from torch.utils.data import DataLoader
train_examples = [InputExample(texts=[texts['premise'], texts['hypothesis']], label=texts['relatedness_score']/5) for texts in dataset['train']]

# Define the train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model2)

# Tune the model
model2.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0.1*len(train_dataloader))

Output:
[error screenshot: 2.png]

Do I need a specific version? The first example should have loaded the model correctly, right?

Am I fine-tuning it the wrong way for this type of model?
Thanks in advance

BigScience Workshop org
edited Jul 20, 2022
  1. To load it, you need to either use this branch of sentence-transformers: https://github.com/UKPLab/sentence-transformers/pull/1613 or install sentence-transformers from this repo: https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco

  2. By loading it via one of the ways above, your problem will probably be solved. Note though that the model is already fine-tuned. If you want to fine-tune it further, you probably also need to use one of the two repos above, as the pooling used by this model is not implemented in sentence-transformers (a sketch of that pooling follows below). If you have a lot of data for your use case, it might make more sense to just fine-tune from scratch again.
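
For context, a rough sketch of the position-weighted mean pooling used by SGPT-style bi-encoders (later tokens get higher weights, which suits autoregressive models like BLOOM); the exact implementation in the repos above may differ:

import torch

def weighted_mean_pooling(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, dim), attention_mask: (batch, seq_len)
    positions = torch.arange(1, last_hidden_state.size(1) + 1,
                             device=last_hidden_state.device).float()
    weights = positions.unsqueeze(0) * attention_mask      # zero out padding tokens
    weights = weights / weights.sum(dim=1, keepdim=True)   # normalize per sentence
    return (last_hidden_state * weights.unsqueeze(-1)).sum(dim=1)

This would take the place of the plain mean pooling when embedding with an autoregressive model.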

Hope this helps 🤗

Thanks, I believe it works. I just need to have a GPU available to train and complete the fine-tuning.

I just have one other question.
With SBERT models, we can perform "domain adaptation" of the BERT model before creating an SBERT one.
It would allow our model to be more familiar with our specific context.

On my side, I only have access to raw texts.
I was wondering if I could do something similar for this model with relatively scarce computational resources.

Also, in order to fine-tune such a model, what GPU size do I need?

BigScience Workshop org

If you don't have aligned texts, i.e. just raw text, you probably want to fine-tune the BERT model, not the SentenceTransformer. If your texts are aligned, then for the model above you can probably fine-tune it with 1x A100 40GB or even 1x V100 32GB. If you have more GPUs, training will be faster via data parallelism. I'd recommend using GradCache to get a larger batch size, as implemented in this repo: https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco
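
To illustrate the raw-text route (continued masked-language-model pretraining of a BERT checkpoint as domain adaptation), here is a minimal sketch with the Hugging Face Trainer; the checkpoint name, file path and hyperparameters are placeholders:

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "bert-base-multilingual-cased"         # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Placeholder path to the raw domain texts, one example per line
raw = load_dataset("text", data_files={"train": "domain_texts.txt"})
tokenized = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-domain-adapted",
                         num_train_epochs=1, per_device_train_batch_size=16)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()

The resulting checkpoint could then serve as the base of a SentenceTransformer.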

Yes, it would be the BERT in that case.

Thanks!

BigScience Workshop org

Sure, if you want to fine-tune BERT you can do that easily in sentence-transformers with mean pooling. For fine-tuning BLOOM / other auto-regressive models for embeddings, it may be better to use the repo I sent above with weighted mean pooling (https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco).
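
For completeness, a sketch of that BERT-plus-mean-pooling setup in sentence-transformers, reusing the ASSIN example from earlier in the thread (the base checkpoint name is a placeholder, e.g. the output of a domain-adaptation step):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, models, losses, InputExample
from torch.utils.data import DataLoader

base = "bert-domain-adapted"                  # placeholder: any BERT-style checkpoint
word_emb = models.Transformer(base, max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())  # mean pooling is the default
sbert = SentenceTransformer(modules=[word_emb, pooling])

dataset = load_dataset("assin")
train_examples = [InputExample(texts=[ex["premise"], ex["hypothesis"]],
                               label=ex["relatedness_score"] / 5)
                  for ex in dataset["train"]]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(sbert)

sbert.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=int(0.1 * len(train_dataloader)))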

Closing this issue, feel free to reopen if you still have questions 👻

Muennighoff changed discussion status to closed
