BridgeTower from Hugging Face vs. BridgeTower from Prediction Guard
I am a beginner with Hugging Face and I need some help regarding BridgeTower.
I am taking a course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on Deeplearning.AI about creating a multimodal RAG. Lesson 2 covers creating joint embeddings of images and text using BridgeTower.
In the example code, it uses PredictionGuardClient() to create BridgeTower embeddings:
# helper function to compute the joint embedding of a prompt and a base64-encoded image through PredictionGuard
def bt_embedding_from_prediction_guard(prompt, base64_image):
    # get PredictionGuard client
    client = _getPredictionGuardClient()
    message = {"text": prompt}
    if base64_image is not None and base64_image != "":
        if not isBase64(base64_image):
            raise TypeError("image input must be in base64 encoding!")
        message['image'] = base64_image
    response = client.embeddings.create(
        model="bridgetower-large-itm-mlm-itc",
        input=[message]
    )
    return response['data'][0]['embedding']
However, the above requires a Prediction Guard API key, which is not easy to obtain. Many other learners have run into the same issue.
As a workaround, I used the Hugging Face transformers classes BridgeTowerProcessor and BridgeTowerModel and refactored the function as below:
from transformers import BridgeTowerProcessor, BridgeTowerModel
import torch

def bt_embedding_from_prediction_guard(prompt, base64_image):
    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    inputs = {"text": prompt}
    if base64_image:
        inputs["images"] = base64_image
    # Preprocess the inputs
    processed_inputs = processor(text=[inputs['text']], images=[inputs.get('images', None)], return_tensors="pt")
    # Generate the embedding
    with torch.no_grad():
        outputs = model(**processed_inputs)
    # Extract the embeddings (you can change which embedding layer to use depending on your task)
    embeddings = outputs.pooler_output
    return embeddings.tolist()  # Return the embeddings as a list for easier use
The code runs and produces embeddings, but I get a 2048-dimensional embedding compared with the 512-dimensional embedding from the sample code using Prediction Guard.
But when I calculate the cosine similarities between embeddings of different pictures, the values from the Hugging Face BridgeTower embeddings are very different from the ones from Prediction Guard.
For example:
ex1_embeded (picture of a motorcycle)
ex2_embeded (picture of a motorcycle)
ex3_embeded (picture of a cat)
Results calculated by using Hugging Face BridgeTower (using my code above):
Cosine similarity between ex1_embeded and ex2_embeded is: 0.9268679323546363
Cosine similarity between ex1_embeded and ex3_embeded is: 0.8940821384304778
Results calculated by using Prediction Guard BridgeTower (using the sample code above):
Cosine similarity between ex1_embeded and ex2_embeded is: 0.48566270290489155
Cosine similarity between ex1_embeded and ex3_embeded is: 0.17133985252863604
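(For context, these cosine similarities use the standard formula; a minimal numpy sketch, assuming each embedding is a flat Python list of floats, would be:)
import numpy as np

def cosine_similarity(a, b):
    # dot product of the two vectors divided by the product of their L2 norms
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. cosine_similarity(ex1_embeded, ex3_embeded)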
I have encountered the same problem. Has anyone managed to solve it?
Hi, thanks for following the course with BridgeTower.
For comparing embeddings, you should use the model that includes the contrastive head, i.e. BridgeTowerForContrastiveLearning.
Here is an example:
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

inputs = processor(images, texts, padding=True, return_tensors="pt")
outputs = model(**inputs)

cross_modal_embeddings = outputs.cross_embeds
# text_embeddings = outputs.text_embeds
# image_embeddings = outputs.image_embeds
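For your motorcycle/cat example you could then compare the rows of cross_embeds directly; a rough sketch (image1, image2, and caption are placeholders here, not the course variables):
import torch

# image1, image2 are PIL images and caption is a shared text prompt (placeholders)
inputs = processor([image1, image2], [caption, caption], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

emb = outputs.cross_embeds  # one row per image-text pair, from the contrastive head
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)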
Thank you, Shaoyent!
Hi shaoyent, when I'm trying to do inference on the example image and text pair, it gives this error:
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Has anyone solved this problem?
"Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."
Hi, I am also taking a course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on Deeplearning.AI about creating a multimodal RAG. I am stuck on the retrieval part. Which model is used for text-only embeddings?
I'm stuck with the same problem. Have you found a solution? Does anyone else have any suggestions?
Hi @Parth376, this should not be an issue for inference.
Hi @Heiner66, @Parth376,
For text-only embeddings you can reference this code:
inputs = processor(images, texts, padding=True, return_tensors="pt")
outputs = model(**inputs)
cross_modal_embeddings = outputs.cross_embeds
text_embeddings = outputs.text_embeds
image_embeddings = outputs.image_embeds
text_embeddings are independent of images, so you can pass a dummy image to get text-only embeddings.
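For example, a minimal sketch using a blank PIL image as the dummy (not the course code; the query text below is just an example):
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

texts = ["a motorcycle parked on the street"]        # example query text
dummy_image = Image.new("RGB", (224, 224), color=0)  # blank placeholder image

inputs = processor([dummy_image], texts, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

text_embeddings = outputs.text_embeds  # independent of the dummy image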
I followed your guidance and it works!!! This issue had blocked me for days. Thank you so much, @shaoyent!
I edited it until it became like this. Did I edit it correctly?
import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

def bt_embedding_from_prediction_guard(prompt, base64_image):
    if base64_image:
        # isBase64 is the same helper used in the course's original code
        if not isBase64(base64_image):
            raise TypeError("Image input must be in base64 encoding!")
        try:
            image_data = base64.b64decode(base64_image)
            image = Image.open(BytesIO(image_data)).convert("RGB")
        except Exception as e:
            raise ValueError("Invalid image data!") from e
    else:
        image = None

    texts = [prompt]
    images = [image] if image else None

    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    inputs = processor(images=images, text=texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    cross_modal_embeddings = outputs.cross_embeds
    return cross_modal_embeddings.squeeze().tolist()
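For reference, this is roughly how it could be sanity-checked (the image path is just a placeholder, and the function above must already be defined):
import base64

with open("motorcycle.jpg", "rb") as f:  # placeholder path
    b64_image = base64.b64encode(f.read()).decode("utf-8")

embedding = bt_embedding_from_prediction_guard("a motorcycle", b64_image)
print(len(embedding))  # the contrastive head should give a 512-dim vector, matching the Prediction Guard output discussed above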