BridgeTower from Hugging Face vs. BridgeTower from Prediction Guard
I am a beginner with Hugging Face and I need some help regarding BridgeTower.
I am taking a course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on Deeplearning.AI about creating a multimodal RAG. Lesson 2 covers creating joint embeddings of images and text using BridgeTower.
In the example code, it uses PredictionGuardClient() to create BridgeTower embeddings:
# helper function to compute the joint embedding of a prompt and a base64-encoded image through PredictionGuard
def bt_embedding_from_prediction_guard(prompt, base64_image):
    # get PredictionGuard client
    client = _getPredictionGuardClient()
    message = {"text": prompt}
    if base64_image is not None and base64_image != "":
        if not isBase64(base64_image):
            raise TypeError("image input must be in base64 encoding!")
        message['image'] = base64_image
    response = client.embeddings.create(
        model="bridgetower-large-itm-mlm-itc",
        input=[message]
    )
    return response['data'][0]['embedding']
However, the above requires a Prediction Guard API key, which is not easy to obtain. Many other learners have run into the same issue.
As a workaround, I used the Hugging Face transformers classes BridgeTowerProcessor and BridgeTowerModel and refactored the function as below:
from transformers import BridgeTowerProcessor, BridgeTowerModel
import torch

def bt_embedding_from_prediction_guard(prompt, base64_image):
    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    inputs = {"text": prompt}
    if base64_image:
        inputs["images"] = base64_image
    # Preprocess the inputs
    processed_inputs = processor(text=[inputs['text']], images=[inputs.get('images', None)], return_tensors="pt")
    # Generate the embedding
    with torch.no_grad():
        outputs = model(**processed_inputs)
    # Extract the embeddings (you can change which embedding layer to use depending on your task)
    embeddings = outputs.pooler_output
    return embeddings.tolist()  # Return the embeddings as a list for easier use
The code runs and produces embeddings, but I get a 2048-dimensional embedding compared with the 512-dimensional embedding from the sample code using Prediction Guard.
But when I calculate the cosine similarities between embeddings of different pictures, the values from the Hugging Face BridgeTower embeddings are very different from the ones from Prediction Guard.
For example:
ex1_embeded (picture of a motorcycle)
ex2_embeded (picture of a motorcycle)
ex3_embeded (picture of a cat)
Results calculated by using Hugging Face BridgeTower (using my code above):
Cosine similarity between ex1_embeded and ex2_embeded is: 0.9268679323546363
Cosine similarity between ex1_embeded and ex3_embeded is: 0.8940821384304778
Results calculated by using Prediction Guard BridgeTower (using the sample code above):
Cosine similarity between ex1_embeded and ex2_embeded is: 0.48566270290489155
Cosine similarity between ex1_embeded and ex3_embeded is: 0.17133985252863604
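(For context, these cosine similarities use the standard formula; a minimal numpy sketch, assuming each embedding is a flat Python list of floats, would be:)
import numpy as np

def cosine_similarity(a, b):
    # dot product of the two vectors divided by the product of their L2 norms
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. cosine_similarity(ex1_embeded, ex3_embeded)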
I have encountered the same problem. Has anyone managed to solve it?
Hi, thanks for following the course with BridgeTower.
For comparing embeddings, you should use the model that includes the contrastive head, i.e. BridgeTowerForContrastiveLearning.
Here is an example:
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

inputs = processor(images, texts, padding=True, return_tensors="pt")
outputs = model(**inputs)

cross_modal_embeddings = outputs.cross_embeds
# text_embeddings = outputs.text_embeds
# image_embeddings = outputs.image_embeds
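For your motorcycle/cat example you could then compare the rows of cross_embeds directly; a rough sketch (image1, image2, and caption are placeholders here, not the course variables):
import torch

# image1, image2 are PIL images and caption is a shared text prompt (placeholders)
inputs = processor([image1, image2], [caption, caption], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

emb = outputs.cross_embeds  # one row per image-text pair, from the contrastive head
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)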
Thank you, Shaoyent!
Hi shaoyent, when I'm trying to do inference on the example image and text pair, it gives this error:
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Has anyone solved this problem?
"Some weights of BridgeTowerForContrastiveLearning were not initialized from the model checkpoint at BridgeTower/bridgetower-large-itm-mlm-itc and are newly initialized: ['logit_scale']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference."
Hi, I am also taking a course (https://www.deeplearning.ai/short-courses/multimodal-rag-chat-with-videos/) on Deeplearning.AI about creating a multimodal RAG. I am stuck on the retrieval part. Which model is used for text-only embeddings?
I'm stuck with the same problem. Have you found a solution? Does anyone else have any suggestions?
Hi @Parth376, this should not be an issue for inference.
Hi @Heiner66, @Parth376,
For text-only embeddings you can reference this code:
inputs = processor(images, texts, padding=True, return_tensors="pt")
outputs = model(**inputs)
cross_modal_embeddings = outputs.cross_embeds
text_embeddings = outputs.text_embeds
image_embeddings = outputs.image_embeds
text_embeddings are independent of images, so you can pass a dummy image to get text-only embeddings.
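For example, a minimal sketch using a blank PIL image as the dummy (not the course code; the query text below is just an example):
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

texts = ["a motorcycle parked on the street"]        # example query text
dummy_image = Image.new("RGB", (224, 224), color=0)  # blank placeholder image

inputs = processor([dummy_image], texts, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

text_embeddings = outputs.text_embeds  # independent of the dummy image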
I followed your guidance and it works!!! This issue had blocked me for days. Thank you so much, @shaoyent!
I edited it until it became like this. Did I edit it correctly?
import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

def bt_embedding_from_prediction_guard(prompt, base64_image):
    if base64_image:
        # isBase64 is the same helper used in the course's original code
        if not isBase64(base64_image):
            raise TypeError("Image input must be in base64 encoding!")
        try:
            image_data = base64.b64decode(base64_image)
            image = Image.open(BytesIO(image_data)).convert("RGB")
        except Exception as e:
            raise ValueError("Invalid image data!") from e
    else:
        image = None

    texts = [prompt]
    images = [image] if image else None

    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    inputs = processor(images=images, text=texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    cross_modal_embeddings = outputs.cross_embeds
    return cross_modal_embeddings.squeeze().tolist()
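For reference, this is roughly how it could be sanity-checked (the image path is just a placeholder, and the function above must already be defined):
import base64

with open("motorcycle.jpg", "rb") as f:  # placeholder path
    b64_image = base64.b64encode(f.read()).decode("utf-8")

embedding = bt_embedding_from_prediction_guard("a motorcycle", b64_image)
print(len(embedding))  # the contrastive head should give a 512-dim vector, matching the Prediction Guard output discussed above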