Shape mismatch error during inference with fine-tuned model

#7 opened by mdmev

Hello,

I recently fine-tuned this model for OCR tasks using the provided Colab notebook. The model is trained to process an input image without any additional text. However, when running inference, I encounter the following error:

shape mismatch: value tensor of shape [320, 4096] cannot be broadcast to indexing result of shape [0, 4096]
This error occurs specifically when calling the `generate` method. 

Here is the code snippet where the issue arises:

image = example["image"]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
        ]
    }
]
images = []
images.append(image)
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text], 
    images=images,
    return_tensors="pt",
    padding=True,
).to(device)

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

It appears that the problem originates in the modeling_idefics2.py file, where the inputs_merger method is called twice within a single generate call. I added some logging to trace the issue and found that new_inputs_embeds[special_image_token_mask] becomes an empty tensor on the second call to inputs_merger:

inputs_merger()
torch.Size([1, 335])
torch.Size([1, 335, 4096])
torch.Size([320, 4096])
torch.Size([320, 4096])
inputs_merger()
torch.Size([1, 1])
torch.Size([1, 1, 4096])
torch.Size([0, 4096])
torch.Size([320, 4096])
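
For reference, this is roughly how I added the logging. The method body below is paraphrased from the transformers source as I understand it, so treat it as an illustrative sketch rather than an exact copy of modeling_idefics2.py:

# Sketch of the instrumented Idefics2Model.inputs_merger (paraphrased)
def inputs_merger(self, input_ids, inputs_embeds, image_hidden_states):
    special_image_token_mask = input_ids == self.image_token_id
    new_inputs_embeds = inputs_embeds.clone()
    reshaped_image_hidden_states = image_hidden_states.view(-1, image_hidden_states.shape[-1])
    print("inputs_merger()")
    print(input_ids.shape)                                    # [1, 335] on the first call, [1, 1] on the second
    print(inputs_embeds.shape)                                # [1, 335, 4096], then [1, 1, 4096]
    print(new_inputs_embeds[special_image_token_mask].shape)  # [320, 4096], then [0, 4096]
    print(reshaped_image_hidden_states.shape)                 # [320, 4096] both times
    # The assignment below is where the broadcast fails on the second call:
    # 320 image embeddings cannot be written into 0 masked positions.
    new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
    return new_inputs_embeds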

I would greatly appreciate any assistance in resolving this issue.
Thank you very much!

Hi @mdmev,
can you say more about the shape of text and images (the variables, not the arguments)?
images should be a list of images, and text should be a string.

Sure: images is indeed a list of PIL image objects, and text is a string:

Type of text: <class 'str'>
Text: User:<image><end_of_utterance>
Assistant:
Type of images: <class 'list'>
[<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1145x269 at 0x7F9BAA5F0E>]
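
(For completeness, that output came from quick checks along these lines, added just before building the processor inputs:)

# type checks for the inputs passed to the processor
print(f"Type of text: {type(text)}")
print(f"Text: {text}")
print(f"Type of images: {type(images)}")
print(images)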

Hmm, OK, that looks about right.
Is there any chance you can share samples (for instance, by making a dataset public)? At this point, the best I can do to help is reproduce the error.

I have just uploaded a small sample of the dataset I used for fine-tuning to a public repository on my profile.

Thank you very much.

HuggingFaceM4 org

Hi! I'm trying to reproduce your error but I'm not able to.

  1. Your dataset contains both 'image_paths' and 'texts'.
  2. The images themselves are not in the dataset, so I can't check whether there is an issue with them.
  3. The dataset contains texts, even though you explain in the question that you are training the model to work without any text, so I find it odd that they are there.

Given these constraints, I tried to reproduce your bug to the best of my ability:

from transformers import AutoProcessor, AutoModelForPreTraining
from PIL import Image
import requests 
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-chatty")
model = AutoModelForPreTraining.from_pretrained("HuggingFaceM4/idefics2-8b-chatty")
model.to(device)

# Use default image since none were given
url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url_1, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
        ]
    }
]
images = []
images.append(image)
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text], 
    images=images,
    return_tensors="pt",
    padding=True,
).to(device)

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts[0])

This runs correctly and generates the following output:

User: 
Assistant: The image depicts a peaceful scene of two cats, one striped and one tabby, sleeping on a pink blanket. The striped cat is lying on its side, while the tabby cat is lying on its back. Both cats are resting comfortably, with the striped cat positioned closer to the remote control. The pink blanket they are sleeping on adds a warm and cozy atmosphere to the scene. The image captures a moment of tranquility and relaxation, as the cats enjoy their rest undisturbed.

Could you provide us with a code snippet that we can run that reproduces your bug?

I have just created a GitHub repository with the detailed code and a description of the problem I am having. I hope you can take a look at it. Thank you in advance for your help!
