RuntimeError: shape mismatch:

#39
by ROSCOSMOS - opened

\blip_2\modeling_blip_2.py", line 2316, in generate
inputs_embeds[special_image_mask] = language_model_inputs.flatten()
RuntimeError: shape mismatch: value tensor of shape [81920] cannot be broadcast to indexing result of shape [0]

I am now seeing this error when captioning and have no idea how to resolve it. The model had been working previously (a couple of weeks ago), but coming back to it now for captioning I get this. Has anything changed?

I have the same issue too.

same for me

same for me

Not the same error, but I recently started getting data-mismatch errors as well, out of the blue:
fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum ModelWrapper at line 250373 column 3

Hey everyone! The shape mismatch error will be solved by PR https://github.com/huggingface/transformers/pull/34876. In the meanwhile, feel free to indicate the revision when loading the model/processor with from_pretrained, so it doesn't pull the latest commit from the hub.

For the tokenizer error: the latest transformers versions now use tokenizers==0.20 by default, so you would need to upgrade your transformers version. The tokenizer file on the hub was saved with the latest transformers and cannot be loaded with old versions; the new format was made for forward compatibility.
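For example, something like this should pull in a compatible tokenizers build (a generic pip upgrade, not thread-specific; adjust to your own environment):

pip install --upgrade transformers tokenizers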

Same error for me, really looking forward to the solution.
Thanks, devs.

The PR is merged into the main branch. For anyone who stumbles upon the same error, you should be able to resolve it by updating transformers to v4.47 or higher. The release for v4.47 is planned around today.

Isn't the latest transformers release 4.46.3, according to https://pypi.org/project/transformers/#history ?

Update: the release on PyPI got delayed by a week. So for now the workaround is to use the commit hash from just before the model repo was updated:

.from_pretrained("Salesforce/blip2-opt-2.7b", revision="51572668da0eb669e01a189dc22abe6088589a24")
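For example, applying that revision to both the processor and the model (a minimal sketch; add whatever extra kwargs you normally pass, e.g. torch_dtype or device_map):

from transformers import Blip2Processor, Blip2ForConditionalGeneration

# pin to the commit just before the model repo update
revision = "51572668da0eb669e01a189dc22abe6088589a24"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b", revision=revision)
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", revision=revision)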

I have upgraded to the newest transformers, v4.47.dev, using

pip install git+https://github.com/huggingface/transformers

The 'shape mismatch' problem seems to be solved. However, some images now get a null caption from the image-captioning model. In particular, the official half-precision (float16) example on GitHub worked well, while the int8 path failed and generated an empty answer " ".

@gongzx do you mean only the quantized model is failing to generate an answer, while it worked in previous versions? Can you share exactly how you generate?

Same here, I can't get it to work anymore.
transformers: 4.47.0

import io
import logging

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

model.to(device)

def generate_caption(image_data, question=None):
    try:
        image = Image.open(io.BytesIO(image_data)).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)

        if question:
            # Add the question to the prompt
            inputs['input_ids'] = processor.tokenizer.encode(question, return_tensors="pt").to(device)

        out = model.generate(**inputs, max_new_tokens=50)
        caption = processor.decode(out[0], skip_special_tokens=True)

        return caption
    except Exception as e:
        logging.error(f"Error generating caption: {e}")
        return None


Debug prints:
Inputs with question: torch.Size([1, 15])
Input tensor pixel_values shape: torch.Size([1, 3, 224, 224])
Input tensor input_ids shape: torch.Size([1, 15])
Final inputs structure: {'pixel_values': tensor([[[[ #removed to reduce clutter ]]]],
       device='cuda:0'), 'input_ids': tensor([[    2, 45641,    35,   653,    16,    42,  2909,    11,    42,  2274,  116,  31652,    35]], device='cuda:0')}
Generating output from model...
ERROR:root:Error during model.generate: shape mismatch: value tensor of shape [81920] cannot be broadcast to indexing result of shape [0]
Error during model.generate: shape mismatch: value tensor of shape [81920] cannot be broadcast to indexing result of shape [0]
ERROR:root:Error during processing: expected string or bytes-like object, got 'NoneType'

@gongzx do you mean only the quantized model is failing to generate an answer, while it worked in previous versions? Can you share exactly how you generate?

Yes, I tested the same images and only switched to the newest version of transformers. It worked in previous versions, but now it fails to generate an answer. In particular, I found that the official int8 example fails too. The half-precision (float16) method from the official GitHub example worked well, but my own test image still failed with it (I don't know whether that's because of its lower resolution, 224*224; it worked well just a few weeks ago). My test image is at https://drive.google.com/file/d/1-dAM0D_I-roWhsE_ih245CjDaxHWQkQX/view?usp=drive_link.

The failed official example is below:

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b", cache_dir='/fs/scratch/PAS2490/blip2')
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", cache_dir='/fs/scratch/PAS2490/blip2', load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Output:

The load_in_4bit and load_in_8bit arguments are deprecated and will be removed in the future versions. Please, pass a BitsAndBytesConfig object in quantization_config argument instead.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:17<00:00, 8.62s/it]
how many dogs are in the picture?
(nothing generated)
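(The deprecation warning above suggests passing a BitsAndBytesConfig via quantization_config instead of load_in_8bit; that would presumably look roughly like this, with the same model id and cache_dir as in the example:)

from transformers import BitsAndBytesConfig, Blip2ForConditionalGeneration

# rough sketch of the non-deprecated 8-bit loading path the warning points to
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    cache_dir='/fs/scratch/PAS2490/blip2',
    quantization_config=quantization_config,
    device_map="auto",
)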

My failed test example is below:

from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b", cache_dir='/fs/scratch/PAS2490/blip2')
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", cache_dir='/fs/scratch/PAS2490/blip2', torch_dtype=torch.float16
)
model.to(device)

x = Image.open('./0.jpg')  # This is my test image from my local path; you can download it from https://drive.google.com/file/d/1-dAM0D_I-roWhsE_ih245CjDaxHWQkQX/view?usp=drive_link
inputs = processor(images=x, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)

Output:
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:02<00:00, 1.06s/it]
(nothing generated)

@gongzx OK, I tried to run the given demo code with an older version of transformers and the previous revision from the hub, and it didn't work either. So I guess the problem is not with the new changes but with the demo itself and the prompt. Formatting the prompt works better, but the answer is wrong.

Given that our tests passed and the generations for internal example prompts/images matched 100%, I don't think this is caused by the update. Let me know if you have a reproducer that works with an older revision/version but not with the new one.

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True,  device_map="cuda:0")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Question: how many dogs are in this picture? Answer:"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda:0", torch.float16)

out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))

@TepamurLee Are you installing transformers from main? The error was solved in the main branch, but unfortunately the release was delayed for a week or so. Until the release, please install with !pip install --upgrade git+https://github.com/huggingface/transformers.git

@RaushanTurganbay Thank you for your reply. What about the captioning of my test image? I've tested it using the second code snippet above, and it still fails to generate an answer. Should I add any additional prompt? Today I installed LAVIS to run BLIP-2's captioning task, and it successfully generated the caption: "two pink flowers with a bee sitting on one of them".

@gongzx so you mean the original implementation doesn't match transformers, rather than the old transformers version vs. the new one?

Did you try to generate with .from_pretrained("Salesforce/blip2-opt-2.7b", revision="51572668da0eb669e01a189dc22abe6088589a24"), where the weights/configs are not yet updated? If the generation still shows nothing in that case, the reason is in the conversion from LAVIS to transformers, and I'd like to ask you to open an issue in transformers.

@RaushanTurganbay I think you are right. I used revision="51572668da0eb669e01a189dc22abe6088589a24" to get correct captions. Thank you very much for your help!

@RaushanTurganbay revision="51572668da0eb669e01a189dc22abe6088589a24" works fine again, thanks!! I probably mixed up the transformers versions during earlier troubleshooting attempts.

How do I use the previous version of blip2-opt-2.7b?

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b", revision="51572668da0eb669e01a189dc22abe6088589a24")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", revision="51572668da0eb669e01a189dc22abe6088589a24")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

I got this log

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  4.61it/s]
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.

tokenizers 0.20.1
torch 2.4.1
torchvision 0.19.1
transformers 4.45.2

How do I fix this issue?

After updating transformers:

import torch
import requests
from PIL import Image
import transformers
from transformers import Blip2Processor, Blip2ForConditionalGeneration

print(f"transformers.__version__: {transformers.__version__}")

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", device_map="cuda:0")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Question: how many dogs are in this picture? Answer:"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda:0", torch.float16)

out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
transformers.__version__: 4.47.0.dev0
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:05<00:00,  2.80s/it]
Question: how many dogs are in this picture? Answer: none

UPDATE: we released v4.47 and the fix is already available in that release. Feel free to update transformers and load the model without a revision.
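For example (a plain pip upgrade; adjust to your environment or package manager):

pip install --upgrade transformers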
