llava-hf/llava-v1.6-mistral-7b-hf · Batch processing expects images of same size

Apr 9, 2024

edited Apr 9, 2024

When processing a batch of images with the same prompt, the stacking of image features throws an error if the images are of different sizes:

transformers/models/llava_next/modeling_llava_next.py", line 553, in forward
image_features = torch.stack(new_image_features, dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [2732, 4096] at entry 0 and [2160, 4096] at entry 1

Just want to know if this is intentional, because llava-v1.5 has no such issues and accepts images of arbitrary sizes in a batch.

Digging a bit further, the preprocessing works as expected. It takes 5 crops from each image, choosing among the following candidate resolutions:
[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008] --> line 185 in image_processing_llava_next.py

Each crop is then of size (3, 336, 336) after preprocessing.

The issue comes from this line in modeling_llava_next.py:
image_feature = unpad_image(image_feature, image_sizes[image_idx])
It removes the padding from the patch features based on the original size of the image, so the resulting shape depends on each image's aspect ratio.

To see exactly what happens, compare the outputs below.

With images of the same size (resized to (672, 672) before being passed to the processor), the code prints:

image_feature = image_feature.flatten(1, 2).flatten(2, 3)
print(image_feature.shape)    ## ---> torch.Size([4096, 48, 48])
print("Image sizes", image_sizes[image_idx])    ## ---> Image sizes tensor([672, 672], device='cuda:0')
image_feature = unpad_image(image_feature, image_sizes[image_idx])
print(image_feature.shape)    ## ---> torch.Size([4096, 48, 48])

With images of different sizes, the outputs are:

torch.Size([4096, 48, 48])
Image sizes tensor([585, 640], device='cuda:0')
torch.Size([4096, 44, 48])

torch.Size([4096, 48, 48])
Image sizes tensor([640, 422], device='cuda:0')
torch.Size([4096, 48, 32])

The unpadded features therefore have a different shape for each image, which produces the stacking error above. It looks like the only workaround is to resize the images in the batch to a common size before passing them to the processor.
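To make the arithmetic concrete, here is a minimal pure-Python sketch that reproduces the shapes printed above and the token counts in the traceback. It is not the library's code. The helper names are hypothetical, and it assumes a 48x48 feature grid, 576 features for the base crop (24 x 24), and one extra "newline" feature per grid row. Those assumptions are chosen because they match the numbers reported in this thread.

```python
def unpadded_grid(original_size, grid=(48, 48)):
    """Return (height, width) of the feature grid after padding is cropped out.

    original_size is (height, width) in pixels, in the same order as image_sizes.
    Hypothetical helper: it mirrors the shapes reported in this thread only.
    """
    orig_h, orig_w = original_size
    cur_h, cur_w = grid
    if orig_w * cur_h > cur_w * orig_h:
        # Image is wider than the grid: padding rows sit above and below.
        new_h = (orig_h * cur_w) // orig_w
        pad = (cur_h - new_h) // 2
        return cur_h - 2 * pad, cur_w
    # Image is taller than (or matches) the grid: padding columns sit left and right.
    new_w = (orig_w * cur_h) // orig_h
    pad = (cur_w - new_w) // 2
    return cur_h, cur_w - 2 * pad


def num_image_tokens(original_size, base_tokens=576, grid=(48, 48)):
    """Assumed count: base crop features + unpadded grid + one newline per row."""
    h, w = unpadded_grid(original_size, grid)
    return base_tokens + h * (w + 1)


for size in [(672, 672), (585, 640), (640, 422)]:
    print(size, unpadded_grid(size), num_image_tokens(size))
# (672, 672) (48, 48) 2928
# (585, 640) (44, 48) 2732
# (640, 422) (48, 32) 2160
```

Under these assumptions, the two differently sized images yield 2732 and 2160 features, which are exactly the two sizes named in the RuntimeError.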

gullalc

Apr 9, 2024

Padding the image features with zeros could work, but I'm not sure that would give the intended outputs.
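As a rough illustration of the zero-padding idea, the sketch below pads variable-length feature tensors to a common length so they can be stacked, and also returns a mask marking which rows are real. The function name pad_and_stack is hypothetical and this is not what the library does. It only shows that stacking becomes possible. Whether the model produces sensible outputs depends on the padded rows being ignored downstream, which this sketch does not address.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_and_stack(features):
    """Zero-pad a list of [num_tokens_i, hidden] tensors to a common length."""
    lengths = torch.tensor([f.shape[0] for f in features])
    # pad_sequence appends zero rows so every entry has max(lengths) rows.
    stacked = pad_sequence(features, batch_first=True, padding_value=0.0)
    # mask[i, j] is True where row j of sample i holds a real feature.
    mask = torch.arange(stacked.shape[1])[None, :] < lengths[:, None]
    return stacked, mask


# A small hidden size keeps the example light; the tensors in the thread are 4096 wide.
features = [torch.randn(2732, 8), torch.randn(2160, 8)]
stacked, mask = pad_and_stack(features)
print(stacked.shape)     # torch.Size([2, 2732, 8])
print(mask.sum(dim=1))   # tensor([2732, 2160])
```

Keeping the mask alongside the padded tensor is the important design choice here: without it, later code cannot distinguish zero padding from real image features.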

nielsr

Llava Hugging Face org Apr 10, 2024

This feature is planned here: https://github.com/huggingface/transformers/pull/29850

nielsr

Llava Hugging Face org May 21, 2024

Hi folks, happy to share batched generation is now available :)

See: https://x.com/NielsRogge/status/1792952716408369203

nielsr changed discussion status to closed May 21, 2024