Batch processing expects images of same size

#10
by gullalc - opened

When processing a batch of images with the same prompt, stacking the image features throws an error if the images have different sizes:

transformers/models/llava_next/modeling_llava_next.py", line 553, in forward
image_features = torch.stack(new_image_features, dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [2732, 4096] at entry 0 and [2160, 4096] at entry 1

Is this intentional? llava-v1.5 has no such issue and accepts images of arbitrary sizes in a batch.

Digging a bit further, the preprocessing works as expected: it takes 5 crops from each image, using the following candidate resolutions:
[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008] --> line 185 in image_processing_llava_next.py

Each crop is then of size (3, 336, 336) after preprocessing.
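
As a rough sketch of how one of those resolutions might get picked, the snippet below paraphrases my understanding of the select_best_resolution helper in transformers. The names pick_resolution and crop_count are made up for illustration, and treating each candidate as (height, width) is an assumption. The count of 5 crops comes out as four 336x336 tiles plus one downscaled copy of the whole image.

```python
# Candidate resolutions from the post, assumed to be (height, width).
CANDIDATES = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]
PATCH = 336  # side length of each crop fed to the vision tower


def pick_resolution(original_hw, candidates=CANDIDATES):
    """Pick the candidate that keeps the most image pixels with the least padding."""
    orig_h, orig_w = original_hw
    best, best_eff, best_waste = None, 0, float("inf")
    for h, w in candidates:
        # Scale the image to fit inside the candidate without changing its aspect ratio.
        scale = min(w / orig_w, h / orig_h)
        down_w, down_h = int(orig_w * scale), int(orig_h * scale)
        # Pixels that carry real image content, capped at the original pixel count.
        eff = min(down_w * down_h, orig_w * orig_h)
        # Pixels of the candidate that would be padding.
        waste = h * w - eff
        if eff > best_eff or (eff == best_eff and waste < best_waste):
            best, best_eff, best_waste = (h, w), eff, waste
    return best


def crop_count(original_hw):
    """Number of 336x336 crops: one per grid tile plus one for the resized full image."""
    h, w = pick_resolution(original_hw)
    return 1 + (h // PATCH) * (w // PATCH)


if __name__ == "__main__":
    for size in [(672, 672), (585, 640), (640, 422)]:
        print(size, pick_resolution(size), crop_count(size))
```

Under these assumptions, all three image sizes that appear in this thread land on the (672, 672) grid, which is consistent with the [4096, 48, 48] feature maps reported further down.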

The issue comes from this line in modeling_llava_next.py:
image_feature = unpad_image(image_feature, image_sizes[image_idx])
It strips the padded region from the patch features based on the original size of the image, so the resulting shape depends on each image's aspect ratio.

To check what exactly happens, see the output below.

With images resized to the same size (672, 672) before they are passed to the processor, the code prints the values shown in the comments:

image_feature = image_feature.flatten(1, 2).flatten(2, 3)
print(image_feature.shape)    ## ---> torch.Size([4096, 48, 48])
print("Image sizes", image_sizes[image_idx])    ## ---> Image sizes tensor([672, 672], device='cuda:0')
image_feature = unpad_image(image_feature, image_sizes[image_idx])
print(image_feature.shape)    ## ---> torch.Size([4096, 48, 48])

With images of different sizes, the outputs are:

torch.Size([4096, 48, 48])
Image sizes tensor([585, 640], device='cuda:0')
torch.Size([4096, 44, 48])

torch.Size([4096, 48, 48])
Image sizes tensor([640, 422], device='cuda:0')
torch.Size([4096, 48, 32])

These mismatched shapes produce the stacking error above. It looks like the only workaround is to resize the images in the batch to a common size before passing them to the processor.
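
For reference, both the unpadded shapes and the token counts in the traceback can be reproduced with plain arithmetic. This is only a sketch of my reading of unpad_image and the surrounding forward code, not the library source: the helper names are hypothetical, and the assumptions are that one "newline" feature is appended per row after unpadding and that the downscaled base image contributes 24 x 24 = 576 features.

```python
def unpadded_hw(original_hw, current_hw=(48, 48)):
    """Height and width of the feature map after the padded border is cropped away."""
    orig_h, orig_w = original_hw
    cur_h, cur_w = current_hw
    if orig_w / orig_h > cur_w / cur_h:
        # Image is wider than the grid, so the padding sits at the top and bottom.
        new_h = int(orig_h * (cur_w / orig_w))
        pad = (cur_h - new_h) // 2
        return cur_h - 2 * pad, cur_w
    # Otherwise the padding sits at the left and right.
    new_w = int(orig_w * (cur_h / orig_h))
    pad = (cur_w - new_w) // 2
    return cur_h, cur_w - 2 * pad


def image_token_count(original_hw, current_hw=(48, 48), base_tokens=576):
    """Features per image: base image plus unpadded grid with one newline per row (assumed)."""
    h, w = unpadded_hw(original_hw, current_hw)
    return base_tokens + h * (w + 1)


if __name__ == "__main__":
    print(unpadded_hw((585, 640)), image_token_count((585, 640)))  # (44, 48) 2732
    print(unpadded_hw((640, 422)), image_token_count((640, 422)))  # (48, 32) 2160
```

The two counts match the [2732, 4096] and [2160, 4096] entries in the error message, so the per-image sequence length really is a function of the original aspect ratio.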

Padding the image features with zeros could work, but I am not sure whether that would give the intended outputs.
