Handling multiple images of different dimensions in the same call

#1 opened by Koshti10

Hello. Thank you for releasing the HF versions of LLaVA v1.6!!

When I generate an output whose prompt includes multiple image tokens with llava-hf/llava-1.5-7b-hf, I get output even when the two images have different dimensions. Running the same with llava-hf/llava-v1.6-vicuna-7b-hf works perfectly when all images have the same dimensions, but when the dimensions differ it throws the following error: "stack expects each tensor to be equal size, but got [2144, 4096] at entry 0 and [2340, 4096] at entry 1". Will this be handled in a later version, or do the images always need to be preprocessed to the same dimensions? (A resize-based workaround is sketched after the snippet below.) The same happens with llava-hf/llava-v1.6-vicuna-13b-hf and llava-hf/llava-v1.6-mistral-7b-hf.

Here is the code snippet that I am running:

from transformers import AutoProcessor, AutoModelForVision2Seq
import requests
from PIL import Image

# Two images with different dimensions.
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# model_id = "llava-hf/llava-1.5-7b-hf"  # works with mixed dimensions
model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"  # fails with mixed dimensions

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto", load_in_4bit=True)  # 4-bit quantization via bitsandbytes

# One prompt containing two <image> tokens, in vicuna-style USER/ASSISTANT turns.
prompts = '''USER: <image>\nAre there any cats in the image? Answer with only "Yes" or "No". ASSISTANT: No\nUSER: <image>\nAre there any cats in the image? Answer with only "Yes" or "No". ASSISTANT:'''

# inputs = processor(prompts, images=[image1, image1], padding=True, return_tensors="pt").to("cuda")  # same image twice: works
# inputs = processor(prompts, images=[image2, image1], padding=True, return_tensors="pt").to("cuda")  # mixed dimensions: fails
inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to("cuda")  # mixed dimensions: fails

output = model.generate(**inputs, max_new_tokens=200)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(generated_text)
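
For reference, here is a minimal sketch of the resize workaround mentioned above. It assumes the stack error comes from differently sized images producing different numbers of patch embeddings, which matches the observation that same-sized images work; the resize target common_size is just the first image's size, chosen for illustration:

# Workaround sketch (assumption: forcing both images to the same pixel
# dimensions makes the processor emit equal-sized patch tensors).
common_size = image1.size  # PIL size is (width, height)
image2_resized = image2.resize(common_size)

inputs = processor(prompts, images=[image1, image2_resized], padding=True, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(output, skip_special_tokens=True))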
Llava Hugging Face org

Hi,

Thanks for your interest in this model. Support for batched generation will be added next; you can follow the progress here: https://github.com/huggingface/transformers/issues/29832
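
In the meantime, a minimal per-image sketch that sidesteps the stacking issue entirely, reusing the objects defined in the snippet above (one image and one <image> token per call; max_new_tokens is shortened since only a yes/no answer is expected):

# Interim sketch: one image per generate call, so tensors of unequal size
# are never stacked together.
single_prompt = 'USER: <image>\nAre there any cats in the image? Answer with only "Yes" or "No". ASSISTANT:'
for img in [image1, image2]:
    inputs = processor(single_prompt, images=[img], return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=20)
    print(processor.batch_decode(output, skip_special_tokens=True))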
