Wrong implementation of processor for multi-image inference

#9 opened by ruiqiRichard

According to the paper, when multiple images are passed in one conversation, each image should remain whole instead of being cropped into patches, so that the number of tokens is kept down.

(screenshot of the relevant section of the paper)

However, the HF implementation does not seem to handle this case. When I pass a prompt with six images, the processor outputs around 40k placeholder tokens, which is not correct: it is supposed to output only 729 placeholder tokens per image. This suggests the implementation simply extends the single-image path to the multi-image case, so each image is still cropped into patches.
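For reference, a quick back-of-the-envelope check of what the whole-image scheme would give (729 tokens per image is the figure from the paper; the image count matches the reproduction below):

# Expected placeholder count if each image stays whole (no anyres tiling),
# using the 729 tokens-per-image figure from the paper.
tokens_per_whole_image = 729
num_images = 6
print("expected:", num_images * tokens_per_whole_image)  # 4374, far below ~40k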

This is the code I used:
import requests
from PIL import Image

# `processor` is the AutoProcessor already loaded for this checkpoint.
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
        ] * 6,
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image] * 6, return_tensors='pt').to("cuda:0")

# 151646 is the `<image>` placeholder token id for this checkpoint.
print("number of tokens:", (inputs['input_ids'] == 151646).sum().item())

And the output is:
number of tokens: 39306
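As a side note, the hardcoded 151646 can be looked up from the tokenizer instead (a minimal sketch, assuming the placeholder token is spelled "<image>" for this checkpoint):

# Look up the placeholder token id instead of hardcoding it
# (assumes the placeholder token is literally "<image>").
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
print("number of tokens:", (inputs['input_ids'] == image_token_id).sum().item())

With six whole images at 729 tokens each, this should print 4374 rather than 39306.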
