How can I perform batch inference for Llama-3.2 Vision?

#68
by AnanyaDB - opened

I am trying to run batch inference on a list of images, but it keeps hitting an OOM error. What is the best way to perform batch inference with this model? Any code reference is appreciated.

Below is the code I am using to test batching:

# Process inputs in batch
batch_inputs = llama_processor(
    images=[image_list[0], image_list[1]],
    text=input_text_list[0:2],
    return_tensors="pt",
    padding=True,
).to(llama_model.device)

# input_text_list is a list of prompts
# image_list is a list of images

The response for the second input comes out as garbage; only the first output is correct. What am I doing wrong?
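One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: a flat list such as [image_list[0], image_list[1]] may be read by the processor as two images belonging to a single prompt, which would leave the second prompt without an image of its own. A minimal sketch of grouping the images per prompt as a nested list follows. The helper name group_images_per_prompt is hypothetical, and the commented call reuses llama_processor, llama_model, image_list, and input_text_list from the post above.

```python
from typing import Any, List, Sequence


def group_images_per_prompt(images: Sequence[Any], prompts: Sequence[str]) -> List[List[Any]]:
    """Wrap each image in its own list so that image i stays paired with prompt i.

    Hypothetical helper: assumes exactly one image per prompt.
    """
    if len(images) != len(prompts):
        raise ValueError("expected exactly one image per prompt")
    # One inner list per prompt: [[img0], [img1], ...]
    return [[image] for image in images]


# Assumed usage with the names from the post above (not executed here):
# batch_inputs = llama_processor(
#     images=group_images_per_prompt(image_list[0:2], input_text_list[0:2]),
#     text=input_text_list[0:2],
#     return_tensors="pt",
#     padding=True,
# ).to(llama_model.device)
```

The trailing <|finetune_right_pad_id|> tokens in the output are padding added after the shorter sequence finished; decoding with skip_special_tokens=True should drop them.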

"udes sensitive information such as an email address or physical address. Is there anything else I can help you with?<|eot_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_"

I have the same issue.
From https://github.com/meta-llama/llama-recipes/issues/701 I see: "Our recommendation is to run inference with 1 image at a time, you might see a degradation in response quality with more images."
Even if degraded quality is not a deal breaker, in the batched inference I implemented the model apparently does not see the other images:
OUTPUT:
{
'006aaee9-7849-47a5-ad14-0bfa4f8decf8.png': {
'Llama_3_2_11B': "user\n\nCan you please briefly describe this image?assistant\n\nThe image depicts a group of people ascending a white airplane's stairs, set against a gray sky. The individuals ...."
},
'ed4a423b-30ff-4c60-b00f-50905a3d417c.png': {
'Llama_3_2_11B': 'user\n\nCan you please briefly describe this image?assistant\n\nI don’t have enough information to describe the image. Can you please provide more context or information about the image?'
},
'384620ec-bb40-4025-80bb-57ab6b446c8a.png': {
'Llama_3_2_11B': "user\n\nCan you please briefly describe this image?assistant\n\nUnfortunately, I don't have the ability to see or access images. You can describe the image to me, and I'll do my best to help! Is it an image of a famous person, a sports event, or something else?"
},
'1930e41a-475a-4845-b37b-5829931d4b27.png': {
'Llama_3_2_11B': "user\n\nCan you please briefly describe this image?assistant\n\nUnfortunately, I don't have the ability to see or access images. You can describe the image to me, and I'll do my best to help. What's the image of? Is it a pool, and if so, what's its condition?"
},
'473e07a0-5d0b-4bfb-af75-a3ba1436d597.png': {
'Llama_3_2_11B': "user\n\nCan you please briefly describe this image?assistant\n\nUnfortunately, I don't have the ability to see or access images. You can describe the image to me, and I'll do my best to help! What's the image of? A woman at a concert or festival, perhaps?"
}
}

Exactly. The inference is only correct for the first image; from the second image onward it does not give any usable result.
This is a deal breaker for many batch use cases where images have to be processed at scale. Is there any roadmap for batch processing?
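Until batching behaves as expected, one workaround for the OOM side of the problem is to split the work into small fixed-size batches (down to a batch size of 1, as the linked issue recommends) and run them one after another. A minimal sketch, assuming the inputs fit in memory as Python lists; the helper name chunked and the batch size are hypothetical choices, not part of any library.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Assumed usage with the names from the first post (not executed here);
# batch_size=1 corresponds to running one image at a time:
# for images, prompts in zip(chunked(image_list, 1), chunked(input_text_list, 1)):
#     ...  # build inputs for this small batch and call generate on it
```

A smaller batch size trades throughput for lower peak memory, so it can be raised step by step until the OOM error reappears.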
