Can PaliGemma answer multiple questions about a single image?

#6
by makemecker - opened

Hello everyone,

I am interested in knowing if it is possible to ask multiple questions about a single image using this model.

Specifically, I am looking to:

1. Input an image into the model.
2. Ask several different questions related to the content of the image.
3. Receive an accurate, contextually relevant answer for each question.
Has anyone tried this before? If so, could you please share your experience and any sample code or guidelines on how to achieve this? Any tips on optimizing the performance for such tasks would also be highly appreciated.

Thank you in advance for your help!

selamw (Google org)

Here's how you can achieve this effectively (you can find example code for both approaches in this gist):

[Image: car.jpg]

1. Leverage batch processing:

import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Load the model and processor once up front (the checkpoint shown is one of the
# public PaliGemma mix checkpoints; substitute whichever one you use).
model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
image = Image.open("car.jpg")

prompts = [
    'What is the year, make, and model of the car in the image?\n',
    'What color is the house in the background, and how many doors do you see?\n',
    'What color are the doors?\n',
]
images = [image] * len(prompts)  # Same image for every question

model_inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True, truncation=True).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]  # Get input length

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Assuming the model returns outputs for each prompt in the batch
    generated_texts = generation[:, input_len:]
    decoded_texts = processor.batch_decode(generated_texts, skip_special_tokens=True)

for prompt, decoded_text in zip(prompts, decoded_texts):
    print(f"Question: {prompt}")
    print(f"Answer: {decoded_text}\n")

Output:

Question: What is the year, make, and model of the car in the image?
Answer: The year is 1965, the make is a Volkswagen Beetle, and the model is a 1965 Volkswagen Beetle.

Question: What color is the house in the background, and how many doors do you see?
Answer: yellow 2

Question: What color are the doors?
Answer: brown
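A note on why the slicing step works: `generate` returns the prompt tokens followed by the new tokens, and because `padding=True` pads every prompt in the batch to the same length, a single `input_len` offset strips the echoed prompt from every row at once. Here is a minimal torch-only sketch of that idea (the token IDs are made up; no model is loaded):

```python
import torch

# Pretend these are the padded prompt token IDs for two questions.
# padding=True guarantees both rows have the same length.
input_ids = torch.tensor([[0, 11, 12],
                          [0, 21, 22]])
input_len = input_ids.shape[-1]  # 3

# Pretend model.generate() returned each prompt with two newly generated tokens appended.
generation = torch.tensor([[0, 11, 12, 101, 102],
                           [0, 21, 22, 201, 202]])

# Dropping the first input_len columns strips the echoed prompt from every row at once.
answers = generation[:, input_len:]
print(answers.tolist())  # [[101, 102], [201, 202]]
```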
2. Iteration (one question at a time):
prompts = [
    'What is the year, make, and model of the car in the image?\n',
    'What color is the house in the background, and how many doors do you see?\n',
    'What color are the doors?\n',
]
images = [image] * len(prompts)  # Same image for every question

for prompt, image_data in zip(prompts, images):
    model_inputs = processor(text=[prompt], images=[image_data], return_tensors="pt", padding=True).to(model.device)
    input_len = model_inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
        generation = generation[0][input_len:]
        decoded = processor.decode(generation, skip_special_tokens=True)
        print(f"Question: {prompt}")
        print(f"Answer: {decoded}\n")

Output:

Question: What is the year, make, and model of the car in the image?
Answer: The year is 1965, the make is a Volkswagen Beetle, and the model is a 1965 Volkswagen Beetle.

Question: What color is the house in the background, and how many doors do you see?
Answer: yellow 2

Question: What color are the doors?
Answer: brown
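Either approach can be wrapped in a small helper so the rest of your code just passes a list of questions and gets back a question-to-answer mapping. In this sketch, `run_model` is a hypothetical stand-in (not part of the code above) for the processor -> generate -> decode pipeline, so the helper's shape can be shown and tested without loading PaliGemma:

```python
from typing import Callable, Dict, List


def ask_all(image, questions: List[str], run_model: Callable) -> Dict[str, str]:
    """Ask several questions about one image; map each question to its answer.

    run_model(image, question) is assumed to wrap the processor -> generate ->
    decode steps shown above and return a single answer string.
    """
    return {q: run_model(image, q) for q in questions}


# Usage with a fake model, just to illustrate the call pattern:
fake = lambda image, q: f"answer to: {q.strip()}"
print(ask_all("car.jpg", ["What color is the car?\n"], fake))
```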

Hello @selamw ,

Thank you so much for your detailed and insightful response to my question!

makemecker changed discussion status to closed
