For the vqav2 data set example "fish and carrot", why does the model output a sentence instead of a phrase?

#44
by changgeli - opened

My output:"The image features two plates of food, each containing different types of food. One plate contains fish, carrots, and potatoes, while the other plate contains fish, lemon, and chicken."

The blog example:"fish, carrots"

Is there something wrong with the way I'm using it?
text_prompt = "What type of foods are in the image?\n"
image_path = "pictures/fish_and_carrots.png"
image_pil = Image.open(image_path)

model_inputs = processor(text=text_prompt, images=[image_pil], device="cuda:0")
for k, v in model_inputs.items():
    model_inputs[k] = v.to("cuda:0")

generation_output = model.generate(**model_inputs, max_new_tokens=50)
generation_text = processor.batch_decode(generation_output[:, -50:], skip_special_tokens=True)
print("generation_text:", generation_text)

Hi @changgeli ! With the new transformers release, fuyu-8b's processor has been updated and we've also got more information on prompt structure. You have to prompt it to answer in a VQA fashion.
Here, try this:

from PIL import Image
import requests
import io
from transformers import FuyuForCausalLM, FuyuProcessor


pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')


text_prompt = "Answer the following VQAv2 question based on the image: What type of foods are in the image?"
fish_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/fish_carrots.png"
fish_image_pil = Image.open(io.BytesIO(requests.get(fish_image_url).content))
model_inputs = processor(text=text_prompt, images=fish_image_pil).to('cuda')


generation_output = model.generate(**model_inputs, max_new_tokens=10)
model_outputs = processor.batch_decode(generation_output[:, -10:], skip_special_tokens=True)[0]
# the model's answer comes after the '\x04' token
prediction = model_outputs.split('\x04 ', 1)[1] if '\x04' in model_outputs else ''

You should see "fish, carrots, lemon" as a prediction.

Thank you for the comprehensive reply! Are there any recommended prompts to use for other datasets or tasks?

You're welcome :) As far as I know, to best match the blog post examples, you can try "Answer the following DocVQA question based on the image." or the one I shared above; both seem to match well!

from PIL import Image
import requests
import io
from transformers import FuyuForCausalLM, FuyuProcessor

pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')


text_prompt = "Answer the following DocVQA question based on the image. \n Which is the metro in California that has a good job Outlook?"
jobs_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/jobs.png"
jobs_image_pil = Image.open(io.BytesIO(requests.get(jobs_image_url).content))

second_text_prompt = "Answer the following DocVQA question based on the image. \n What is the maximum male life expectancy?"
chart_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/chart.png"
chart_image_pil = Image.open(io.BytesIO(requests.get(chart_image_url).content))

third_text_prompt = "Answer the following DocVQA question based on the image. \n What sport is that?"
skate_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/skateboard.png"
skate_image_pil = Image.open(io.BytesIO(requests.get(skate_image_url).content))

fourth_text_prompt = "Answer the following DocVQA question based on the image. \n What was the fair amount of paid vacation days in the United Kingdom?"
vacations_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/vacation_days_hr.png"
vacations_image_pil = Image.open(io.BytesIO(requests.get(vacations_image_url).content)).convert('RGB')

fifth_text_prompt = "Answer the following VQAv2 question based on the image: What type of foods are in the image?"
fish_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/fish_carrots.png"
fish_image_pil = Image.open(io.BytesIO(requests.get(fish_image_url).content))


texts = [text_prompt, second_text_prompt, third_text_prompt, fourth_text_prompt, fifth_text_prompt]
images = [jobs_image_pil, chart_image_pil, skate_image_pil, vacations_image_pil, fish_image_pil]

model_inputs = processor(text=texts, images=images).to('cuda')


model_outputs = processor.batch_decode(model.generate(
    **model_inputs, max_new_tokens=10)[:, -10:], skip_special_tokens=True)

ground_truths = ['Los Angeles', '80.7', 'skateboarding', '28', 'fish, carrots, lemon']


for ground_truth, model_output in zip(ground_truths, model_outputs):
    prediction = model_output.split('\x04 ', 1)[1] if '\x04' in model_output else ''
    assert ground_truth == prediction
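
If you'd rather eyeball the answers than assert on them, a quick loop over the same variables (just a sketch, using the names above) prints each question next to its prediction:

# optional: print each question next to its prediction instead of asserting
for text, ground_truth, model_output in zip(texts, ground_truths, model_outputs):
    prediction = model_output.split('\x04 ', 1)[1] if '\x04' in model_output else ''
    print(f"Q: {text}\nprediction: {prediction}\nexpected: {ground_truth}\n")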

@changgeli I also forgot to include bounding box localisation and guided OCR; for that, you can do:


# bbox contents prediction (OCR)

pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')


bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n<box>388, 428, 404, 488</box>"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))
model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')


model_outputs = processor.batch_decode(model.generate(
    **model_inputs, max_new_tokens=10)[:, -10:], skip_special_tokens=True)[0]
prediction = model_outputs.split('\x04', 1)[1] if '\x04' in model_outputs else ''

# bbox localisation from text

pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')


bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n Williams"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))
model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')

outputs = model.generate(**model_inputs, max_new_tokens=10)
# post-process the generated box tokens into readable "<box>...</box>" coordinates
post_processed_bbox_tokens = processor.post_process_box_coordinates(outputs)[0]
model_outputs = processor.decode(post_processed_bbox_tokens, skip_special_tokens=True)
prediction = model_outputs.split('\x04', 1)[1] if '\x04' in model_outputs else ''
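
If you then want the box as numbers rather than a string, you can pull the coordinates out of the "<box>...</box>" tag yourself; this parse_box helper is just a sketch I'm adding here, not part of the processor API:

import re

# pull the four integers out of a "<box>a, b, c, d</box>" string (returns None if no box was generated)
def parse_box(prediction_text):
    match = re.search(r"<box>(\d+),\s*(\d+),\s*(\d+),\s*(\d+)</box>", prediction_text)
    return tuple(int(coord) for coord in match.groups()) if match else None

print(parse_box(prediction))  # a 4-tuple of ints, or None if no box came back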

The last example, "bbox localisation from text", seems not to output the '\x04' symbol, so it cannot be parsed.

Hello @changgeli ! Are you using the same input image and prompt? Can you verify what the decoded result is before the split?

Thank you for the reply! The '\x04' symbol is actually in the output; I had used '\x04 ' (with a trailing space) instead of '\x04' as the delimiter, so the parsing failed.
My question is: why do some predictions start with a space, while others do not?

the decoded result of bbox localisation from text:
...||SPEAKER||SPEAKER||SPEAKER||NEWLINE|<s> When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n Williams\x04<box>388, 428, 404, 900</box>'

the decoded result of bbox contents prediction (OCR):
...||SPEAKER||SPEAKER||SPEAKER||NEWLINE|<s> When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n<box>388, 428, 404, 488</box>\x04 Williams'

Did you try tweaking the generation parameters? I think we mostly tried greedy decoding; sampling might change what you encounter.
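
And if the leading space is the only thing getting in the way, a slightly more forgiving parse (just a sketch) is to split on the raw '\x04' and strip whatever whitespace follows it:

# split on the raw '\x04' delimiter and strip any whitespace around the answer
prediction = model_outputs.split('\x04', 1)[1].strip() if '\x04' in model_outputs else ''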
