google/pix2struct-widget-captioning-base · Missing example for running the model

Apr 11, 2023

This model needs a bounding box to specify which widget to describe, but the model card has no example of this.
What is unclear is how the bounding box should be specified.

As I understand it, the code should look something like this:

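from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")

# Screenshot that contains the widget ("screenshot.png" is a placeholder path)
image = Image.open("screenshot.png")

# Open question: how should the widget's bounding box be encoded here?
question = "? bounding box ?"

inputs = processor(images=image, text=question, return_tensors="pt")

predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))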
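A minimal sketch of one possible approach, resting on an assumption I have not verified against this checkpoint: that the target widget is indicated visually, by drawing its bounding box on the screenshot, instead of by passing coordinates as `text`. My reading of the Pix2Struct paper is that widget-level tasks are posed this way, but please check the paper before relying on it. The helper names `draw_widget_box` and `caption_widget`, the red 3-pixel outline, the `(x0, y0, x1, y1)` box convention and `max_new_tokens=50` are all my own choices, not anything documented on the model card.

```python
from PIL import Image, ImageDraw


def draw_widget_box(screenshot, box, normalized=False, color="red", width=3):
    """Return an RGB copy of `screenshot` with `box` outlined (hypothetical helper).

    `box` is (x0, y0, x1, y1) in pixels, or in fractions of the image size
    when `normalized=True`. This convention is an assumption, not from the model card.
    """
    image = screenshot.convert("RGB")  # convert() returns a new image; the input is untouched
    x0, y0, x1, y1 = box
    if normalized:
        w, h = image.size
        x0, x1 = x0 * w, x1 * w
        y0, y1 = y0 * h, y1 * h
    coords = tuple(round(v) for v in (x0, y0, x1, y1))
    if coords[2] <= coords[0] or coords[3] <= coords[1]:
        raise ValueError(f"expected x0 < x1 and y0 < y1, got {box}")
    ImageDraw.Draw(image).rectangle(coords, outline=color, width=width)
    return image


def caption_widget(model, processor, screenshot, box, **box_kwargs):
    """Caption the widget inside `box` (hypothetical helper).

    Assumption: the box is conveyed only by drawing it on the image,
    so no `text` argument is passed to the processor.
    """
    marked = draw_widget_box(screenshot, box, **box_kwargs)
    inputs = processor(images=marked, return_tensors="pt")
    # max_new_tokens=50 is an arbitrary cap chosen for this sketch
    predictions = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(predictions[0], skip_special_tokens=True)
```

With `model` and `processor` loaded as in the snippet above, usage would look like `caption_widget(model, processor, Image.open("screenshot.png"), (120, 340, 600, 420))`; the file name and coordinates are placeholders. If the captions still ignore the box, the outline color or thickness may simply not match whatever rendering the model was fine-tuned on.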

sunjae1294

Jun 27, 2023

Same issue here.
The model seems to return the same caption regardless of the bounding box.
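One way to check where the bounding-box text actually goes, offered as a hedged sketch. In the transformers versions I have looked at, `Pix2StructProcessor` renders `text` onto the image as a header only when `processor.image_processor.is_vqa` is true; otherwise it tokenizes `text` and returns it as `decoder_input_ids`, i.e. a prefix for the generated caption rather than something the encoder sees. I have not confirmed which setting this checkpoint ships with. The helper `text_routing` below is hypothetical and only inspects the keys of the processor output.

```python
def text_routing(inputs):
    """Classify where a Pix2Struct processor put its `text` argument (hypothetical helper).

    `inputs` is the dict-like object returned by
    processor(images=..., text=..., return_tensors="pt").
    """
    keys = set(inputs.keys())
    if "flattened_patches" not in keys:
        return "no image inputs"
    if "decoder_input_ids" in keys:
        # Non-VQA path: text was tokenized as a decoder prefix, never shown to the encoder
        return "decoder prefix"
    # VQA path (text rendered as a header on the image), or no text was given
    return "encoder only"
```

Calling `text_routing(processor(images=image, text=question, return_tensors="pt"))` with the objects from the first post would show which path applies. A result of `"decoder prefix"` would mean the box string never reaches the image encoder, which would be one possible explanation for the caption not depending on it.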

HaiminWang

Oct 13, 2024

Has anyone solved it yet?