This model needs a bounding box to specify which widget to describe.
But there is no example for this on the model card.
What is unclear is how the bounding box should be specified.
As I understand it, the code should look something like this:
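from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
image = Image.open("screenshot.png")  # placeholder path, replace with your own screenshot
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")
question = "? bounding box ?"  # unclear: how should the bounding box be written here?
inputs = processor(images=image, text=question, return_tensors="pt")
predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))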
Same issue here.
The model seems to return the same caption regardless of the bounding box.
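One thing that might be worth trying, and this is an assumption rather than anything the model card confirms: instead of passing the box as text, draw it onto the screenshot and let the model see it in the image. Below is a minimal sketch of that idea. The helper names (to_pixel_box, draw_box), the normalized (x_min, y_min, x_max, y_max) box format, the red outline, and the line width are all hypothetical choices for illustration; the rendering the model was actually trained on may differ.

```python
def to_pixel_box(box, width, height):
    """Convert a normalized (x_min, y_min, x_max, y_max) box to integer pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    if not (0.0 <= x_min < x_max <= 1.0 and 0.0 <= y_min < y_max <= 1.0):
        raise ValueError(f"expected a normalized box with min < max, got {box}")
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )


def draw_box(image, box, color="red", line_width=3):
    """Return a copy of a PIL image with the normalized box outlined on it.

    The color and line width are guesses, not values taken from the model card.
    """
    # Imported here so to_pixel_box stays usable without Pillow installed.
    from PIL import ImageDraw

    marked = image.copy()  # leave the caller's image untouched
    pixel_box = to_pixel_box(box, *marked.size)  # PIL size is (width, height)
    ImageDraw.Draw(marked).rectangle(pixel_box, outline=color, width=line_width)
    return marked
```

With this, marked = draw_box(image, (0.1, 0.2, 0.5, 0.6)) gives an image that can be passed to the processor in place of the original screenshot. Comparing the captions for two clearly different boxes would show whether the model reacts to the drawn box at all.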