Dec 13, 2023

I followed the README and ran this model the same way as google/pix2struct-textcaps-base.

import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

#url = "https://www.ilankelman.org/stopsigns/australia.jpg"
url = "http://nos.netease.com/ct-test/0cfec85e-f220-45a7-9a22-2a4353faabd6.png"
image = Image.open(requests.get(url, stream=True).raw)

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-large")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-large")
# captioning mode: no question header is rendered onto the image
processor.image_processor.is_vqa = False

# image only

inputs = processor(images=image, return_tensors="pt")

predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))

However, the model returns 'send message' instead of the XML structure of the screenshot.

ArthurZ

Google org Dec 13, 2023

cc @ybelkada

123Lim changed discussion title from "Messing a sample to describe the way to use this model." to "Missing a sample to describe the way to use this model." Dec 13, 2023