Is a special instruction needed to trigger the OCR localization function?

#38
by liupei0408 - opened

As the title says: is a special instruction needed for the OCR localization feature of Fuyu-8b to get the same result as shown in the blog?

Hi @liupei0408, @Nooodles: you can try this with the new release of transformers! @pcuenq worked on the bbox postprocessing, so you can localise text by doing:

from PIL import Image
import requests
import io
from transformers import FuyuForCausalLM, FuyuProcessor

# Load the processor and the model
pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')

# The prompt is the localization instruction followed by the text to locate ("Williams")
bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n Williams"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))

model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')

outputs = model.generate(**model_inputs, max_new_tokens=10)
# Convert the generated box tokens back to coordinates before decoding
post_processed_bbox_tokens = processor.post_process_box_coordinates(outputs)[0]
model_outputs = processor.decode(post_processed_bbox_tokens, skip_special_tokens=True)
# Keep only the text after the '\x04' separator, i.e. the generated answer
prediction = model_outputs.split('\x04', 1)[1] if '\x04' in model_outputs else ''

prediction will then contain the coordinates of the text "Williams" in the image.
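Since prediction is a plain string, it usually has to be parsed before the box can be used. Here is a minimal sketch, assuming the decoded text has the form <box>top, left, bottom, right</box> (my understanding of what the post-processing emits; please check it against your own output). The helpers parse_box and to_pil_box are hypothetical names, not part of transformers, and the sample string is made up rather than real model output.

```python
import re
from typing import Optional, Tuple

# Assumption: the decoded prediction looks like "<box>top, left, bottom, right</box>".
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_box(prediction: str) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) from the first box found, or None."""
    match = BOX_PATTERN.search(prediction)
    if match is None:
        return None
    top, left, bottom, right = (int(group) for group in match.groups())
    return top, left, bottom, right


def to_pil_box(box: Tuple[int, int, int, int]) -> Tuple[int, int, int, int]:
    """Reorder (top, left, bottom, right) into PIL's (left, upper, right, lower)."""
    top, left, bottom, right = box
    return left, top, right, bottom


# Made-up sample string for illustration only, not actual model output.
sample_prediction = "<box>10, 20, 30, 40</box>"
box = parse_box(sample_prediction)
if box is not None:
    print(to_pil_box(box))
```

The reordered tuple can be passed to PIL calls such as Image.crop or ImageDraw.rectangle to inspect the region on bbox_image_pil. If parse_box returns None, the model probably did not emit a complete box within max_new_tokens.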

