Error: If you are using a VisionEncoderDecoderModel, you must provide a feature extractor

#16
opened by aaly

While trying to query a serverless instance of this model, I got the following output:

{'error': 'If you are using a VisionEncoderDecoderModel, you must provide a feature extractor', 'warnings': ['There was an inference error: If you are using a VisionEncoderDecoderModel, you must provide a feature extractor']}

My source code:

import requests
import base64

API_URL = "https://api-inference.huggingface.co/models/naver-clova-ix/donut-base-finetuned-docvqa"
headers = {"Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}

def query(payload):
    # Read the image file and swap the path for its base64-encoded contents
    with open(payload["inputs"]["image"], "rb") as f:
        img = f.read()
    payload["inputs"]["image"] = base64.b64encode(img).decode("utf-8")
    # POST the JSON payload to the hosted Inference API
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": {
        "image": "image.jpg",
        "question": "What is in this image?"
    },
})

print(output)

From GPT-4:
Your source code sends a question about an image to a pretrained model on Hugging Face via the API, but it is missing a crucial component for the Donut model: preprocessing the image through a feature extractor. The error message indicates that the model expects image data to be preprocessed by a feature extractor before it is passed to the model.

To fix this, you'll need to use a feature extractor compatible with the Donut model. This isn't a one-line change to your existing snippet: it involves integrating a feature-extraction library compatible with Donut and preprocessing the image accordingly before encoding it in base64.

Here's a high-level approach to adjust your implementation:

Integrate a Feature Extractor: Identify and integrate a feature extractor that is compatible with the Donut model. This might involve using additional libraries or tools provided by Hugging Face or related to the Swin Transformer or BART model architectures.

Preprocess the Image: Use the feature extractor to preprocess your image before encoding it. This transforms the image into the format the model expects (see the sketch after this list).

Adjust the Payload: After preprocessing and encoding the image, adjust your payload to include the processed image data in the format expected by the API.
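For example, steps 1 and 2 boil down to something like this (a minimal sketch using the DonutProcessor that ships with transformers, and assuming "image.jpg" is your input file):

from PIL import Image
from transformers import DonutProcessor

# DonutProcessor bundles the image feature extractor and the tokenizer
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

image = Image.open("image.jpg").convert("RGB")

# The processor resizes and normalizes the image into the tensor the model expects
pixel_values = processor(image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # e.g. torch.Size([1, 3, 2560, 1920]) for this checkpoint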

Since the exact feature extractor and its implementation details depend on the Donut model's requirements, refer to the model's documentation on Hugging Face or the original paper for specific guidance on the compatible feature extraction process.

If your serverless environment allows, consider integrating the necessary libraries for image preprocessing and ensure that the payload sent to the API matches the expected format after feature extraction.

So basically... you can't throw a raw image at Donut and expect an output; you need to integrate a feature extractor compatible with Donut, either by using other pretrained models or libraries that process images in a way Donut can understand.

Probably check the research paper for more info on that... or wait till I'm done reading it and I'll give you feedback.

Okay, so Donut has a built-in processor (DonutProcessor) for your image input...

https://huggingface.co/docs/transformers/main/en/model_doc/donut#inference-examples

You should use this to preprocess your image file(s) before passing them on to Donut.
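For reference, the inference example in the docs linked above boils down to roughly the following (a condensed sketch based on that page; swap in your own image path and question):

import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Preprocess the image with the built-in processor
image = Image.open("image.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# DocVQA expects the question wrapped in a task prompt
question = "What is in this image?"
prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Strip special tokens, drop the task start token, and parse the output into JSON
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))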
Best of Luck
