HuggingFaceM4/idefics-9b-instruct · Inference on HF Endpoints API?

I have this model running on the Endpoints API, but I can't get it to accept BOTH text and image inputs simultaneously.

What is the required schema?

I also asked here: https://github.com/huggingface/api-inference-community/issues/336

I got close, but it seems it only accepts a single string as input because it's part of the "Text-Generation" family of models.

import base64
import json
import os

import requests

# Assumption: the token is read from an HF_TOKEN environment variable.
HF_TOKEN = os.environ["HF_TOKEN"]

img_url = "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG"
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceM4/idefics-9b-instruct"
headers = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type": "application/json",
}

def query(image_url):
    # Base64 copy of the image; only used by the commented-out attempts below.
    image_bytes = requests.get(image_url).content
    encoded_image = base64.b64encode(image_bytes).decode("utf-8")
    data = {
        "inputs": image_url,
        # Other keys I tried:
        # "prompt": "What's in this image?",
        # "prompt": encoded_image,
        # "image": encoded_image,
        # "inputs": "What's in this image?",
    }
    json_data = json.dumps(data)
    print("my request", json_data)
    response = requests.post(API_URL, headers=headers, data=json_data)
    print("Response content:", response.content)
    return response.json()

print(query(img_url))

## The results seem nearly there! 
## [{'generated_text': 'https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPGScooby-Doo, Where Are You!'}]

from typing import List

# is_image, is_url and gradio_link are helpers that are not shown here.
def prompt_list_to_tgi_input(prompt_list: List[str]) -> str:
    """
    TGI expects a string that contains both text and images in the image
    markdown format (i.e. the `![]()` ). The image links are parsed on the TGI side.
    """
    result_string_input = ""
    for elem in prompt_list:
        if is_image(elem):
            if is_url(elem):
                result_string_input += f"![]({elem})"
            else:
                result_string_input += f"![]({gradio_link(img_path=elem)})"
        else:
            result_string_input += elem
    return result_string_input
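
Based on the docstring above, here is a minimal sketch of how the text and the image URL could be combined into the single "inputs" string. The names is_image_url, build_tgi_input and build_payload are made up for this example, the extension check is a simplification I chose, and I have not confirmed that the hosted endpoint parses the markdown image syntax the same way.

```python
import json
from typing import List
from urllib.parse import urlparse

# Assumption: an element counts as an image if it is an http(s) URL whose path
# ends in a common image extension. This stands in for the helpers not shown above.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def is_image_url(elem: str) -> bool:
    parsed = urlparse(elem)
    return (
        parsed.scheme in ("http", "https")
        and parsed.path.lower().endswith(IMAGE_EXTENSIONS)
    )


def build_tgi_input(prompt_list: List[str]) -> str:
    """Join text and image URLs into one string, wrapping images as ![](url)."""
    return "".join(
        f"![]({elem})" if is_image_url(elem) else elem for elem in prompt_list
    )


def build_payload(prompt_list: List[str], max_new_tokens: int = 64) -> str:
    """Serialize a text-generation request body with a single "inputs" string."""
    return json.dumps(
        {
            "inputs": build_tgi_input(prompt_list),
            "parameters": {"max_new_tokens": max_new_tokens},
        }
    )


img_url = "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG"
tgi_input = build_tgi_input(["What's in this image?", img_url])
print(tgi_input)
```

The string returned by build_payload could then be sent as the request body in place of json_data in the query function above, so the text and the image travel together in one "inputs" string.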