Image-Text to Text
Image-text-to-text models take in an image and a text prompt and output text. These models are also called vision-language models, or VLMs. Unlike image-to-text models, they take an additional text input, so they are not restricted to specific use cases such as image captioning, and they may also be trained to accept a conversation as input.
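For example, a conversational request typically interleaves text and image parts inside the `messages` list. The sketch below shows the message structure only; the image URL and the conversation content are placeholders, not part of any specific model's documentation:

```python
# A minimal sketch of a multimodal conversation payload as accepted by
# OpenAI-compatible chat endpoints. The image URL and replies are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
    {"role": "assistant", "content": "A cat sitting on a windowsill."},
    {"role": "user", "content": "What color is the cat?"},
]
```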
For more details about the image-text-to-text task, check out its dedicated page! You will find examples and related materials.
Recommended models
- zai-org/GLM-4.5V: Cutting-edge reasoning vision language model.
Explore all available models on the Hub and find the one that suits you best.
Using the API
```python
import os

from openai import OpenAI

# The Hugging Face router exposes an OpenAI-compatible API;
# authenticate with a Hugging Face token.
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

# Send a single user turn containing both a text prompt and an image URL.
completion = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct:cerebras",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ],
)

print(completion.choices[0].message)
```
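The same OpenAI-compatible endpoint can also stream the response as it is generated. Below is a minimal sketch that reuses the client and model from the example above with `stream=True`; streaming behavior may vary by provider:

```python
# Stream the response incrementally instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct:cerebras",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta of the assistant message.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()
```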
API specification
For the API specification of conversational image-text-to-text models, please refer to the Chat Completion API documentation.
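For reference, the request made by the client above is a standard Chat Completion call. The sketch below sends the same payload over plain HTTP with `requests`; the endpoint path is inferred from the base URL used earlier and the payload follows the Chat Completion schema:

```python
import os

import requests

# Plain-HTTP equivalent of the client call above; the JSON body follows the
# Chat Completion API schema. The endpoint path is inferred from the base URL.
response = requests.post(
    "https://router.huggingface.co/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
    json={
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct:cerebras",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        },
                    },
                ],
            }
        ],
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```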