Image-Text to Text

Image-text-to-text models take in an image and a text prompt and output text. These models are also called vision-language models, or VLMs. Unlike image-to-text models, they accept an additional text input, so they are not restricted to specific use cases such as image captioning, and they may also be trained to accept a conversation as input.
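
For illustration, a single user turn in such a conversation pairs an image with a text instruction. The sketch below only shows the shape of that input; the image URL and the question are placeholder values, and the full runnable request appears in the API example further down.

conversation = [
	{
		"role": "user",
		"content": [
			{"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder image
			{"type": "text", "text": "What is happening in this photo?"},  # placeholder prompt
		],
	}
]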

For more details about the image-text-to-text task, check out its dedicated page! You will find examples and related materials.

Recommended models

Explore all available models and find the one that suits you best here.

Using the API

Python

from huggingface_hub import InferenceClient

client = InferenceClient(api_key="hf_***")

image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"

for message in client.chat_completion(
	model="meta-llama/Llama-3.2-11B-Vision-Instruct",
	messages=[
		{
			"role": "user",
			"content": [
				{"type": "image_url", "image_url": {"url": image_url}},
				{"type": "text", "text": "Describe this image in one sentence."},
			],
		}
	],
	max_tokens=500,
	stream=True,
):
	print(message.choices[0].delta.content, end="")

To use the Python client, see huggingface_hub’s package reference.
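
If you prefer to call the HTTP endpoint directly, the same request can be sent with requests. The sketch below assumes the conversational route from the Chat Completion API, i.e. the /v1/chat/completions path appended to the model URL, and asks for a single, non-streamed response.

import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct/v1/chat/completions"
headers = {"Authorization": "Bearer hf_***"}

image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"

payload = {
	"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
	"messages": [
		{
			"role": "user",
			"content": [
				{"type": "image_url", "image_url": {"url": image_url}},
				{"type": "text", "text": "Describe this image in one sentence."},
			],
		}
	],
	"max_tokens": 500,
}

# The response follows the chat completion format; the generated text sits in choices[0].message.content.
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])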

API specification

For the API specification of conversational image-text-to-text models, please refer to the Chat Completion API documentation.
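
As a quick orientation while you consult that page, a non-streamed chat completion response carries the generated text under choices[0].message.content. The values in this sketch are illustrative, not real model output.

# Illustrative response shape (values are made up for this example)
example_response = {
	"choices": [
		{
			"index": 0,
			"message": {"role": "assistant", "content": "A description of the image."},
			"finish_reason": "stop",
		}
	],
}

generated_text = example_response["choices"][0]["message"]["content"]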
