Transformers documentation
Multimodal chat templates
Multimodal chat models accept inputs like images, audio or video, in addition to text. The content key in a multimodal chat history is a list containing multiple items of different types. This is unlike text-only chat models, whose content key is a single string.
In the same way the Tokenizer class handles chat templates and tokenization for text-only models, the Processor class handles preprocessing, tokenization and chat templates for multimodal models. Their apply_chat_template() methods are almost identical.
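To make the difference concrete, here is a minimal sketch of the two message formats (the image URL is the one used in the examples below; the wording of the messages is illustrative only).

# Text-only chat: "content" is a plain string.
text_only_message = {"role": "user", "content": "What are these?"}

# Multimodal chat: "content" is a list of typed items (text, image, video, audio).
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
        {"type": "text", "text": "What are these?"},
    ],
}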
This guide will show you how to chat with multimodal models with the high-level ImageTextToTextPipeline and at a lower level using the apply_chat_template() and generate() methods.
ImageTextToTextPipeline
ImageTextToTextPipeline is a high-level image and text generation class with a “chat mode”. Chat mode is enabled when a conversational model is detected and the chat prompt is properly formatted.
Add image and text blocks to the content key in the chat history.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
Create an ImageTextToTextPipeline and pass the chat to it. For large models, setting device_map="auto" loads the model more quickly and automatically places it on the fastest available device. Setting dtype="auto" also helps save memory and improve speed.
import torch
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", dtype="auto")
out = pipe(text=messages, max_new_tokens=128)
print(out[0]['generated_text'][-1]['content'])
Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
Aside from the gradual descent from pirate-speak into modern American English (it is only a 3B model, after all), this is correct!
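Because the pipeline returns the full chat, you can keep the conversation going by reusing the returned message list, appending a new user turn, and calling the pipeline again. A short sketch (the follow-up question is just an example):

# Continue the conversation: reuse the returned chat and add a follow-up user turn.
chat = out[0]["generated_text"]
chat.append({"role": "user", "content": [{"type": "text", "text": "What breed might they be?"}]})
out = pipe(text=chat, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])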
Using apply_chat_template
Like text-only models, use the apply_chat_template() method to prepare the chat messages for multimodal models. This method handles the tokenization and formatting of the chat messages, including images and other media types. The resulting inputs are passed to the model for generation.
from transformers import AutoProcessor, AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", dtype="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
Pass messages to apply_chat_template() to tokenize the input content. Unlike text models, the output of apply_chat_template contains a pixel_values key with the preprocessed image data, in addition to the tokenized text.
processed_chat = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
print(list(processed_chat.keys()))
['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw']
Pass these inputs to generate().
out = model.generate(**processed_chat.to(model.device), max_new_tokens=128)
print(processor.decode(out[0]))
The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user.
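One common way to do this (a sketch, assuming the prompt length equals the number of input tokens) is to slice off the prompt before decoding.

# Decode only the newly generated tokens by slicing off the prompt.
prompt_length = processed_chat["input_ids"].shape[1]
generated_tokens = out[0][prompt_length:]
print(processor.decode(generated_tokens, skip_special_tokens=True))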
Video inputs
Some vision models also support video inputs. The message format is very similar to the format for image inputs.
- The content "type" should be "video" to indicate the content is a video.
- The video source can be a link to the video ("url") or a file path ("path"). Videos loaded from a URL can only be decoded with PyAV or Decord.
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you've already preprocessed or decoded video frames in memory (for example, with OpenCV, Decord, or torchvision), so you don't need to save them to a file or host them at a URL.
Loading a video from "url" is only supported by the PyAV or Decord backends.
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]
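If the video file is stored locally, the message can point to it with a "path" key instead of "url". A minimal sketch with a placeholder file path:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},  # placeholder path, replace with a real file
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]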
Example: Passing decoded video objects
import numpy as np

# Randomly generated frames standing in for a decoded video: 16 frames of 224x224 RGB.
video_object1 = np.random.randint(0, 255, size=(16, 224, 224, 3), dtype=np.uint8)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_object1},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]
You can also use the existing load_video() function to load a video, edit it in memory, and pass it in the messages.
# Make sure a video backend library (pyav, decord, or torchvision) is available.
from transformers.video_utils import load_video
# load a video file in memory for testing
video_object2, _ = load_video(
    "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
)
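# Optional, illustrative in-memory edit: for example, keep every other frame
# to halve the frame count before building the chat messages.
video_object2 = video_object2[::2]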
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_object2},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]
Pass messages to apply_chat_template() to tokenize the input content. There are a few extra parameters to include in apply_chat_template() that control how the video is loaded and sampled.
The video_load_backend parameter selects the framework used to load the video. It supports PyAV, Decord, OpenCV, and torchvision. The example below uses Decord as the backend because it is a bit faster than PyAV.
The num_frames parameter controls how many frames to uniformly sample from the video. Each checkpoint has a maximum frame count it was pretrained with, and exceeding this count can significantly lower generation quality. It's important to choose a frame count that fits both the model capacity and your hardware resources. If num_frames isn't specified, the entire video is loaded without any frame sampling.
processed_chat = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=32,
    video_load_backend="decord",
)
print(processed_chat.keys())
These inputs are now ready to be used in generate().
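As in the image example, pass the processed inputs to generate() and decode the result. A short sketch that also trims the prompt tokens before decoding:

out = model.generate(**processed_chat.to(model.device), max_new_tokens=128)

# Decode only the newly generated tokens.
prompt_length = processed_chat["input_ids"].shape[1]
print(processor.decode(out[0][prompt_length:], skip_special_tokens=True))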