Transformers documentation
Multimodal chat templates
Multimodal chat models accept inputs like images, audio or video, in addition to text. The content key in a multimodal chat history is a list containing multiple items of different types. This is unlike text-only chat models, whose content key is a single string.
In the same way the Tokenizer class handles chat templates and tokenization for text-only models, the Processor class handles preprocessing, tokenization and chat templates for multimodal models. Their apply_chat_template() methods are almost identical.
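To make the difference concrete, here is a minimal sketch of the two message formats (the image URL is the one used in the examples below; the wording of the messages is illustrative only).

# Text-only chat: "content" is a plain string.
text_only_message = {"role": "user", "content": "What are these?"}

# Multimodal chat: "content" is a list of typed items (text, image, video, audio).
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
        {"type": "text", "text": "What are these?"},
    ],
}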
This guide will show you how to chat with multimodal models with the high-level ImageTextToTextPipeline and at a lower level using the apply_chat_template() and generate() methods.
ImageTextToTextPipeline
ImageTextToTextPipeline is a high-level image and text generation class with a “chat mode”. Chat mode is enabled when a conversational model is detected and the chat prompt is properly formatted.
Add image and text blocks to the content key in the chat history.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
Create an ImageTextToTextPipeline and pass the chat to it. For large models, setting device_map="auto" loads the model more quickly and automatically places it on the fastest available device. Setting dtype="auto" also helps save memory and improve speed.
import torch
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", dtype="auto")
out = pipe(text=messages, max_new_tokens=128)
print(out[0]['generated_text'][-1]['content'])
Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
Aside from the gradual descent from pirate-speak into modern American English (it is only a 3B model, after all), this is correct!
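Because the pipeline returns the full chat, you can keep the conversation going by reusing the returned message list, appending a new user turn, and calling the pipeline again. A short sketch (the follow-up question is just an example):

# Continue the conversation: reuse the returned chat and add a follow-up user turn.
chat = out[0]["generated_text"]
chat.append({"role": "user", "content": [{"type": "text", "text": "What breed might they be?"}]})
out = pipe(text=chat, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])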
Using apply_chat_template
Like text-only models, use the apply_chat_template() method to prepare the chat messages for multimodal models. This method handles the tokenization and formatting of the chat messages, including images and other media types. The resulting inputs are passed to the model for generation.
from transformers import AutoProcessor, AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", dtype="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
Pass messages to apply_chat_template() to tokenize the input content. Unlike text models, the output of apply_chat_template contains a pixel_values key with the preprocessed image data, in addition to the tokenized text.
processed_chat = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
print(list(processed_chat.keys()))
['input_ids', 'attention_mask', 'pixel_values', 'image_grid_thw']
Pass these inputs to generate().
out = model.generate(**processed_chat.to(model.device), max_new_tokens=128)
print(processor.decode(out[0]))
The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user.
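One common way to do this (a sketch, assuming the prompt length equals the number of input tokens) is to slice off the prompt before decoding.

# Decode only the newly generated tokens by slicing off the prompt.
prompt_length = processed_chat["input_ids"].shape[1]
generated_tokens = out[0][prompt_length:]
print(processor.decode(generated_tokens, skip_special_tokens=True))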
Video inputs
Some vision models also support video inputs. The message format is very similar to the format for image inputs.
- The content "type" should be "video" to indicate the content is a video.
- The video source can be a link to the video ("url") or a file path ("path"). Videos loaded from a URL can only be decoded with PyAV or Decord.
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if you've already preprocessed or decoded video frames in memory (for example, with OpenCV, Decord, or torchvision), so you don't need to save them to a file or host them at a URL.
Loading a video from "url" is only supported by the PyAV or Decord backends.
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]
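If the video file is stored locally, the message can point to it with a "path" key instead of "url". A minimal sketch with a placeholder file path:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},  # placeholder path, replace with a real file
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]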
Example: Passing decoded video objects
import numpy as np

# Randomly generated frames standing in for a decoded video: 16 frames of 224x224 RGB.
video_object1 = np.random.randint(0, 255, size=(16, 224, 224, 3), dtype=np.uint8)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_object1},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]
You can also use the existing load_video() function to load a video, edit it in memory, and pass it in the messages.
# Make sure a video backend library (pyav, decord, or torchvision) is available.
from transformers.video_utils import load_video
# load a video file in memory for testing
video_object2, _ = load_video(
    "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
)
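# Optional, illustrative in-memory edit: for example, keep every other frame
# to halve the frame count before building the chat messages.
video_object2 = video_object2[::2]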
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_object2},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]
Pass messages to apply_chat_template() to tokenize the input content. There are a few extra parameters to include in apply_chat_template() that control how the video is loaded and sampled.
The video_load_backend parameter selects the framework used to load the video. It supports PyAV, Decord, OpenCV, and torchvision. The example below uses Decord as the backend because it is a bit faster than PyAV.
The num_frames parameter controls how many frames to uniformly sample from the video. Each checkpoint has a maximum frame count it was pretrained with, and exceeding this count can significantly lower generation quality. It's important to choose a frame count that fits both the model capacity and your hardware resources. If num_frames isn't specified, the entire video is loaded without any frame sampling.
processed_chat = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=32,
    video_load_backend="decord",
)
print(processed_chat.keys())
These inputs are now ready to be used in generate().
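As in the image example, pass the processed inputs to generate() and decode the result. A short sketch that also trims the prompt tokens before decoding:

out = model.generate(**processed_chat.to(model.device), max_new_tokens=128)

# Decode only the newly generated tokens.
prompt_length = processed_chat["input_ids"].shape[1]
print(processor.decode(out[0][prompt_length:], skip_special_tokens=True))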