Transformers documentation

Templates for Chat Models

You are viewing v4.35.2 version. A newer version v4.57.1 is available.

Introduction

An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string of text (as is the case with a standard language model), the model instead continues a conversation that consists of one or more messages, each of which includes a role as well as message text.

Most commonly, these roles are “user” for messages sent by the user, and “assistant” for messages sent by the model. Some models also support a “system” role. System messages are usually sent at the beginning of the conversation and include directives about how the model should behave in the subsequent chat.

All language models, including models fine-tuned for chat, operate on linear sequences of tokens and do not intrinsically have special handling for roles. This means that role information is usually injected by adding control tokens between messages, to indicate both the message boundary and the relevant roles.

Unfortunately, there isn’t (yet!) a standard for which tokens to use, and so different models have been trained with wildly different formatting and control tokens for chat. This can be a real problem for users - if you use the wrong format, then the model will be confused by your input, and your performance will be a lot worse than it should be. This is the problem that chat templates aim to resolve.

Chat conversations are typically represented as a list of dictionaries, where each dictionary contains role and content keys, and represents a single chat message. Chat templates are strings containing a Jinja template that specifies how to format a conversation for a given model into a single tokenizable sequence. By storing this information with the tokenizer, we can ensure that models get input data in the format they expect.

Let’s make this concrete with a quick example using the BlenderBot model. BlenderBot has an extremely simple default template, which mostly just adds whitespace between rounds of dialogue:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> chat = [
...   {"role": "user", "content": "Hello, how are you?"},
...   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
...   {"role": "user", "content": "I'd like to show off how chat templating works!"},
... ]

>>> tokenizer.apply_chat_template(chat, tokenize=False)
" Hello, how are you?  I'm doing great. How can I help you today?   I'd like to show off how chat templating works!</s>"

Notice how the entire chat is condensed into a single string. If we use tokenize=True, which is the default setting, that string will also be tokenized for us. To see a more complex template in action, though, let’s use the meta-llama/Llama-2-7b-chat-hf model. Note that this model has gated access, so you will have to request access on the repo if you want to run this code yourself:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

>>> chat = [
...   {"role": "user", "content": "Hello, how are you?"},
...   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
...   {"role": "user", "content": "I'd like to show off how chat templating works!"},
... ]

>>> tokenizer.use_default_system_prompt = False
>>> tokenizer.apply_chat_template(chat, tokenize=False)
"<s>[INST] Hello, how are you? [/INST] I'm doing great. How can I help you today? </s><s>[INST] I'd like to show off how chat templating works! [/INST]"

Note that this time, the tokenizer has added the control tokens [INST] and [/INST] to indicate the start and end of user messages (but not assistant messages!).

How do chat templates work?

The chat template for a model is stored on the tokenizer.chat_template attribute. If no chat template is set, the default template for that model class is used instead. Let’s take a look at the template for BlenderBot:


>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> tokenizer.default_chat_template
"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ '  ' }}{% endif %}{% endfor %}{{ eos_token }}"

That’s kind of intimidating. Let’s add some newlines and indentation to make it more readable. Note that the first newline after each block, as well as any whitespace preceding a block on the same line, is ignored by default, using the Jinja trim_blocks and lstrip_blocks flags. However, be cautious: these flags apply to {% ... %} block tags, so whitespace around {{ ... }} expressions and spaces between blocks on the same line can still reach the output. We strongly recommend checking that your template isn’t printing extra spaces where it shouldn’t be!

{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ ' ' }}
    {% endif %}
    {{ message['content'] }}
    {% if not loop.last %}
        {{ '  ' }}
    {% endif %}
{% endfor %}
{{ eos_token }}
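
To check how much whitespace a template really emits, it can help to render it outside the tokenizer. The sketch below assumes the jinja2 package is installed and that an environment created with trim_blocks=True and lstrip_blocks=True is a reasonable approximation of the one the library uses; the render helper is our own and not a Transformers function. It renders the compact template and the indented version side by side, which suggests that the indented form is best treated as a reading aid: the indentation and newlines around {{ ... }} expressions end up in the output.

```python
from jinja2 import Environment

# Assumption: these two flags approximate how the library compiles templates.
env = Environment(trim_blocks=True, lstrip_blocks=True)

# The compact BlenderBot template, split across Python string literals only
# for readability; the pieces are joined with no whitespace in between.
compact = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}"
    "{{ message['content'] }}"
    "{% if not loop.last %}{{ '  ' }}{% endif %}"
    "{% endfor %}{{ eos_token }}"
)

# The same template with newlines and indentation added.
indented = """{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ ' ' }}
    {% endif %}
    {{ message['content'] }}
    {% if not loop.last %}
        {{ '  ' }}
    {% endif %}
{% endfor %}
{{ eos_token }}
"""


def render(template, messages, eos_token="</s>"):
    """Hypothetical helper: render a template string with the chat variables."""
    return env.from_string(template).render(messages=messages, eos_token=eos_token)


chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]

compact_out = render(compact, chat)
indented_out = render(indented, chat)

print(repr(compact_out))   # ' Hi  Hello!</s>'
print(repr(indented_out))  # also contains newlines and indentation
```

Comparing the two repr() outputs is a quick way to spot stray whitespace before a template is uploaded.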

If you’ve never seen one of these before, this is a Jinja template. Jinja is a templating language that allows you to write simple code that generates text. In many ways, the code and syntax resembles Python. In pure Python, this template would look something like this:

for idx, message in enumerate(messages):
    if message['role'] == 'user':
        print(' ', end='')  # end='' stops print from adding its own newline
    print(message['content'], end='')
    if idx != len(messages) - 1:  # True for every message except the last one
        print('  ', end='')
print(eos_token)

Effectively, the template does three things:

For each message, if the message is a user message, add a blank space before it, otherwise print nothing.
Add the message content.
If the message is not the last message, add two spaces after it. After the final message, print the EOS token.
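
The steps above can also be written as a small function that returns the formatted string, which makes it easy to compare against the tokenizer output shown earlier. This is only a sketch for checking the logic by hand: format_blenderbot_style is a hypothetical helper, not part of Transformers, and the "</s>" default is taken from the example output in the introduction.

```python
def format_blenderbot_style(messages, eos_token="</s>"):
    """Plain-Python mirror of the BlenderBot default template (illustrative only)."""
    parts = []
    for idx, message in enumerate(messages):
        if message["role"] == "user":
            parts.append(" ")   # user messages get one leading space
        parts.append(message["content"])
        if idx != len(messages) - 1:
            parts.append("  ")  # two spaces after every message except the last
    parts.append(eos_token)     # EOS once, after the final message
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

formatted = format_blenderbot_style(chat)
print(repr(formatted))
```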

This is a pretty simple template - it doesn’t add any control tokens, and it doesn’t support “system” messages, which are a common way to give the model directives about how it should behave in the subsequent conversation. But Jinja gives you a lot of flexibility to do those things! Let’s see a Jinja template that can format inputs similarly to the way LLaMA formats them. Note that the real LLaMA template includes handling for default system messages and slightly different system message handling in general, so don’t use this one in your actual code!

{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ bos_token + '[INST] ' + message['content'] + ' [/INST]' }}
    {% elif message['role'] == 'system' %}
        {{ '<<SYS>>\n' + message['content'] + '\n<</SYS>>\n\n' }}
    {% elif message['role'] == 'assistant' %}
        {{ ' '  + message['content'] + ' ' + eos_token }}
    {% endif %}
{% endfor %}

Hopefully if you stare at this for a little bit you can see what this template is doing - it adds specific tokens based on the “role” of each message, which represents who sent it. User, assistant and system messages are clearly distinguishable to the model because of the tokens they’re wrapped in.
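
If the Jinja syntax is still hard to follow, the same role dispatch can be sketched in plain Python. The function below is a hypothetical helper written for this explanation, not the real LLaMA formatting code; the "<s>" and "</s>" defaults are assumptions taken from the example output in the introduction.

```python
def format_llama_style(messages, bos_token="<s>", eos_token="</s>"):
    """Plain-Python mirror of the simplified LLaMA-like template (illustrative only)."""
    out = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            out.append(bos_token + "[INST] " + content + " [/INST]")
        elif role == "system":
            out.append("<<SYS>>\n" + content + "\n<</SYS>>\n\n")
        elif role == "assistant":
            out.append(" " + content + " " + eos_token)
        # Like the template, any other role is silently skipped.
    return "".join(out)


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

llama_formatted = format_llama_style(chat)
print(llama_formatted)
```

For a chat without a system message, this produces the same string as the Llama-2 example in the introduction.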

How do I create a chat template?

Simple: just write a Jinja template and set tokenizer.chat_template. You may find it easier to start with an existing template from another model and simply edit it for your needs! For example, we could take the LLaMA template above and add “[ASST]” and “[/ASST]” to assistant messages (this version also strips surrounding whitespace from user and system messages):

{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }}
    {% elif message['role'] == 'system' %}
        {{ '<<SYS>>\n' + message['content'].strip() + '\n<</SYS>>\n\n' }}
    {% elif message['role'] == 'assistant' %}
        {{ '[ASST] '  + message['content'] + ' [/ASST]' + eos_token }}
    {% endif %}
{% endfor %}

Now, simply set the tokenizer.chat_template attribute. Next time you use apply_chat_template(), it will use your new template! This attribute will be saved in the tokenizer_config.json file, so you can use push_to_hub() to upload your new template to the Hub and make sure everyone’s using the right template for your model!

template = tokenizer.chat_template
template = template.replace("SYS", "SYSTEM")  # Change the system token
tokenizer.chat_template = template  # Set the new template
tokenizer.push_to_hub("model_name")  # Upload your new template to the Hub!
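
Because a chat template is just a string, it survives a JSON round trip unchanged, including its quotes and escaped newlines. The following is a minimal sketch of that idea using only the standard library. It does not call Transformers, and it assumes the value is stored under a chat_template key in tokenizer_config.json; the temporary directory stands in for a real model folder.

```python
import json
import os
import tempfile

# A ChatML-style template; '\\n' in Python source is the two-character
# escape \n inside the Jinja string literal.
chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n' }}"
    "{% endfor %}"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "tokenizer_config.json")

    # Assumption: the template lives under the "chat_template" key.
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump({"chat_template": chat_template}, f, indent=2)

    with open(config_path, encoding="utf-8") as f:
        restored = json.load(f)["chat_template"]

print(restored == chat_template)  # True
```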

The apply_chat_template() method, which uses your chat template, is called by the ConversationalPipeline class, so once you set the correct chat template, your model will automatically become compatible with ConversationalPipeline.

What are “default” templates?

Before the introduction of chat templates, chat handling was hardcoded at the model class level. For backwards compatibility, we have retained this class-specific handling as default templates, also set at the class level. If a model does not have a chat template set, but there is a default template for its model class, the ConversationalPipeline class and methods like apply_chat_template will use the class template instead. You can find out what the default template for your tokenizer is by checking the tokenizer.default_chat_template attribute.

This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. Even when the class template is appropriate for your model, we strongly recommend overriding the default template by setting the chat_template attribute explicitly to make it clear to users that your model has been correctly configured for chat, and to future-proof in case the default templates are ever altered or deprecated.
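
The lookup order described above can be sketched as follows. TokenizerSketch and resolve_template are hypothetical names invented for this illustration and only model the fallback rule; they are not the library's classes or methods.

```python
class TokenizerSketch:
    """Hypothetical stand-in for a tokenizer; only the template lookup is modelled."""

    # Class-level default, shared by every instance of this tokenizer class.
    default_chat_template = (
        "{% for message in messages %}{{ message['content'] }}{% endfor %}"
    )

    def __init__(self, chat_template=None):
        # Instance-level template, e.g. one loaded from the model repository.
        self.chat_template = chat_template

    def resolve_template(self):
        # An explicitly set template wins; otherwise fall back to the class default.
        if self.chat_template is not None:
            return self.chat_template
        return self.default_chat_template


legacy = TokenizerSketch()
configured = TokenizerSketch(chat_template="{{ messages[0]['content'] }}")

print(legacy.resolve_template() == TokenizerSketch.default_chat_template)  # True
print(configured.resolve_template())  # {{ messages[0]['content'] }}
```

Setting the attribute explicitly, as in the second instance, means the result no longer depends on whatever the class default happens to be.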

What template should I use?

When setting the template for a model that’s already been trained for chat, you should ensure that the template exactly matches the message formatting that the model saw during training, or else you will probably experience performance degradation. This is true even if you’re training the model further - you will probably get the best performance if you keep the chat tokens constant. This is analogous to tokenization - you generally get the best performance for inference or fine-tuning when you precisely match the tokenization used during training.

If you’re training a model from scratch, or fine-tuning a base language model for chat, on the other hand, you have a lot of freedom to choose an appropriate template! LLMs are smart enough to learn to handle lots of different input formats. Our default template for models that don’t have a class-specific template follows the ChatML format, and this is a good, flexible choice for many use-cases. It looks like this:

{% for message in messages %}
    {{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}
{% endfor %}

If you like this one, here it is in one-liner form, ready to copy into your code. The one-liner also includes handy support for “generation prompts” - see the next section for more!

tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

This template wraps each message in <|im_start|> and <|im_end|> tokens, and simply writes the role as a string, which allows for flexibility in the roles you train with. The output looks like this:

<|im_start|>system
You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
I'm doing great!<|im_end|>

The “user”, “system” and “assistant” roles are the standard for chat, and we recommend using them when it makes sense, particularly if you want your model to operate well with ConversationalPipeline. However, you are not limited to these roles - templating is extremely flexible, and any string can be a role.

What are “generation prompts”?

You may notice that the apply_chat_template method has an add_generation_prompt argument. This argument tells the template to add tokens that indicate the start of a bot response. For example, consider the following chat:

messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"}
]

Here’s what this will look like without a generation prompt, using the ChatML template we described above:

>>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""

And here’s what it looks like with a generation prompt:

>>> tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""

Note that this time, we’ve added the tokens that indicate the start of a bot response. This ensures that when the model generates text it will write a bot response instead of doing something unexpected, like continuing the user’s message. Remember, chat models are still just language models - they’re trained to continue text, and chat is just a special kind of text to them! You need to guide them with the appropriate control tokens so they know what they’re supposed to be doing.

Not all models require generation prompts. Some models, like BlenderBot and LLaMA, don’t have any special tokens before bot responses. In these cases, the add_generation_prompt argument will have no effect. The exact effect that add_generation_prompt has will depend on the template being used.
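
To see that the generation prompt is the only thing the flag changes for the ChatML template, the one-liner can be rendered directly. This sketch assumes the jinja2 package is installed and uses a plain Environment as an approximation of what the tokenizer does internally; no model or tokenizer is downloaded.

```python
from jinja2 import Environment

# Assumption: these flags approximate how the library compiles templates.
env = Environment(trim_blocks=True, lstrip_blocks=True)

# The ChatML one-liner from above; '\\n' in Python source becomes the
# two-character escape \n inside each Jinja string literal.
chatml_template = (
    "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}"
    "{% for message in messages %}"
    "{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}"
)
compiled = env.from_string(chatml_template)

messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

without_prompt = compiled.render(messages=messages, add_generation_prompt=False)
with_prompt = compiled.render(messages=messages, add_generation_prompt=True)

# Everything past the shared prefix is the generation prompt.
print(repr(with_prompt[len(without_prompt):]))  # '<|im_start|>assistant\n'
```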

I want to use chat templates! How should I get started?

If you have any chat models, you should set their tokenizer.chat_template attribute and test it using apply_chat_template(). This applies even if you’re not the model owner - if you’re using a model with an empty chat template, or one that’s still using the default class template, please open a pull request to the model repository so that this attribute can be set properly!

Once the attribute is set, that’s it, you’re done! tokenizer.apply_chat_template will now work correctly for that model, which means it is also automatically supported in places like ConversationalPipeline!

By ensuring that models have this attribute, we can make sure that the whole community gets to use the full power of open-source models. Formatting mismatches have been haunting the field and silently harming performance for too long - it’s time to put an end to them!

Template writing tips

If you’re unfamiliar with Jinja, we generally find that the easiest way to write a chat template is to first write a short Python script that formats messages the way you want, and then convert that script into a template.

Remember that the template handler will receive the conversation history as a variable called messages. Each message is a dictionary with two keys, role and content. You will be able to access messages in your template just like you can in Python, which means you can loop over it with {% for message in messages %} or access individual messages with, for example, {{ messages[0] }}.

You can also use the following tips to convert your code to Jinja:

For loops

For loops in Jinja look like this:

{% for message in messages %}
{{ message['content'] }}
{% endfor %}

Note that whatever’s inside the {{ expression block }} will be printed to the output. You can use operators like + to combine strings inside expression blocks.

If statements

If statements in Jinja look like this:

{% if message['role'] == 'user' %}
{{ message['content'] }}
{% endif %}

Note that where Python uses indentation to mark the beginnings and ends of for and if blocks, Jinja requires you to explicitly end them with {% endfor %} and {% endif %}.

Special variables

Inside your template, you will have access to the list of messages, but you can also access several other special variables. These include special tokens like bos_token and eos_token, as well as the add_generation_prompt variable that we discussed above. You can also use the loop variable to access information about the current loop iteration, for example using {% if loop.last %} to check if the current message is the last message in the conversation. Here’s an example that puts these ideas together to add a generation prompt at the end of the conversation if add_generation_prompt is True. Because loop only exists inside a for block, this snippet must sit inside the {% for message in messages %} loop:

{% if loop.last and add_generation_prompt %}
{{ bos_token + 'Assistant:\n' }}
{% endif %}
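
Here is a hedged, self-contained sketch of that snippet inside a complete loop. It assumes the jinja2 package is installed and passes the special variables as ordinary render arguments, which is our approximation of what the tokenizer does; the "role: content" message format and the "<s>" value for bos_token are made up for the example.

```python
from jinja2 import Environment

# Assumption: these flags approximate how the library compiles templates.
env = Environment(trim_blocks=True, lstrip_blocks=True)

template = (
    "{% for message in messages %}"
    "{{ message['role'] + ': ' + message['content'] + '\\n' }}"
    # loop.last is only available here, inside the for block.
    "{% if loop.last and add_generation_prompt %}"
    "{{ bos_token + 'Assistant:\\n' }}"
    "{% endif %}"
    "{% endfor %}"
)

rendered = env.from_string(template).render(
    messages=[{"role": "user", "content": "Hi there!"}],
    add_generation_prompt=True,
    bos_token="<s>",  # hypothetical value; a real tokenizer supplies its own
)

print(repr(rendered))  # 'user: Hi there!\n<s>Assistant:\n'
```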

Notes on whitespace

As much as possible, we’ve tried to get Jinja to ignore whitespace outside of {{ expressions }}. However, be aware that Jinja is a general-purpose templating engine, and it may treat whitespace between blocks on the same line as significant and print it to the output. We strongly recommend checking that your template isn’t printing extra spaces where it shouldn’t be before you upload it!
