Update Chat Template - V3 Tokenizer

#37
Mistral AI_ org

Using mistral-common as ground truth, this PR updates the chat template to match the output of MistralTokenizer.v3.
The new chat template should format a conversation as follows:

<s>[INST] Hello[/INST] Hi there!</s>[INST] How are you?[/INST] Fine and you?</s>[INST] Fine thank you.[/INST] 

For reference, here is the output from MistralTokenizer.v3:

<s>[INST]▁Hello[/INST]▁Hi▁there!</s>[INST]▁How▁are▁you?[/INST]▁Fine▁and▁you?</s>[INST]▁Fine▁thank▁you.[/INST]
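As a rough illustration of the target format, here is a minimal pure-Python sketch that reproduces the string above for alternating user/assistant turns. The helper name `format_v3` and its parameters are hypothetical and are not part of mistral-common or transformers; the sketch only assumes that the `▁` in the reference output stands for a space.

```python
def format_v3(messages, bos="<s>", eos="</s>"):
    """Hypothetical helper: render alternating user/assistant turns
    in the spacing shown in the reference output above."""
    parts = [bos]
    for i, msg in enumerate(messages):
        # Roles are expected to alternate, starting with the user.
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            raise ValueError("Conversation roles must alternate user/assistant/...")
        if msg["role"] == "user":
            # One space after [INST], none before [/INST].
            parts.append("[INST] " + msg["content"] + "[/INST]")
        else:
            # One space before the reply, EOS directly after it.
            parts.append(" " + msg["content"] + eos)
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

rendered = format_v3(messages)
reference = (
    "<s>[INST]▁Hello[/INST]▁Hi▁there!</s>[INST]▁How▁are▁you?[/INST]"
    "▁Fine▁and▁you?</s>[INST]▁Fine▁thank▁you.[/INST]"
)
# Compare against the reference with the "▁" markers mapped to spaces.
print(rendered == reference.replace("▁", " "))  # True
```

A check like this could serve as a quick regression test when editing the Jinja template, with the mistral-common output as the expected value.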

The template should be consistent with the logic in Mixtral-8x22B-Instruct-v0.1, which declares support for chat and tool use as follows:

{
  "chat_template": [
  {
    "name": "default",
    "template": "{{bos_token}}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ ' [INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + ' ' + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"
  },
  {
    "name": "tool_use",
    "template": "{{bos_token}}{% set user_messages = messages | selectattr('role', 'equalto', 'user') | list %}{% for message in messages %}{% if message['role'] == 'user' %}{% if message == user_messages[-1] %}{% if tools %}{{'[AVAILABLE_TOOLS]'+ tools|string + '[/AVAILABLE_TOOLS]'}}{% endif %}{{ '[INST]' + message['content'] + '[/INST]' }}{% else %}{{ '[INST]' + message['content'] + '[/INST]' }}{% endif %}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + ' ' + eos_token}}{% elif message['role'] == 'tool_results' %}{{'[TOOL_RESULTS]' + message['content']|string + '[/TOOL_RESULTS]'}}{% elif message['role'] == 'tool_calls' %}{{'[TOOL_CALLS]' + message['content']|string + eos_token}}{% endif %}{% endfor %}"
  }
  ],
  "clean_up_tokenization_spaces": false
}
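To make the branching in the `tool_use` template easier to follow, here is a hedged pure-Python mirror of its logic. The function name `render_tool_use` is hypothetical, and `str(...)` is used as an approximation of Jinja's `|string` filter; this is a reading of the template text quoted above, not the output of transformers or mistral-common.

```python
def render_tool_use(messages, tools=None, bos="<s>", eos="</s>"):
    """Hypothetical mirror of the quoted "tool_use" Jinja template."""
    user_messages = [m for m in messages if m["role"] == "user"]
    out = [bos]
    for m in messages:
        role, content = m["role"], m["content"]
        if role == "user":
            # The template compares messages by value, so any user message
            # equal to the last one gets the tool list prepended.
            if tools and m == user_messages[-1]:
                out.append("[AVAILABLE_TOOLS]" + str(tools) + "[/AVAILABLE_TOOLS]")
            out.append("[INST]" + content + "[/INST]")
        elif role == "assistant":
            out.append(" " + content + " " + eos)
        elif role == "tool_results":
            out.append("[TOOL_RESULTS]" + str(content) + "[/TOOL_RESULTS]")
        elif role == "tool_calls":
            out.append("[TOOL_CALLS]" + str(content) + eos)
        # Other roles are silently ignored, as in the quoted template.
    return "".join(out)


messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
]
print(render_tool_use(messages))
print(render_tool_use(messages, tools=["get_weather"]))
```

Under this reading, the template emits no space after `[INST]` and an extra space before the end-of-sequence token, so its spacing differs from the MistralTokenizer.v3 reference string given earlier.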
Mistral AI_ org

Actually, I'm going to close this PR because the issue goes deeper than I thought. I will soon open a PR on each repo with a clean chat template and explanations.

pandora-s changed pull request status to closed
