Update chat template to match the prompt format stated in the model card.

#176

While switching backends, we encountered a quite severe degradation in our Mixtral model's generation results.
Digging deeper into this issue, we found that the tokenizer relied on the HF model config and used the chat_template from there as well.
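
For reference, the template that gets applied can be inspected directly on the tokenizer. A minimal sketch, assuming the Mixtral instruct checkpoint is loaded from the hub:

from transformers import AutoTokenizer

# Load the tokenizer; the Jinja chat template ships with the tokenizer config
# in the model repo, and this is what apply_chat_template() uses.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Print the raw Jinja template pulled from the repo config.
print(tokenizer.chat_template)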

For the following input:

messages = [
    {"role": "user", "content":"Hello, how are you?"},
    {"role": "assistant", "content":"Good, how are you?"},
    {"role": "user", "content":"Very good!"}
]

conversation_string = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print(conversation_string)
# -> "<s>[INST] Hello, how are you? [/INST]Good, how are you?</s>[INST] Very good! [/INST]"

The model card states that this should be

<s> [INST] Hello, how are you? [/INST] Good, how are you?</s> [INST] Very good! [/INST]

Although the difference is very limited (only 2 spaces for a 3-message conversation), the difference in generation results is very large.
With the current implementation we often saw the model predict the <eos> token in unexpected places; since we are generating structured output, the returned structure was therefore frequently invalid. This change has resolved that issue, and the generated output is also of noticeably better quality in terms of matching the expected structure.
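
For illustration, here is a rough sketch of a Jinja template that reproduces the model-card format for the example above (reusing the tokenizer and messages defined earlier). The template actually shipped in this change may differ in details such as role-alternation checks:

# Hypothetical template for illustration only; the PR's template may differ.
MODEL_CARD_STYLE_TEMPLATE = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "{{ ' [INST] ' + message['content'] + ' [/INST]' }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ ' ' + message['content'] + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
)

tokenizer.chat_template = MODEL_CARD_STYLE_TEMPLATE
conversation_string = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(conversation_string)
# -> "<s> [INST] Hello, how are you? [/INST] Good, how are you?</s> [INST] Very good! [/INST]"

The whitespace placement around [INST], [/INST], and </s> is exactly what differs between the old output and the model-card format shown above.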

Related to the following issue: https://github.com/vllm-project/vllm/issues/2464

Just wanted to say that this seems to have fixed a lot of the issues I was having with my code.

