apply_chat_template result for Mistral does not strictly match the template on its website

#53
by Annorita - opened

(This is a cross-post of this issue)


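from transformers import AutoTokenizer

# Assumption: the original post does not show how the tokenizer was loaded;
# this checkpoint name is a guess based on the model being discussed.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

chat = [
  {"role": "user", "content": "USER_INSTRUCTION_1"},
  {"role": "assistant", "content": "RESPONSE_1"},
  {"role": "user", "content": "USER_INSTRUCTION_2"},
  {"role": "assistant", "content": "RESPONSE_2"},
]
# Render the chat with the tokenizer's built-in template, as a string.
res_apply_chat_template = tokenizer.apply_chat_template(chat, tokenize=False)
# Format copied from the Mistral website.
res_mistral_website = '<s>[INST] USER_INSTRUCTION_1 [/INST] RESPONSE_1</s>[INST] USER_INSTRUCTION_2 [/INST] RESPONSE_2</s>'
print(res_apply_chat_template)
print(res_mistral_website)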

The result is:

'<s>[INST] USER_INSTRUCTION_1 [/INST]RESPONSE_1</s> [INST] USER_INSTRUCTION_2 [/INST]RESPONSE_2</s> '
'<s>[INST] USER_INSTRUCTION_1 [/INST] RESPONSE_1</s>[INST] USER_INSTRUCTION_2 [/INST] RESPONSE_2</s>'

There are two main differences:

  1. According to the Mistral 7B website: https://docs.mistral.ai/usage/guardrailing/#appendix
    there is always a space after [INST] and [/INST], but the result of apply_chat_template has no space after [/INST].

  2. In res_apply_chat_template, there is an additional space after </s> at the end of each turn.
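
To double-check that the two strings differ only in whitespace, here is a minimal sketch using Python's standard difflib. The helper name whitespace_diffs is made up for illustration; the two strings are copied from the output above.

```python
import difflib

# The two strings exactly as printed above.
res_apply_chat_template = "<s>[INST] USER_INSTRUCTION_1 [/INST]RESPONSE_1</s> [INST] USER_INSTRUCTION_2 [/INST]RESPONSE_2</s> "
res_mistral_website = "<s>[INST] USER_INSTRUCTION_1 [/INST] RESPONSE_1</s>[INST] USER_INSTRUCTION_2 [/INST] RESPONSE_2</s>"


def whitespace_diffs(a, b):
    """Hypothetical helper: list every span where a and b differ.

    Each entry is (tag, position_in_a, text_only_in_a, text_only_in_b).
    """
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, i1, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


diffs = whitespace_diffs(res_apply_chat_template, res_mistral_website)
for tag, pos, only_in_a, only_in_b in diffs:
    print(tag, pos, repr(only_in_a), repr(only_in_b))
```

Every reported span should be a single space: one missing after each [/INST] and one extra after each </s>.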

I also encoded the two strings and decoded them back token by token. The results show that the response text is split into different tokens depending on whether there is a space after [/INST]:

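# Assumption: the original post does not show the encoding step.
# add_special_tokens=False avoids adding a second <s>, since both strings already start with one.
ids_apply_chat_template = tokenizer.encode(res_apply_chat_template, add_special_tokens=False)
ids_mistral_website = tokenizer.encode(res_mistral_website, add_special_tokens=False)

# Decode each token id on its own to see the token boundaries.
decoded_apply_chat_template = [tokenizer.decode(a) for a in ids_apply_chat_template]
decoded_mistral_website = [tokenizer.decode(b) for b in ids_mistral_website]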

#decoded_apply_chat_template
['<s>', '[', 'INST', ']', 'US', 'ER', '_', 'IN', 'STRU', 'CTION', '_', '1', '[', '/', 'INST', ']', 'RE', 'SP', 'ON', 'SE', '_', '1', '</s>', '', '[', 'INST', ']', 'US', 'ER', '_', 'IN', 'STRU', 'CTION', '_', '2', '[', '/', 'INST', ']', 'RE', 'SP', 'ON', 'SE', '_', '2', '</s>', ' ']

#decoded_mistral_website
['<s>', '[', 'INST', ']', 'US', 'ER', '_', 'IN', 'STRU', 'CTION', '_', '1', '[', '/', 'INST', ']', 'RES', 'P', 'ON', 'SE', '_', '1', '</s>', '[', 'INST', ']', 'US', 'ER', '_', 'IN', 'STRU', 'CTION', '_', '2', '[', '/', 'INST', ']', 'RES', 'P', 'ON', 'SE', '_', '2', '</s>']
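
To pinpoint where the two tokenizations first diverge, a small sketch like the following can help. It only uses the first turn of each decoded list above (copied by hand, so no tokenizer is needed), and first_mismatch is a hypothetical helper name.

```python
# First turn of each decoded token list, copied from the output above.
apply_chat_template_tokens = [
    '<s>', '[', 'INST', ']', 'US', 'ER', '_', 'IN', 'STRU', 'CTION', '_', '1',
    '[', '/', 'INST', ']', 'RE', 'SP', 'ON', 'SE', '_', '1', '</s>',
]
mistral_website_tokens = [
    '<s>', '[', 'INST', ']', 'US', 'ER', '_', 'IN', 'STRU', 'CTION', '_', '1',
    '[', '/', 'INST', ']', 'RES', 'P', 'ON', 'SE', '_', '1', '</s>',
]


def first_mismatch(a, b):
    """Hypothetical helper: index of the first differing token, or None if none differ."""
    for index, (token_a, token_b) in enumerate(zip(a, b)):
        if token_a != token_b:
            return index
    # No difference in the shared prefix; a length difference counts as a mismatch.
    return None if len(a) == len(b) else min(len(a), len(b))


idx = first_mismatch(apply_chat_template_tokens, mistral_website_tokens)
print(idx, apply_chat_template_tokens[idx:idx + 2], mistral_website_tokens[idx:idx + 2])
```

The first difference falls right after the closing ']' of [/INST], where RESPONSE_1 starts: 'RE', 'SP' without the space versus 'RES', 'P' with it.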

I guess either way is okay, but wouldn't it be better to align with how it was done during fine-tuning?
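
For reference, here is a rough sketch of a manual formatter that reproduces the website layout for the chat above. The function name format_like_website is made up, and this is not the tokenizer's actual chat template; it assumes that the website format is the intended one and that turns strictly alternate between user and assistant.

```python
def format_like_website(chat, bos="<s>", eos="</s>"):
    """Hypothetical helper: join chat turns using the spacing shown on the Mistral website."""
    parts = [bos]
    for message in chat:
        if message["role"] == "user":
            # Space after [INST] and before [/INST].
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            # Space after [/INST], and no space after </s>.
            parts.append(f" {message['content']}{eos}")
        else:
            raise ValueError(f"Unsupported role: {message['role']}")
    return "".join(parts)


chat = [
    {"role": "user", "content": "USER_INSTRUCTION_1"},
    {"role": "assistant", "content": "RESPONSE_1"},
    {"role": "user", "content": "USER_INSTRUCTION_2"},
    {"role": "assistant", "content": "RESPONSE_2"},
]
formatted = format_like_website(chat)
print(formatted)
```

This only covers the string layout. Whether a literal <s> or </s> inside a string is tokenized the same way as the real special tokens is a separate question that this sketch does not address.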

cc @teven @LucileSaulnier to this one! I wrote the chat template based on the example in the model card, but it seems like that might differ from the example on the website. Can you confirm which template is correct?
