Chat template
Exciting model!
What is the chat template format?
@ehartford Thank you, it uses ChatML.
Do you plan to add the ChatML special tokens to the list of tokens? Or replace <s>/</s>?
https://huggingface.co/senseable/WestLake-7B-v2/blob/main/tokenizer_config.json
yes - there are no tokens for <|im_start|> and <|im_end|>
to get this working properly, you will need to retrain it with those tokens added, and <|im_end|> designated as the EOS token
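Roughly, that setup looks like this with the standard transformers APIs (a minimal sketch, untested against this exact repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("senseable/WestLake-7B-v2")
model = AutoModelForCausalLM.from_pretrained("senseable/WestLake-7B-v2")

# Register the ChatML markers as real special tokens (they get new token ids).
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)

# Grow the embedding matrix to cover the new ids before retraining.
model.resize_token_embeddings(len(tokenizer))

# Designate <|im_end|> as the EOS token so generation stops on it.
tokenizer.eos_token = "<|im_end|>"
model.config.eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
```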
if you like I can help you
As theodotus implies, there are two ways:
1: add a new token for <|im_end|> (with a new token id)
2: replace the </s> (token id 2) mapping with <|im_end|>
Method 1 is easier (it is essentially the sketch above). Method 2 is more difficult, but more compatible with merging and with clients that are hardcoded to use token id 2 as EOS; a rough sketch of it follows below.
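Something like this, assuming a fast tokenizer.json with a flat model.vocab map (the added_tokens section and tokenizer_config.json would need the same rename, which I'm omitting here):

```python
import json

with open("tokenizer.json") as f:
    tok = json.load(f)

# Rename the </s> entry so the string changes but token id 2 is kept.
vocab = tok["model"]["vocab"]
vocab["<|im_end|>"] = vocab.pop("</s>")

with open("tokenizer.json", "w") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```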
The problem is that right now the model is trained to generate the string <|im_end|> rather than the EOS token, and it does so imperfectly (sometimes it generates <|im_end without the closing |>, for instance).
Looking forward to it! (and to finetuning it with Samantha!)
This model seems to work great using the config.json and tokenizer_config.json parameters from this one: https://huggingface.co/NurtureAI/OpenHermes-2.5-Mistral-7B-16k/tree/main
@senseable in a closed conversation, you mentioned the chat template was ChatML:
"chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
However, from what we have seen in practice, the above is the Zephyr prompt format with the addition of <|im_end|>. The correct format for ChatML is:
"chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
Which one did you actually use for training?
I am working on updating the config JSON files to fix the EOS problem, and I would like to make sure I have the correct format.
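For what it's worth, the edit itself is small; this is a sketch of what I mean (assuming method 1, so <|im_end|> lands at a new id; the 32000 below is purely illustrative):

```python
import json

IM_END_ID = 32000  # whatever id <|im_end|> actually ends up with

# Point the model config's EOS at <|im_end|>.
with open("config.json") as f:
    cfg = json.load(f)
cfg["eos_token_id"] = IM_END_ID
with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)

# Point the tokenizer config's EOS string at <|im_end|> as well.
with open("tokenizer_config.json") as f:
    tok_cfg = json.load(f)
tok_cfg["eos_token"] = "<|im_end|>"
with open("tokenizer_config.json", "w") as f:
    json.dump(tok_cfg, f, indent=2)
```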
@froggeric I trained using <|im_start|> and <|im_end|>, but oddly it probably performs better with Alpaca.
I have done a few tests now using a few different prompt formats (ChatML, Zephyr, Alpaca, Mistral Instruct). I find that using Zephyr instead of ChatML actually often performs better, and is not affected by the <|im_end|> problem. Alpaca works OK too, but has a few problems with tokens inserted in the conversation. The best results, though, come from Mistral Instruct, which is not surprising as it is the underlying foundation; however, it suffers the most from token insertion.
Why don't you stick to the Mistral Instruct format for the v3 training? I think the best results should be achieved when using the same format as the one used for the base model.
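For reference, the Mistral Instruct prompt shape (per the base instruct model's card; the placeholders are mine) is roughly:

```
<s>[INST] {user message} [/INST] {assistant reply}</s>[INST] {follow-up message} [/INST]
```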