Chat template
Exciting model!
What is the chat template format?
@ehartford Thank you, it uses ChatML.
Do you plan to add the ChatML special tokens to the list of tokens? Or replace <s>/</s>?
https://huggingface.co/senseable/WestLake-7B-v2/blob/main/tokenizer_config.json
yes - there are no tokens for <|im_start|> and <|im_end|>
to get this working properly, you will need to retrain it with those tokens added, and <|im_end|> designated as the EOS token
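Roughly, that setup looks like this with the standard transformers APIs (a minimal sketch, untested against this exact repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("senseable/WestLake-7B-v2")
model = AutoModelForCausalLM.from_pretrained("senseable/WestLake-7B-v2")

# Register the ChatML markers as real special tokens (they get new token ids).
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)

# Grow the embedding matrix to cover the new ids before retraining.
model.resize_token_embeddings(len(tokenizer))

# Designate <|im_end|> as the EOS token so generation stops on it.
tokenizer.eos_token = "<|im_end|>"
model.config.eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
```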
if you like I can help you
As theodotus implies, there are two ways:
1: add a new token for <|im_end|> (with a new token id)
2: replace the </s> (token id 2) mapping with <|im_end|>
Method 1 is easier (it is essentially the sketch above). Method 2 is more difficult, but more compatible with merging and with clients that are hardcoded to use token id 2 as EOS; a rough sketch of it follows below.
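Something like this, assuming a fast tokenizer.json with a flat model.vocab map (the added_tokens section and tokenizer_config.json would need the same rename, which I'm omitting here):

```python
import json

with open("tokenizer.json") as f:
    tok = json.load(f)

# Rename the </s> entry so the string changes but token id 2 is kept.
vocab = tok["model"]["vocab"]
vocab["<|im_end|>"] = vocab.pop("</s>")

with open("tokenizer.json", "w") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```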
The problem is that right now the model is trained to generate the string <|im_end|> rather than the EOS token, and it does so imperfectly (sometimes it generates <|im_end without the closing |>, for instance).
Looking forward to it! (and to finetuning it with Samantha!)
This model seems to work great using the config.json and tokenizer_config.json parameters from this one: https://huggingface.co/NurtureAI/OpenHermes-2.5-Mistral-7B-16k/tree/main
@senseable in a closed conversation, you mentioned the chat template was ChatML:
"chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
However, from what we have seen in practice, the above is the Zephyr prompt format with the addition of <|im_end|>. The correct format for ChatML is:
"chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
Which one did you actually use for training?
I am working on updating the config JSON files to fix the EOS problem, and I would like to make sure I have the correct format.
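For what it's worth, the edit itself is small; this is a sketch of what I mean (assuming method 1, so <|im_end|> lands at a new id; the 32000 below is purely illustrative):

```python
import json

IM_END_ID = 32000  # whatever id <|im_end|> actually ends up with

# Point the model config's EOS at <|im_end|>.
with open("config.json") as f:
    cfg = json.load(f)
cfg["eos_token_id"] = IM_END_ID
with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)

# Point the tokenizer config's EOS string at <|im_end|> as well.
with open("tokenizer_config.json") as f:
    tok_cfg = json.load(f)
tok_cfg["eos_token"] = "<|im_end|>"
with open("tokenizer_config.json", "w") as f:
    json.dump(tok_cfg, f, indent=2)
```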
@froggeric I trained using <|im_start|> and <|im_end|>, but oddly it probably performs better with Alpaca.
I have done a few tests now using a few different prompt formats (ChatML, Zephyr, Alpaca, Mistral Instruct). I find that using Zephyr instead of ChatML actually often performs better, and is not affected by the <|im_end|> problem. Alpaca works OK too, but has a few problems with tokens inserted in the conversation. The best results, though, come from Mistral Instruct, which is not surprising as it is the underlying foundation; however, it suffers the most from token insertion.
Why don't you stick to the Mistral Instruct format for the v3 training? I think the best results should be achieved when using the same format as the one used for the base model.
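For reference, the Mistral Instruct prompt shape (per the base instruct model's card; the placeholders are mine) is roughly:

```
<s>[INST] {user message} [/INST] {assistant reply}</s>[INST] {follow-up message} [/INST]
```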