ChatML format

#1
by andysalerno - opened

The model card says this:

I don't really understand the point of having special tokens for <|im_start|> and <|im_end|>, because in practice they just act as BOS and EOS tokens (but, please correct me if I'm wrong).

I'm definitely no expert on this topic, but I have a thought to share.

From the OpenChat paper, they say this:

To differentiate speakers, we introduce a new <|end of turn|> special token at the end of each
utterance, following Zhou et al. (2023). The <|end of turn|> token functions similarly to the
EOS token for stopping generation while preventing confusion with the learned meaning of EOS
during pretraining.

https://arxiv.org/pdf/2309.11235.pdf

In my (non-expert) opinion, it makes sense to use a dedicated end-of-turn token that is different from EOS, if only because OpenChat and others do it (and OpenChat is a really, really great finetune). And if you just use standard ChatML, there is an added benefit: any API, library, code, or caller that already knows the standard ChatML format could consume the model without any changes.
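For readers who have not seen it, here is a minimal sketch of how a ChatML-style prompt is usually laid out. It is plain Python with no library dependency; the helper name `format_chatml` and its `add_generation_prompt` parameter are made up for illustration, and the exact whitespace a given model expects may differ, so treat the model's own tokenizer configuration as the authority.

```python
from typing import Dict, List

# Token strings used by the ChatML convention discussed in this thread.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} dicts as a ChatML-style string.

    Hypothetical helper for illustration only; it is not taken from any
    library or from the model card.
    """
    parts = []
    for message in messages:
        # Each turn: start token, role name, newline, content, end token.
        parts.append(
            f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
    print(format_chatml(conversation))
```

A caller that already produces this layout would then stop generation when the model emits `<|im_end|>`, which is the "consume the model without any changes" point above.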

OpenChat doesn't use ChatML:
https://huggingface.co/openchat/openchat_3.5/blob/main/tokenizer_config.json#L51
https://github.com/lm-sys/FastChat/blob/ec9a07ed22110e9686b51fd6ee9bf635b7ce54f8/fastchat/conversation.py#L542

Many other popular models do (e.g. OpenHermes and Dolphin), but they are just changing the stop tokens and adding new special tokens after BOS for some reason, and those tokens serve exactly the same purpose:
https://github.com/lm-sys/FastChat/blob/ec9a07ed22110e9686b51fd6ee9bf635b7ce54f8/fastchat/conversation.py#L1067

The OpenChat paper discusses the performance difference between having a turn differentiator only once and having one after each role, but it doesn't test the difference between reusing the existing special tokens and adding new ones. I suspect the performance would be identical. They say "... prevent confusion with the learned meaning of EOS...", but I don't see evidence of such confusion in this paper or in the referenced papers I looked at. Perhaps I missed it.
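To make the structural point concrete, here is a small sketch that renders the same conversation twice: once with the dedicated ChatML tokens and once reusing `<s>` and `</s>` as turn delimiters. The helper `render_turns` and the BOS/EOS-based layout are hypothetical and are not any model's actual template. The sketch only shows that the two layouts differ in the token strings and nothing else; it says nothing about whether a model trained on each would perform the same.

```python
from typing import Dict, List


def render_turns(messages: List[Dict[str, str]],
                 start_token: str,
                 end_token: str) -> str:
    """Wrap every turn in the given start/end delimiter strings.

    Hypothetical helper for illustration; the layout mirrors ChatML but the
    delimiters are parameters so two choices can be compared.
    """
    return "".join(
        f"{start_token}{m['role']}\n{m['content']}{end_token}\n"
        for m in messages
    )


conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]

# Dedicated special tokens, as in ChatML.
chatml = render_turns(conversation, "<|im_start|>", "<|im_end|>")

# Assumed alternative: reuse the tokenizer's existing BOS/EOS strings.
reused = render_turns(conversation, "<s>", "</s>")

# Swapping the delimiter strings maps one rendering onto the other exactly.
same_shape = (
    chatml.replace("<|im_start|>", "<s>").replace("<|im_end|>", "</s>")
    == reused
)

if __name__ == "__main__":
    print(chatml)
    print(reused)
    print("identical apart from delimiters:", same_shape)
```

What the string comparison cannot capture is the argument quoted from the paper: EOS already has a meaning learned during pretraining, whereas a new token starts without one. Whether that matters in practice is the open question in this thread.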
