Model follows ChatML format, but does not have the special tokens for ChatML

#3
by andysalerno - opened

This model seems to use ChatML, but doesn't have the <|im_end|> special token.

Honestly this might not be a big problem. But it raises some interesting questions.

  1. Do models have a harder time learning the concept of "stopping" when they have to track a sequence of multiple tokens, i.e. ['<', '|', 'im', '_', 'end', '|', '>']?
  2. Does this introduce cases where certain text sequences produce an ambiguous representation of the stop string? E.g. if a model's output ends with < right before the stop string, the text tokenizes as ['<<', '|', 'im', '_', 'end', '|', '>'], which no longer matches the tokenization of the expected stop sequence, because two < in a row get tokenized as << instead of ['<', '<'] (see the sketch below). That's just one example; I suppose there could be more. I guess this only has an impact during training, and in rare cases?
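
A quick way to check point 2 empirically (just a sketch; the tokenizer repo is an assumption, and any Llama/Mistral-style BPE tokenizer behaves similarly):

from transformers import AutoTokenizer

# Assumption: a plain Mistral-style tokenizer without <|im_end|> registered as a
# special token, i.e. the situation described above. The repo name is illustrative.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

expected = tok.tokenize("<|im_end|>")           # the stop string on its own
in_context = tok.tokenize("blah <<|im_end|>")   # same string, preceded by another '<'

print(expected)
print(in_context)
# If the tail of in_context differs from expected, any stop logic that matches the
# "canonical" token sequence of <|im_end|> can miss the stop in this case.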

I only bring this up because I am curious whether it introduces subtle problems that are not easy to notice or that could reduce output quality.

Ambiguities like this could potentially arise and confuse the model during training. However, these issues may not significantly degrade model performance unless encountered frequently. Nonetheless, using a single token as a stopping signal can help avoid such potential problems.

My experience is that the mismatch between the Mistral-style stop token and the ChatML-style <|im_end|> stop token is causing the model to go on to generate synthetic user messages that it will then answer itself.

Definitely seems like something that should be addressed.

Haha, facing this same issue.

Hey thanks for the feedback. I had a few discussions about this issue, it's a tricky question. I updated the tokenizer's config and created a new GGUF version of the model here: https://huggingface.co/mlabonne/NeuralBeagle14-7B-GGUF-v2

Do you mind testing it to tell me if it's fixed or not? I'd also be interested in examples where the model doesn't behave as expected.

@mlabonne I believe it's not enough to simply update the tokenizer config - I think the model itself needs to be updated to have a slot for the new token IDs in the input/output layers. Check out this very recent PR in trl which adds a helper for doing exactly this, seems pretty interesting:

https://github.com/huggingface/trl/pull/1242/files
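
For reference, a minimal sketch of what "giving the model a slot for the new token IDs" usually involves with plain transformers (my illustration of the general idea, not necessarily what the trl helper in that PR does):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the repo name is illustrative; any Mistral-style checkpoint works the same way.
model = AutoModelForCausalLM.from_pretrained("mlabonne/NeuralBeagle14-7B")
tokenizer = AutoTokenizer.from_pretrained("mlabonne/NeuralBeagle14-7B")

# Register the ChatML markers as real special tokens...
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)

# ...and grow the embedding matrix and LM head so the new IDs have rows.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# Point generation at the new end-of-turn token.
tokenizer.eos_token = "<|im_end|>"
model.config.eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")

Note that the newly added embedding rows start out untrained, so resizing alone doesn't teach the model to actually emit <|im_end|>; presumably that's why the trl helper ties this to fine-tuning.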

Yes, that doesn't change the model, but it helps depending on the inference tool you're using (looking at you, vLLM). The model can use different chat templates by default, including ChatML.

I'll read it, thanks!

@andysalerno FYI: Using the token list ['<', '|', 'im', '_', 'end', '|', '>'] as the stop sequence works fine for me. See ChatLLM.cpp.

You can just use ['<|im_end|>'] as the stop string, otherwise the model would stop prematurely on any of those individual tokens. The angle brackets, pipe, and underscore are commonly used in programming.

I think the above are both solutions if the problem is "I want this model to work properly in llama.cpp today" or the like.

But from a more general standpoint, it seems intuitive to me that having the stop sequence as a special, dedicated eos_token is superior, for a few reasons:

  1. Standardization - it's ideal if the ChatML (or whichever) format becomes the norm, not just the chat_template but also the special_tokens set.
  2. Generation quality - this is purely speculation on my part, but it makes sense that a single, special token for EOS is better than multiple, non-special tokens like ['<', '|', ...] for denoting end of turn in chat models.
  3. Streaming - if you have a stateless API that's streaming tokens, you would need to keep track of the last 7 tokens to know whether they were ['<', '|', 'im', ...] in order to stop streaming. With a dedicated special EOS token, you don't need to keep any state: the moment you see token_id 32002, you know to stop. (This is a problem I have been encountering lately; see the sketch below.)
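
To make point 3 concrete, a rough sketch of the two cases (hypothetical helper code, not from any particular library; 32002 is just the example ID above):

from collections import deque

# Case A: dedicated special EOS token - stateless, one comparison per streamed token.
EOS_TOKEN_ID = 32002  # example ID from point 3

def should_stop(token_id):
    return token_id == EOS_TOKEN_ID

# Case B: multi-token stop sequence - the server must buffer a rolling window of
# the most recent token IDs and compare it on every step.
class StopSequenceWatcher:
    def __init__(self, stop_ids):
        # stop_ids would come from something like
        # tokenizer.encode("<|im_end|>", add_special_tokens=False)
        self.stop_ids = list(stop_ids)
        self.window = deque(maxlen=len(self.stop_ids))

    def push(self, token_id):
        self.window.append(token_id)
        return list(self.window) == self.stop_ids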

Maybe using this approach will help? https://huggingface.co/cognitivecomputations/dolphin-2.6-mistral-7b-dpo-laser

Prompt format: This model uses the ChatML prompt format. NEW - <|im_end|> maps to token_id 2. This is the same token_id as </s>, so applications that depend on EOS being token_id 2 (koboldAI) will work! (Thanks Henky for the feedback)

As an example:

I did a quantize on the latest version, for MLX. In my testing the model unfortunately continues to ramble. It no longer outputs <|im_end|> or <|im_start|>, but now outputs </s> and continues to answer itself.

I tried a hacky solution that replaces the <s> and </s> tokens. I cannot reproduce this behavior so I assume it's a problem with the default configuration of the frontend you use. I'd like to get it right so people don't have to care about setting the right tokens. Let me know if that new version works, thanks!

I just noticed that the mlx_lm converter (to MLX) changes the config files (special_tokens_map.json, tokenizer_config.json, etc.). I don't know exactly why. The files look similar, but not quite like your version. I guess that is the problem in my case. I opened an issue here: https://github.com/ml-explore/mlx-examples/issues/355

Would it be possible (or help) to use this trick here?

<|im_end|> maps to token_id 2. This is the same token_id as </s>, so applications that depend on EOS being token_id 2 will work!

https://huggingface.co/cognitivecomputations/dolphin-2.6-mistral-7b-dpo-laser/raw/main/tokenizer_config.json

See

"vocab": {
...
    "<s>": 1,
    "<unk>": 0,
    "<|im_end|>": 2,
    "<|im_start|>": 32000,
...
}

Just ignore my last request please. Now that I understand a bit more, I found a fix/hack on my side. For now 2 and 32000 are pretty much standard for ChatML models, so it should work fine:

if token in [2, 32000, tokenizer.eos_token_id]:
    break
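
For what it's worth, a slightly less hard-coded version of that check (my sketch, assuming a Hugging Face tokenizer; the repo name is just an example):

from transformers import AutoTokenizer

# Assumption: illustrative repo name; use whatever model you are serving.
tokenizer = AutoTokenizer.from_pretrained("mlabonne/NeuralBeagle14-7B")

# Resolve the stop IDs from the tokenizer instead of hard-coding 2 and 32000.
stop_ids = {
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|im_end|>"),
}
stop_ids.discard(None)  # drop missing entries just in case

def is_stop(token_id):
    return token_id in stop_ids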

Frankly, I do not know what is happening. It is not even an issue of my frontend not stopping on <|im_end|> or </s> anymore. The model just keeps writing, and there is no consistent stop string in the output that I could watch for. I have 3 other ChatML models, and all of them work fine with the same code.

@azinca: You have two problems. First, the default chat template does not add newline characters.

And second, it seems that the latest models from @mlabonne never generate the EOS token on vLLM.

I don't know if this is the origin of the problem (maybe a clue?), but I assume that the config.json file is generated by transformers.save_pretrained and is related to the training configuration.

Now, the config.json defines "vocab_size": 32000, which should be 32002, as we need to count <|im_start|> and <|im_end|>.

Setting this value manually causes an error.
So I think the problem comes from the model and not from the configuration.

After some tests, it is indeed possible to encode "<|im_end|>" => 32000 with the tokenizer (if you print the token IDs after a request, the token is encoded fine).
The problem is that the model never generates this token.
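
One way to check whether that is a tokenizer problem or a model problem (my sketch with plain transformers; the repo name is an assumption):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mlabonne/NeuralBeagle14-7B"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
print("tokenizer vocab size:", len(tokenizer))
print("config vocab_size:   ", model.config.vocab_size)
print("lm_head rows:        ", model.get_output_embeddings().weight.shape[0])
print("<|im_end|> id:       ", im_end_id)

# If the <|im_end|> ID is >= the number of lm_head rows, the model has no logit
# for it and can never generate it, regardless of the tokenizer configuration.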

The fact that it seems to work with llama.cpp (I haven't tried myself) could also be a clue, but I don't know enough about it.

Same problem here; I would like to know what the effective solution is. I have tried several open-source models with this problem, and I feel it is an issue with the Mistral architecture being fine-tuned on the ChatML format.

It may also have something to do with the use of the bagel dataset, which mixes a variety of chat_templates.

@Liangmingxin Have you tried ChatLLM.cpp? It handles <|im_end|> as a stop sequence well.

Sorry for all this trouble, but could you please undo that last change (or even better, the last 2 changes)? With it, the model may not output any kind of stop token at all now.

I found that the json configs at the time this model was made were quite OK: https://huggingface.co/mlx-community/NeuralBeagle14-7B-4bit-mlx/tree/main

Testing the model above, I implemented a workaround on my frontend by watching explicitly for either "<|im_end|>" or "</s>" in the generated text, since tokenizer.eos_token_id is not very reliable, as is the case for this model (which I like a lot, and that is why I have been persistent in trying to make it work with my mlx-lm frontend).
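
For anyone implementing a similar frontend-side workaround, a rough sketch of what a text-level stop watcher might look like (hypothetical helper, not the commenter's actual code; it keeps a small tail buffer so a stop string split across two streamed chunks is still caught):

STOP_STRINGS = ["<|im_end|>", "</s>"]

class TextStopWatcher:
    def __init__(self, stop_strings=STOP_STRINGS):
        self.stop_strings = stop_strings
        self.buffer = ""
        # Hold back enough characters that a partially streamed stop string is not emitted.
        self.keep = max(len(s) for s in stop_strings) - 1

    def push(self, chunk):
        """Feed one streamed text chunk; return (visible_text, should_stop)."""
        self.buffer += chunk
        for stop in self.stop_strings:
            idx = self.buffer.find(stop)
            if idx != -1:
                return self.buffer[:idx], True
        visible = self.buffer[:-self.keep]
        self.buffer = self.buffer[-self.keep:]
        return visible, False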

Hi, I tried something new by combining the ChatML template with the default bos and eos tokens. Let me know if it works better!

Should I be surprised that it performs better using the Mistral template?

I would expect it to work well too. It would be nice to know if you've tried other templates; I haven't checked this particular one.

It doesn't work too well with ChatML for me; Mistral is better.
