ChatML prompt format confusion - please reconsider

#3
by kalomaze - opened

If I'm reading this correctly, ChatML was designed with the expectation that <|im_start|> would be a custom BOS token and <|im_end|> a custom EOS token. However, the special tokens for this model are not configured any differently.
This diverges from OpenOrca, another model using the ChatML format, which apparently does use special tokens to represent those two strings:
https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json

However, they are added as extra tokens rather than replacing the existing BOS and EOS tokens, even though they seem to represent those concepts (???)
Confusingly, in TheBloke's GGUF quantization of this model, those added tokens don't appear to be read properly:

llm_load_print_meta: token 32000 '<dummy32000>' type = 4
llm_load_print_meta: token 32001 '<dummy32001>' type = 4
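
For reference, ChatML tokens are usually attached as added special tokens along these lines (a sketch using the Hugging Face transformers API; the base model name is illustrative, and the actual OpenOrca training code may differ):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Register the ChatML delimiters as new special tokens (here ids 32000 and
# 32001) instead of remapping the existing BOS/EOS tokens.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)
# Grow the embedding matrix to cover the two new ids.
model.resize_token_embeddings(len(tokenizer))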

Overall, the ChatML prompt format seems highly redundant and I'm opposed to it being used in the future.
Here's my full reasoning for this:

  • The benefit of ChatML (for OpenAI) is prompt-injection resistance: because the delimiters are special tokens, they can never be produced by tokenizing user-supplied text, so input that tries to 'break alignment' can't fake them. A model that's designed to be openly available has no use for this security measure.
  • There are already established prompt formats for Llama, like Alpaca's ### Instruction / ### Response: style.
  • When implemented as intended (though apparently not implemented correctly yet?), the stop token differs from all past Llama models, which breaks the established inference clients for Llama (it isn't tokenized properly in KoboldCpp, for example, and gets treated as a string; see the sketch at the end of this post for a quick way to check this)
  • This also breaks model merges with models that don't use the same format (again, only if implemented as intended; here it's using raw strings)
  • Some LLM interfaces that allow custom prompt formats were not designed to append end tokens for both the 'User' and the 'Assistant'. And I could find no evidence that this format improves model performance, whether implemented as intended or as this repository handles it
  • A newline token separates the user/assistant role name from the message text, but there is no newline separating the end token. This design choice is arbitrary and rigid
  • Aesthetically, it's much harder to read. Here's an example of the same prompt in ChatML and in Alpaca:

ChatML

<|im_start|>system
You are an expert dolphin trainer.<|im_end|>
<|im_start|>user
What is the best way to train a dolphin to obey me?  Please answer step by step.<|im_end|>
<|im_start|>assistant

Alpaca

You are an expert dolphin trainer.

### Instruction:
What is the best way to train a dolphin to obey me?  Please answer step by step.

### Response:

If there's a good reason for keeping this format, I'd love to know, because this feels like way more trouble than it's worth. I am hearing similar complaints from others.
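
By the way, the tokenization point from the list above is easy to check (a minimal sketch using the transformers API; any ChatML model's repo works here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")
ids = tokenizer.encode("<|im_end|>", add_special_tokens=False)
# A single id means <|im_end|> is a true special token; a list of several
# ids means it is being tokenized as raw text, which is what the
# inference clients then choke on.
print(ids)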

kalomaze changed discussion title from ChatML prompt format confusion to ChatML prompt format confusion - please reconsider
Cognitive Computations org

I tested it.

It works.

If it breaks, please let me know.

It technically 'works', but the implementation isn't what the format intends: it was designed with custom added tokens in mind, and those are nowhere to be seen in this repository. You're saying you want to keep using this format going forward, but there's no clarification on whether this missing design detail was an oversight or intentional. That's what I'm looking for, because I can't tell otherwise.

Cognitive Computations org

I'm not highly concerned about this.

Cognitive Computations org

Tell ya what. If you wanna make a pull request I'll check it out and test it.

I completely understand not wanting to retrain a model just to change the prompt format; my concern is mainly about whether future models will use the format.
That's why I suggested a more popular prompt format like Alpaca for future training runs. If you want, I'd gladly make a pull request for the Dolphin dataset that reformats it to Alpaca, or a Python script that reformats ChatML datasets to Alpaca

I'm not a fan of this format either, for exactly the reasons you've mentioned. On the first day of the Mistral Orca release, there were a ton of issues from people trying to stop it from spamming <|im_end|>. Only after this commit https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/commit/17572416df27482d71dda9ea6bdea1733d8cee5d was that largely fixed.
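
If I understand that commit correctly, the fix amounts to pointing generation at the new end token, along these lines (a sketch using the transformers API; the commit's actual changes may differ):

from transformers import AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")
gen_config = GenerationConfig.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")
# Stop generation on <|im_end|> instead of the base model's </s>, so the
# output no longer runs past the delimiter and "spams" it as plain text.
gen_config.eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")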

However, I'm not sure the GGUF models are problem-free; I've heard of instances where issues persisted even after this commit. So if this special EOS token really isn't recognized, that's a huge problem, as GGUF is arguably the most popular format for inference right now.

This needs to be investigated. I don't think switching to Alpaca is a good idea, though, as it's worse for multi-turn conversation. In my opinion, it would be nice if open source adopted Llama 2 Chat's format as a standard. It's well established and also great for multi-turn chat.

Cognitive Computations org

The Alpaca format has never been good; it was only "good enough" to start with, and it should be phased out now.

I'm willing to accept an alternative that doesn't involve breaking compatibility in the way that chatML does

Cognitive Computations org

I am retraining dolphin with the proper tokens for ChatML. Will release as dolphin-2.1-mistral-7b.

I have made up my mind quite firmly to use ChatML. It would take math to convince me otherwise.

Cognitive Computations org
β€’
edited Oct 7, 2023

I'm willing to accept an alternative that doesn't involve breaking compatibility in the way that chatML does

(I will refer to the ChatML format as the "new format".) I am too, but not with ChatML. All current models are trained on Alpaca, and this new format is completely incomprehensible for them to grasp. Think of this: all future models will use that format. If someone wants to merge an "old" model with a new one, quality will be greatly degraded, since they are two COMPLETELY different formats and the "old" model has no idea what to do with the new one.

It's just annoying. The format is also hard to read and inflexible, with no instruct or similar sections.
Also, those assistant/user formats in general guide the AI to act like an assistant even when you might not want it to, which is bad to say the least.

tl;dr: I am completely okay with changing formats, but not to ChatML, please.
(Correct me if I'm wrong, always happy to learn.)

Cognitive Computations org

These are not technical arguments.
Sorry that you are frustrated, but this change is really for the best.

ehartford changed discussion status to closed
Cognitive Computations org

These are not technical arguments.
Sorry that you are frustrated, but this change is really for the best.

Perhaps you could consider the Llama-2-Chat prompt template?
From @jondurbin in a Discord channel:

I really don't think vicuna prompt format is optimal.

USER: is tokenized in multiple ways, and somewhat inherently assigns an extra identity to the model if you use a persona as the system prompt.

Alpaca is ok for instructions, but the chance of a markdown-style header like "### Instruction" or "### Response" occurring in the wild is pretty large, so it's probably much easier to get strange results from prompt inputs.

chatml is better at deterministic delimiters than vicuna, but IMO llama-2 chat is better for very clearly separating system from instruction and instruction from response, and there's no identity/role terminology introduced to contend with a persona in the system prompt.

<|im_start|>system
you are Jon
<|im_end|>
<|im_start|>user
hello
<|im_end|>
<|im_start|>assistant

vs.

[INST] <<SYS>>
You are Jon.
<</SYS>>
hello [/INST]

Much clearer, cleaner, and less ambiguous IMO. 

Assuming this is why you chose ChatML, you might also consider Llama-2-Chat as a more readable alternative. I think the folks at Meta AI are great; at least they have a reason for their choice.

Cognitive Computations org

I chose ChatML for two reasons.

  1. it works
  2. it has momentum

I released dolphin-2.1-mistral-7b, which fixed the ChatML token issues. And it's at the top of the leaderboard for 7B.

https://huggingface.co/ehartford/dolphin-2.1-mistral-7b

I have no evidence that llama-2 chat is superior to ChatML, it's just a hunch and my personal preference.

Contributing factors to my thought:

  1. The Meta folks are pretty smart, so I suspect they spent some time investigating prompt formats to settle on that one.
  2. OpenAI folks too are obviously very smart, however there's evidence ChatML itself has changed: https://news.ycombinator.com/item?id=34990391 and I wouldn't want to use a deprecated standard.

Again though, I have no evidence any format is better or worse, and with the tooling support around prompt formats I don't think we need a single unified standard TBH.

Cognitive Computations org

@ehartford you are always talking about the math and how well it works, but I've never seen any proof of the math behind it or of how well it works.

Cognitive Computations org

I already proved that it works.

I have examples of the model's output in the model card.

Cognitive Computations org

I have no evidence that llama-2 chat is superior to ChatML, it's just a hunch and my personal preference.

Contributing factors to my thought:

  1. The Meta folks are pretty smart, so I suspect they spent some time investigating prompt formats to settle on that one.
  2. OpenAI folks too are obviously very smart, however there's evidence ChatML itself has changed: https://news.ycombinator.com/item?id=34990391 and I wouldn't want to use a deprecated standard.

Again though, I have no evidence any format is better or worse, and with the tooling support around prompt formats I don't think we need a single unified standard TBH.

Thanks for your perspective Jon 😊

Cognitive Computations org

@ehartford you were saying you trust the math behind it. I do trust math, but I haven't seen ANY math about this yet.

Cognitive Computations org

I didn't say that there was any math.

I said the only way to change my mind is math.

Cognitive Computations org

You should spend your time another way.

I'm not going to change the prompt format.

Cognitive Computations org

@ehartford I already know that; it's obvious that you won't change it. I just want to know what makes you think it's so good.

Cognitive Computations org

I've explained why I am using it.

  1. it works
  2. it has momentum

I vote to switch to the Llama 2 format. AFAIK there is no proof in favor of one or the other, so I will focus on the wasted tokens: ChatML uses more tokens to do the same work, and those tokens are wasted since they carry no useful information.

Here's a short example without a system prompt. Just compare ChatML:

<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
I am doing well!<|im_end|>

With Llama 2:

[INST] How are you? [/INST] [INST] I am doing well! [/INST]

If I'm correct, ChatML uses 57 characters just for the format itself, while Llama 2 uses only 31.

So if there is no proof demonstrating that the ChatML format is the best, I think we should switch to the Llama 2 format to spend fewer tokens.

Cognitive Computations org

Cool. It's open source! You are welcome to train one your way. I'll be training this one my way.

By the way, the just-released OpenHermes also uses ChatML.

I'm going to stay on the winning train.

Ok, anyway thank you for your work :)

I'm going to stay on the winning train.

Restating that ChatML is fine and I'm not going to try to convince you to change, but...

I think the numbers may disagree with you on that point:

  • mistral-7b-instruct-v0.1 downloads last month: 154,352
  • llama-2-7b-chat downloads last month: 1,152,332
  • llama-2-13b-chat downloads last month: 285,791
  • llama-2-70b-chat downloads last month: 205,955
  • codellama-7b-instruct downloads last month: 45,882
  • codellama-13b-instruct downloads last month: 20,491
  • codellama-34b-instruct downloads last month: 211,818

ChatML certainly has some momentum, and popular models use it (yours, teknium's, zephyr, etc.), but I think the llama-2 chat format is "winning", in terms of downloads anyway.

Cognitive Computations org

I get it 😁

Cognitive Computations org

I still think ChatML is winning and is going to win. Downloads isn't my metric.

Cognitive Computations org

Downloads within some window of time anyway.

Adoption and trending are my metrics

Anyway I don't need numbers to tell me. This is deeper than that.

I couldn't pass up the opportunity to do just a smidge of trolling.

Sorry, I made a mistake with the Llama 2 example; it's only:

[INST] How are you? [/INST] I am doing well!

So the Llama 2 format uses only 16 characters (including 3 blank spaces) vs 57 characters for the ChatML format. The difference is very large: a 5-turn conversation with the same text would waste 285 characters on ChatML formatting, but only 84 with the Llama 2 format (counting the extra blank space at the end of each answer)
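
For what it's worth, those per-turn counts are easy to reproduce (a quick sketch that counts characters rather than tokens, and ignores newlines, just like the figures above):

def overhead(template):
    # Characters spent on the delimiters alone, with empty messages.
    return len(template.format(q="", a=""))

chatml = "<|im_start|>user{q}<|im_end|><|im_start|>assistant{a}<|im_end|>"
llama2 = "[INST] {q} [/INST] {a}"
print(overhead(chatml))  # 57
print(overhead(llama2))  # 16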

Cognitive Computations org

Thanks for the feedback. I will keep it in mind.

Just a short update: Amazon released MistralLite; check out their prompt format:

<|prompter|>{prompt}<|assistant|>

https://huggingface.co/amazon/MistralLite

Excuse me for responding to this older, closed issue, but I'd like to add some information to this discussion for the record, as a supporter of the ChatML format (and "hater" of the Llama 2 Chat format):

There were issues here with the implementation of the ChatML format, like the tokenizer issues that affected the special tokens, and it's good that those were discovered, reported, and fixed. Still, other formats have issues, too, and some are hard or impossible to fix.

Take the Alpaca format's ###, which gets tokenized in different ways and collides with other meanings like markdown headers. A unique special token that's never part of input text is needed (and can be filtered out when taking input from external sources, as security is definitely an issue for open-source models too, when you let others use yours).

That's not just a security measure: a proper system prompt that's understood and respected by the model is very useful, for instance to distinguish between the user (in character) asking the model to do something and the user (as the AI admin) commanding it to do something. And if you do want to host your model and prevent other users from controlling it like an admin, filtering out a rogue system prompt is easier when it's properly delimited with unique special tokens.
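
A minimal sketch of that filtering, assuming untrusted text is about to be inserted into a ChatML template (the function is hypothetical, not from any library):

def sanitize(untrusted_text):
    # Strip ChatML delimiters so external input can't fake a system turn.
    # With true special tokens this is belt and braces: tokenizing
    # ordinary text never produces the special ids anyway.
    for token in ("<|im_start|>", "<|im_end|>"):
        untrusted_text = untrusted_text.replace(token, "")
    return untrusted_text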

Llama 2 Chat's format is terrible, IMHO, as it puts the system message inside the first user message. There are no tags marking the response; it simply comes after/between user messages, which is incompatible with chatbots where the AI goes first (greeting messages are very common). All in all, it's too complicated and unintuitive; even the person recommending it messed it up in their post above.
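
To show how easy it is to get wrong, here's roughly what correct multi-turn assembly has to look like, as far as I can tell from Meta's reference code (a sketch; note that <s> and </s> are normally real BOS/EOS tokens added by the tokenizer, not literal text):

def llama2_chat(system, turns):
    # turns is a list of (user, assistant) pairs; the last assistant entry
    # may be None to leave the prompt open for the model's reply.
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            # The system message is folded into the FIRST user turn,
            # which is exactly the quirk criticized above.
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        prompt += f"<s>[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt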

Same with Amazon's: where would you put the system message? At least it should, in theory, support putting the "assistant" tag and message before "prompter" to have the AI go first with an introductory message, so it's better than Llama 2 Chat. Still, it's less flexible than ChatML, which can easily be extended with additional roles, whereas Amazon's format would need new special tokens for each.

Who knows, maybe there will be a better format down the line, but right here and now, ChatML looks to be the most flexible, and that's apparently why it's gaining traction and becoming the standard.

Yeah, it looks like ChatML is the best one. I read this very quickly, but it looks like it's better than the others at preventing prompt injections.

Yeah, after having to deal with the Llama 2 format personally, I must agree with Wolfram here. It's really not as great as I thought, and there are tons of mistakes one can make. Plus, the format is not really suited for RP, precisely for the reasons Wolfram mentioned.

ChatML will probably be the better alternative.
