Ability to generalise
I need to start by saying that I have been using your models since the 65B versions. I found them to have the highest level of "intelligence", rivalled only by Guanaco.
So, now for the conflict part of my post :)
I appreciate the effort you have put into expanding the baseline model's capability, but I personally believe that mandating a specific prompt structure for the model to produce reliable responses is not the best direction to take. Rigidly structured inputs are what classical ML is for, and classical ML is not suited to the tasks LLMs can do. I do not think we need to seek convergence between the ML and LLM spheres.
Thus, ideally Airoboros 3.1.3 would be robust enough to decide how to process the prompt without having to be cajoled or triggered into doing it "correctly".
Which functionality are you referring to specifically? Or do you mean the prompt format?
The prompt format. The way I see it (and this is the "important" part of what I am saying, i.e. just how I would like things to be), we should fret less about prompt engineering and focus more on making use of the output.
The second reason I see this as important is the potential for breaking changes whenever a prompt format changes. Divergence in how models like to be prompted makes it more difficult to maintain the code I write, and to compare and select among different models using a conceptually identical prompt.
I agree it's a pain in the butt, but maybe the solution is not to make the model respond equally well to any format. I have to imagine that has a cost, especially when we have a very limited number of parameters in a LoRA fine-tune. In an ideal world, we would all experiment and pick a format (e.g., ChatML, Llama-chat) that works well, plus a tokenizer to go with it (to efficiently and consistently encode the start/end/role tokens), and we would all train models in that format.
But I think it's quite messy right now and not everyone agrees on the best way. There are formats that are objectively a bad idea or always tokenize inconsistently, repos (e.g., Axolotl) that make it complicated to modify the format, etc. There is some movement toward ChatML now, with Hugging Face starting to support it natively and providing code for building chat inputs with it, but it still needs proper support for special tokens and so on, which is also all over the place.
So yeah, right now I do write code each time and have to add a bunch of clauses to handle format variations (see the sketch below), but I view that as a price to pay to get the best quality outputs as we move toward some kind of convergence. The best we can do is encourage people to use a specific format and argue for it, rather than have them dump resources into training a format-agnostic model, IMO.
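For instance, the per-format branching ends up looking roughly like this (a hypothetical sketch; the function name is mine and the template strings are simplified single-turn versions, not exact reproductions of each format):

```python
# Hypothetical sketch of per-format prompt assembly; the template strings
# are simplified single-turn approximations of each format.
def build_prompt(fmt: str, system: str, user: str) -> str:
    if fmt == "llama-2":
        return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    if fmt == "vicuna":
        return f"{system}\nUSER: {user}\nASSISTANT:"
    if fmt == "chatml":
        return (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )
    raise ValueError(f"unknown prompt format: {fmt}")
```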
Prompt format is also not prompt engineering from the user's point of view, I think. There is a difference between handling the prompt format in whatever front-end there is (common to all queries to that model), versus having to actually wordsmith each individual query.
Thank you for the detailed answer... and yes, it is not "prompt engineering" exactly. I was considering the changes I need to make to my Python programs, hence the overlap with the "engineering" context. I guess my main issue is unfamiliarity and the need to adjust my work... so I am complaining because I want the perfect future today :)
I understand the frustration, and I knew the switch from vicuna style to llama-2 chat would be somewhat painful, but it's for the best, especially now that most of the inference backends support this format out of the box.
It was a bandaid that needed to be ripped off.
re: vicuna USER/ASSISTANT
`USER:` can be tokenized in multiple ways depending on the surrounding characters, and it somewhat inherently assigns an extra identity to the model if you use a persona as the system prompt.
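To illustrate the first point, here's a quick check (a sketch; assumes a llama-family tokenizer via transformers):

```python
# Sketch: the literal text "USER:" maps to different token IDs depending
# on what precedes it, due to SentencePiece's space-prefix handling.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

for ctx in ["USER:", "\nUSER:", " USER:", "foo USER:"]:
    print(repr(ctx), tok.encode(ctx, add_special_tokens=False))
# The IDs covering "USER:" are not the same sequence in every context.
```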
Alpaca is ok for instructions, but the chance of a markdown-style header like "### Instruction" or "### Response" occurring in the wild is pretty large, so it's probably much easier to get strange results from prompt inputs (e.g., RAG).
ChatML has more deterministic delimiters than vicuna, but IMO llama-2 chat is better at very clearly separating system from instruction and instruction from response, and there's no identity/role terminology introduced to contend with a persona in the system prompt.
```
<|im_start|>system
you are Jon
<|im_end|>
<|im_start|>user
hello
<|im_end|>
<|im_start|>assistant
```
vs.
```
[INST] <<SYS>>
You are Jon.
<</SYS>>

hello [/INST]
```
Much clearer, cleaner, and less ambiguous IMO.
llama-2 chat format, at least by model download count, is becoming the standard:
- mistral-7b-instruct-v0.1 downloads last month: 154,352
- llama-2-7b-chat downloads last month: 1,152,332
- codellama-34b-instruct downloads last month: 211,818
I don't plan to change the prompt format again, unless there is some proof that a different prompt format is superior.
The transformers library has an `apply_chat_template` method, which I would recommend using to reduce friction.
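For example (a minimal sketch; the model name is just illustrative, and any model whose tokenizer config ships a llama-2 chat template works the same way):

```python
# Minimal sketch of apply_chat_template producing a llama-2 chat prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are Jon."},
    {"role": "user", "content": "hello"},
]

# tokenize=False returns the formatted string instead of token IDs, e.g.:
# "<s>[INST] <<SYS>>\nYou are Jon.\n<</SYS>>\n\nhello [/INST]"
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
```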
Also, ChatML needs `<|im_start|>`/`<|im_end|>` added to the special tokens list for its prompts to tokenize correctly; otherwise, once again, you have ambiguous/random tokenization. This potentially adds more ways for things to go wrong (people may resize the embedding table to a size that isn't a multiple of a power of 2, causing performance issues; tokenizer implementations may simply not support special tokens; text-gen UIs may not tokenize special tokens correctly, such as Ooba in some cases).
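Registering them looks roughly like this (a sketch, assuming a transformers tokenizer/model pair; the base model name is a placeholder):

```python
# Sketch: registering the ChatML delimiters as special tokens so each one
# tokenizes to a single, stable ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "base-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)

# Resize the embedding table to match; padding to a multiple of 64 avoids
# the awkward non-power-of-2-multiple sizes mentioned above.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
```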
Llama-chat, on the other hand, uses BOS/EOS to separate its prompts, so it has none of those issues. The only issue with Llama-chat is the somewhat strange format that doesn't neatly fall into the way the others are implemented (it's so easy to mess it up!).
Personally, I've been using modified Vicuna:
```
<s>SYSTEM: <message></s>
<s>USER: <message></s>
<s>ASSISTANT: <message></s>
```
or
```
<s>SYSTEM:</s><s><message></s>
<s>USER:</s><s><message></s>
<s>ASSISTANT:</s><s><message></s>
```
which tokenizes consistently due to the embedded BOS/EOS and also does not require adding special tokens (a quick check below).
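A quick way to verify this (a sketch; assumes a llama-family tokenizer, where the literal `<s>`/`</s>` substrings are already registered special tokens and map straight to the BOS/EOS IDs):

```python
# Sketch: each modified-Vicuna turn begins/ends on the fixed BOS/EOS IDs,
# so the turn delimiters tokenize the same way in every context.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

turn = f"{tok.bos_token}USER: hello{tok.eos_token}"
ids = tok.encode(turn, add_special_tokens=False)
print(ids[0] == tok.bos_token_id, ids[-1] == tok.eos_token_id)  # True True
```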
I didn't think about USER/ASSISTANT getting referenced/confused with the actual query, so that's a good point! However, sometimes I want the system message to reference the user and/or assistant directly (a lot of canned system messages do that), and having those sub-headings maybe helps the AI interpret them more easily?