Reading the configuration files and provided examples, this model seems to be the result of a rank-8 LoRA trained on the given datasets. However, it isn’t clear what the fine-tuned formatting is supposed to be, since even the provided examples mix several existing formats.
Axolotl appears to ingest all these dataset types differently, but was the training done with a unified format, or separate formats? (Vicuna V0? V1.1? Alpaca?)
Starting a discussion for hopefully some clarification or a correction of my understanding.
Edit: The “adapter” parameter seems to have been left empty, so I presume this means that it wasn’t actually a LoRA, but a native fine-tune.
This was a native fine-tune of all the weights. The LoRA parameters were left in the config just as a copy-paste. One could easily recreate a LoRA version of this simply by setting
adapter: lora. Correct, we use all the dataset types and train according to each dataset's original training prompt format. A plan is in the works to try to force them all into a single format, like Open-Assistant's proposal here: https://github.com/LAION-AI/Open-Assistant/blob/main/model/MESSAGE_AND_TOKEN_FORMAT.md
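For reference, a minimal config sketch of what that could look like. Only adapter: lora comes from the reply above; the rank value echoes the rank-8 guess earlier in the thread, and the remaining fields are illustrative assumptions, so check Axolotl's own example configs for exact names and values:

```yaml
# Hypothetical Axolotl-style config fragment -- only `adapter: lora` is from
# the discussion; the other values are illustrative assumptions.
adapter: lora        # leave empty for a native full-weight fine-tune
lora_r: 8            # rank-8, as inferred from the released config files
lora_alpha: 16       # illustrative
lora_dropout: 0.05   # illustrative
```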
I see. If there are indeed plans to unify the prompting, there may be more value in choosing a plainer conversational format (like Vicuna V1.1), as special added tokens seem to cause accessibility issues for users (see llama.cpp, where adding special tokens is not a simple task for average users).
For example, the internal (raw) style in Vicuna V1.1 is as follows:
A chat between … answers questions. USER: Hello! ASSISTANT: Hi, how can I help you today?</s>USER: What is 2 + 2 equal to? ASSISTANT: 2 + 2 is equal to 4.
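As a sketch, the raw string above could be assembled like this. The separators (a single space after each user turn, "</s>" after each assistant turn) are assumptions based on FastChat's Vicuna v1.1 template, not the Axolotl implementation:

```python
# Sketch of assembling a Vicuna-v1.1-style raw prompt. Separators are
# assumptions from FastChat's v1.1 template: " " after user turns and
# the literal "</s>" string after assistant turns.
def build_vicuna_v11_prompt(system: str, turns: list[tuple[str, str]]) -> str:
    out = system + " "
    for user_msg, assistant_msg in turns:
        out += f"USER: {user_msg} ASSISTANT: {assistant_msg}</s>"
    return out

prompt = build_vicuna_v11_prompt(
    "A chat between a curious user and an assistant.",
    [("Hello!", "Hi, how can I help you today?"),
     ("What is 2 + 2 equal to?", "2 + 2 is equal to 4.")],
)
```

Note that in real training code the "</s>" here should become the EOS token ID, not literal text, which is exactly the pitfall discussed below.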
On a side note, when using said V1.1 formatting on wizard-mega-13b, I find that the model currently generates “</s>” as literal text tokens (sometimes alongside the correct end-of-stream token) instead of just a single end-of-stream token. Tokenization issues may be present in the Axolotl code.
Thanks for the feedback. I'll probably have to dig deeper to figure out why the EOS token is getting generated as literal text, especially as it's defined in the file as a special token. https://huggingface.co/openaccess-ai-collective/wizard-mega-13b/blob/main/special_tokens_map.json#L9-L15
One of the things I actually want to eventually try is to mix and match all the user and assistant tokens. The model should really be capable of generalizing the meaning of each, and since we always provide, for example,
<|assistant|> in the inputs, it doesn't need to know to generate it one way or another, and it can be pretty robust in dealing with differences between training prompts and inference prompts. Thoughts?
I’d check the tokenizer used and its list of accepted special tokens in your training program. If you were simply appending the text “</s>” and then tokenizing, as done in FastChat, the sequences were not being tokenized properly. If nothing else works, appending EOS by its token ID could work, although that isn’t ideal (unrelated, but also check whether the text “</s>” appears in the actual dataset; it shouldn’t be present). I noticed that the model also sometimes generates both: the text “</s>” and then an actual EOS.
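A toy sketch of the difference being described, using a stand-in tokenizer rather than the real Hugging Face one (EOS_ID = 2 is an assumption that matches LLaMA-family tokenizers; check tokenizer.eos_token_id in practice):

```python
# Toy sketch of the workaround suggested above: append EOS by token ID
# rather than by tokenizing the literal string "</s>".
EOS_ID = 2  # assumed; LLaMA-family tokenizers use 2

def tokenize(text):
    # Stand-in tokenizer: maps each whitespace-separated piece to a fake
    # ID >= 10, so EOS_ID can never appear by accident.
    return [10 + sum(piece.encode()) % 990 for piece in text.split()]

def encode_turn_fragile(text):
    # Fragile: "</s>" is fed through as ordinary text, so it may be split
    # into sub-tokens that the model then learns to emit literally.
    return tokenize(text + "</s>")

def encode_turn_robust(text):
    # Robust: tokenize the text alone, then append the EOS ID directly.
    return tokenize(text) + [EOS_ID]
```

The fragile path never produces the real EOS ID, which matches the symptom above: the model emits “</s>” as text because that is what it saw during training.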
I saw discussions here about getting these models to generalize better in terms of prompt formatting, but I’m not sure this should be prioritized given our goal of a size-efficient model trained on a diverse (both instructional and conversational) dataset. If we want to give it the best chance of high-quality responses (spending those weights on better output instead), I think at least the “user” marker(s) should stay consistent, to help the attention mechanism reliably find where the human sequences are.
No matter what, the one thing that should probably be kept constant is the conversational style, since a chatty model can take instructions but an instructional model will not converse well. Perhaps the “assistant” token(s) could be varied per chat with different token combinations, to allow custom identities (like using “BING:” or “GPT-4:”).
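The idea above could be sketched like this (the marker strings and pool are illustrative assumptions, not anything from the actual training code): keep the user marker fixed, and sample the assistant marker once per conversation.

```python
import random

# Sketch of the proposal above: a fixed user marker so attention can
# reliably find the human turns, and an assistant marker sampled once per
# conversation to allow custom identities. Marker strings are assumptions.
USER_MARKER = "USER:"
ASSISTANT_MARKERS = ["ASSISTANT:", "BING:", "GPT-4:"]

def format_conversation(turns, rng=random):
    """turns: list of (user_msg, assistant_msg) pairs."""
    assistant_marker = rng.choice(ASSISTANT_MARKERS)  # fixed for the whole chat
    lines = []
    for user_msg, assistant_msg in turns:
        lines.append(f"{USER_MARKER} {user_msg}")
        lines.append(f"{assistant_marker} {assistant_msg}")
    return "\n".join(lines)
```

Sampling once per conversation (rather than per turn) keeps each chat internally consistent while still exposing the model to varied assistant identities across the dataset.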
I think the EOS tokens have something to do with the change to "Fast" tokenizers in a recent release. I spent a lot of time getting it right, and I guess they broke it with fast tokenizers 🤬
Haha, yes. It’s always the “upgrades” that break everything. 🙄