Comparison with 0.1

#3
by ChuckMcSneed - opened

Nice model, definitely an upgrade.
Compared to 0.1, 0.5 has:

  • Far fewer GPTisms
  • Much stronger bias towards roleplay (both a plus and a minus: on an empty context it doesn't like to answer questions very much and tends to drift into roleplay)
  • Stricter adherence to the prompt format (in 0.1 I could get away with using Alpaca format; here I have to use Llama chat)
  • Stricter adherence to the system prompt; here it actually feels like it matters

Sadly, it still loses minor details deeper into the context.
Keep up the good work!

Thanks for the feedback & encouragement!

Some thoughts:

  • I tried to include more RP, but it turns out it isn't great at that either (not as good as story-telling), and as you noticed, it reduced the default chat/query capability. You can counter this by explaining the context in the first prompt (or by reminding the AI what the task is every time you ask). In v1, I'll probably try to stick all this in the system prompt so people can choose what sort of mode they want, and try Meta's Ghost Attention trick to get the model to pay more attention to the system prompt.
  • Stricter adherence to the prompt format is a necessary thing IMO. It means there is a lot of training and it is working (assuming you don't want a model that talks like base Llama). IMO, finding better performance on different prompts is a sign that the model was not trained for the application it is being used for (which could be inevitable when merging). Same for needing to use higher temperatures. As the model improves, I'd like to see people being able to use lower temperature (or Mirostat Tau) and relying on prompts to get the creativity they want.
  • I can objectively measure the loss of minor details over longer contexts, and it is slightly worse than v0.1. The reduction in ChatGPTisms came from mildly overfitting on the story-writing data, but that also reduced long-context capability. Strangely, I have working checkpoints that have great long-context detail preservation (better than v0.5 at least), so for v1, I'm experimenting with ways to preserve both.

I'd rather have a model that talks like ChatGPT, but remembers every minor detail, than a model that forgets about what happened 10 messages ago, but has a great, organic writing style.

I kinda went away from that and wanted the different style and lack of censorship, because I already have ChatGPT for the rest... It's not just a style of writing, but also a form of intelligence as it needs to embellish and expand on your instructions in creative ways.

It could just be I'm trying to do too much in 70B, and really I should be making a story-writing model, an RP model and an analyzing "work" model (that could sound like ChatGPT, assuming that's what you meant). Let's see... EDIT: Or maybe you meant an RP or story-writing model that can write like ChatGPT (IMO, badly), but nevertheless keeps context? That might actually be a bit easier to train.

I'd still rather try to do it all in one model.

I did mean that it is better to have a model that has a shitty writing style (ChatGPT), but keeps context. So keeping track of things should be the priority; a beautiful writing style can come later.

That new mystery 70B 32K model, Miqu, doesn't need to stick to the format to produce good outputs. I can throw Llama chat or Alpaca at it and it just works. Overfitting on the format is probably not the right way.

Lols on the potential backstories!

But seriously, I am wondering how the heck they trained the 32K in. Did they even say what kind of rope scaling they used?
EDIT: It's base scaling like codellama (theta = 1000000), more info here. Now that is interesting, because I've been having doubts about linear rope for a while now. I just need to burn $5K or so to create a new base model (or maybe use the new Codellama 70B as a base).
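For reference, base (ABF) scaling is just a matter of raising the rope theta before doing the long-context training. A minimal sketch of what that looks like with a transformers config (the checkpoint name and 32K target here are placeholders, not Miqu's or anyone's actual recipe):

```python
# Minimal sketch of ABF-style context extension: raise rope_theta from
# Llama-2's default of 10000 to 1000000, as CodeLlama did.
# The checkpoint name and 32K target are illustrative placeholders.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-2-70b-hf")
config.rope_theta = 1_000_000           # base-frequency (ABF) scaling
config.max_position_embeddings = 32768  # target context window

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", config=config
)
# Changing the config alone is not enough; the model still needs
# long-context training at this theta for the extension to hold up.
```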

Every general test I've run shows it is objectively worse if you don't fit on the prompt format, i.e., the sample efficiency goes down, assuming you're not repeating data. This is not just for chat format, but any kind of standardized input (again, as long as the actual content of the chat is changing each sample).

But... if you have a lot of data (and compute), then maybe it doesn't matter at all. A good example of this is base Llama2: it can chat, and it doesn't care about chat format. You just need to start it off on conversation round #3 or something, so it has examples of what format to continue with. And therefore, all derived models (including Aurelian) can do the same. You don't need to do anything special for a model to follow a chat format if it's just a chat continuation. It's really just how you get the model started that's the trouble, otherwise any Llama2 model can continue a chat in any format. Maybe with LORA (or fudged LORA like LongLORA/ReLora/LoftQ) and LIMA dataset sizes, you force a tradeoff for the start of the chat.
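For example (a made-up snippet; the names and format are arbitrary), you just hand the base model a history that is already a couple of rounds deep and let it continue:

USER: Hi, who are you?
BOT: I'm an assistant. What do you need?
USER: Summarize our plan so far in one sentence.
BOT: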

It's possible that if you have a crap ton of compute, the prompt format variation actually becomes a kind of augmentation, and you're well beyond LIMA territory. Or you're not trying to do much to change the base Llama style, which is very ChatGPT-like (and you don't need a new model for that; the LongLORA base itself will chat in any format up to 32K if you multi-shot/continue conversations with it). So maybe the Miqu dev had that much compute... but barely enough bandwidth?

Eh, or maybe it's just a merge (and I still want to know how they merged it). That would explain more of the mystery.

I'd like to know if the dev had some secret method. I'm spending 1000s of $ on all of this, and getting no money back, it's all just community service. But 32K 70B training isn't cheap, and I'm already fighting tooth and nail for sample efficiency. At that level, supporting different prompt formats (without merging), turns very expensive.

I've noticed that the same thing (different format -> good output) happens on Mixtral-instruct, by the way. Seems like they indeed have some kind of secret technique for training. Let's hope that they'll make it public in the future.

Maybe the tradeoff with prompt formats I mentioned is a function of not having enough capacity in the linear MLP layers. MoE has more (non-attention) MLP capacity, relatively speaking. That might explain why some of the interleaved frankenmerges (like Goliath) work well, and also why they might be overkill (they repeat the attention too, which is maybe not needed).

Well, I will dig into Miqu and see what it's made of (or I'm sure someone will), whether it is MoE or not.

I'm secretly hoping it looks like Llama, because that tells me I can train an ABF model and try to replicate what Miqu does, but it probably won't be that simple...

My initial peek tells me Miqu is a Llama2 fine-tune. Same tokenizer and architecture. Of course, it could still be a Mistral 70b leak/teaser if Mistral also copied Llama2 closely, but that would mean it is not something special like MoE? I didn't do weight comparisons.

Because they used ABF scaling, and because it came up before Codellama 70b, they probably did their own long-context training. Which is interesting, that's not a simple matter.

@ChuckMcSneed I've been trying different prompting formats after you pointed out that it makes a difference. There is definitely a bias there. Base Llama takes very easily to the Alpaca format, and has far fewer repetitions even with no repetition penalty and deterministic sampling. It may be that Llama2 was trained on some Alpaca datasets in pre-training (it was the most widespread and earliest of the fine-tuning/LORA approaches). It has problems of its own though, like inconsistent tokenization.

Vicuna format also had some out-of-the-box compatibility, but not as good as Alpaca, and the model got confused when dealing with, e.g., movie scripts, RP logs, or any other text that had a NAME: format.

Llama chat format was one of the absolute worst and the model had a hard time figuring out what was happening.

Any format with a lot of <s> and </s> seemed to have the potential to confuse the model. But when used sparingly, it can result in consistent tokenization.

Now all of this can be fine-tuned away, but I'd rather not take fights I don't have to... Based on my experimentation, something like this worked quite well with zero fine-tuning (multi-shot with examples):

<s>### System:
System message</s>

<s>### Instruction:
Do this</s>

<s>### Response:
Okay</s>

<s>### Instruction:
Also do this</s>

<s>### Response:

It worked better than Alpaca because there is no \n### which tokenizes in multiple ways (whereas \n<s>### always tokenizes the same way).
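You can see the inconsistency by comparing per-message vs. whole-history tokenization (a rough sketch, assuming access to any Llama-2 tokenizer on the Hub; they all share the same tokenizer):

```python
# Sketch: compare tokenizing the conversation as one string vs. message by
# message. With vanilla Alpaca, the "\n###" boundary can split differently
# in the two cases; prefixing each round with <s> pins the break point.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

whole = tok.encode("Okay\n### Instruction:", add_special_tokens=False)
parts = (tok.encode("Okay", add_special_tokens=False)
         + tok.encode("\n### Instruction:", add_special_tokens=False))
print(whole, parts)  # the token ids around the "\n###" boundary may not match
```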

Anyway, thanks for bringing up the topic of prompt format. I guess I'm still saying that there is such a thing as a good format, probably because of Meta's pre-training, but Llama-chat was likely a bad pick.

Any format with a lot of <s> and </s> seemed to have the potential to confuse the model.

You say that and then proceed to introduce a format with <s> and </s>, needlessly overcomplicating something that already works without any problems. Sometimes it's better to just keep things simple and not worry about tokenization. None of the Alpaca-trained models that I tested had any problems with formatting.

There may not be such a thing as a good format, but there definitely are bad, awful formats. Here's my tier list:
S-tier:

  • Omnivorous: Feed a simple chat with example conversation and it will work. (Falcon-180b-chat, most base models, partially Goliath)

A-tier:

  • Alpaca: It just works.

C-tier:

  • Vicuna: Okay-ish.

D-tier:

  • Llama-chat: Don't like it too much, overcomplicates stuff, very unnatural.
  • Tulu: Having <> eats tokens and confuses the model, might as well use vicuna.

F-tier:

  • Chat-ML: 🤮🤮🤮 More like chat-MLEH. Godawful format that nobody should use.

Have you still been trying to get longLORA to perform optimally? I think that at this point it may be easier to remove GPT-isms from Miqu by training, or to find a way to transfer training to it, than to endure that giant performance hit caused by longLORA. There have been attempts to do both: alchemonaut/QuartetAnemoi-70B-t0.0001 (transfer of training, haven't tested myself) and Undi95/Miqu-70B-Alpaca-DPO (dealignment, imo it failed). Do not bother with Qwen though; that model was trained on benchmarks and doesn't hold up in practice (I'm looking at you, Smaug.), and it also eats a shitton of memory.

You say that and then proceed to introduce a format with <s> and </s>, needlessly overcomplicating something that already works without any problems. Sometimes it's better to just keep things simple and not worry about tokenization. None of the Alpaca-trained models that I tested had any problems with formatting.

Well, I did say you don't want to use a 'lot of' them :D

I should clarify, in my testing above, I was evaluating:

  • Does the model (without specific fine-tuning) respond in the same format when prompted in that format, with 4-5 examples in the history?
  • What is the probability of n-gram repeats with repetition penalty = 1? (A rough sketch of this measurement is below.) Llama repeats on everything at some point, but for some reason, less on Alpaca.

All of this was done with ~10K context and with different methods of extending context (and any related fine-tuning was unsupervised, without a prompt format).
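Roughly, that repeat measurement looks like this (a simplified sketch of the idea, not the exact script):

```python
# Simplified sketch: fraction of n-grams in a generated continuation that
# already occurred earlier in the same sequence (generation done with
# repetition penalty = 1, i.e., no penalty).
def repeated_ngram_rate(token_ids, n=4):
    seen, repeats, total = set(), 0, 0
    for i in range(len(token_ids) - n + 1):
        gram = tuple(token_ids[i:i + n])
        if gram in seen:
            repeats += 1
        seen.add(gram)
        total += 1
    return repeats / total if total else 0.0
```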

Alpaca was consistently the best performer.

That said, Alpaca tokenizes weirdly. My attempt at adding <s> was to remove that problem. You're right, in that I observed no difference in practice between normal Alpaca and the modified version I posted, with my above criteria. It just tokenizes more consistently, but Llama doesn't seem to care in practice.

Variations of my suggestion don't work as well. For example, this one is not good: <s>###Instruction:</s> Message. I don't know why moving the </s> after the message works (or why removing the EOS completely also works). I introduced it because ### is sometimes tokenized with the previous prompt and sometimes not, depending on whether the backend server (e.g., Oobabooga) tokenizes each message separately (this works well) or tokenizes the entire context history as a single chunk of text (this has issues) with vanilla Alpaca. The <s> basically forces the tokenizer to break between conversation rounds. Maybe this is a psychological solution to a problem that doesn't need solving.

I agree with your tier list for chat prompts generally (other than the S-tier which is more a model statement). If vanilla Alpaca practically has no problems, maybe we just go with it, even if it tokenizes in strange ways.

Have you still been trying to get longlora to perform optimally?

Yes, and longLORA basically does not do well no matter what I try. I don't know if it is the linear rope scaling, the attention calculation approximation used by the original authors, or some side effect of training the embed/norm layers along with the LORA.

On the other hand, my experiments with ABF (theta) scaling seem better. No theory, just lots of experiments and GPU compute to find a good method. I will release a base model and context-extension LORA with theta scaling soon, to test and compare with longLORA if you're interested in helping.

I think that at this point it may be easier to remove GPT-isms from Miqu by training, or to find a way to transfer training to it, than to endure that giant performance hit caused by longLORA.

I'm still focusing on getting an alternative working from scratch that doesn't take the hit longLORA does. If Miqu did it, there must be some way. We know Miqu (and Meta, internally with their unreleased general-purpose long-context model) used ABF, so I started with that. That said, would love to see people make transfer training work!

There have been attempts to do both: alchemonaut/QuartetAnemoi-70B-t0.0001 (transfer of training, haven't tested myself) and Undi95/Miqu-70B-Alpaca-DPO (dealignment, imo it failed).

Aurelian v0.5 was also DPO de-aligned, and I'm pretty sure that's what eroded its long-context instruction following. I have earlier checkpoints without the DPO that do better, but sound more like ChatGPT. However, in my ABF experiments I injected de-alignment samples at the pre-training stage, and this seems to work (still need to experiment more). After SFT, the model remembers the pre-trained base (with aligned SFT samples removed as well), and I don't seem to need the DPO for de-alignment.

Now this method is for building a model without merges (like Aurelian), but I would like to test whether it also works if you merge the pre-trained LORA onto a post-merged or SFT model like WinterGoddess, lzlv, etc. It would be nice if we could inject both long-context and de-alignment with a LORA (but my guess is that the SFT alignment in between will still be too strong).

Do not bother with Qwen though; that model was trained on benchmarks and doesn't hold up in practice (I'm looking at you, Smaug.), and it also eats a shitton of memory.

Everyone tells me this. Pity, I want a good 70B alternative to Llama2. Hopefully Llama3, Yi or Mixtral/Mistral in the future.

Looking closer: yes, Alpaca tokenizes weirdly, but Llama2 has no problem with it. So it's a problem, but Meta seems to have solved it for us, probably with more compute than we could throw at a LORA.
