Some questions and potential suggestions

#3 by polymer - opened

It's nice seeing all the developments chug along, with no stopping in sight! Thank you for your continued work in training these models. Being able to see and experience what's on the frontier for LLMs first-hand is really valuable. I had a few questions about the training details for this (very coherent) model, with potential areas for improvement:

As per the original Orca paper, their training used GPT-3.5 completions as an intermediate target before moving on to GPT-4 completions, ensuring the model finishes training on the higher-quality (GPT-4) data. Relevant excerpt from the paper:

We first train Orca on FLAN-5M (ChatGPT augmentations), followed by second stage of training on FLAN-1M (GPT-4 augmentations). Essentially, we leverage ChatGPT as intermediate teacher assistant for two reasons.

• Capacity gap: Orca with 13B parameters is many times smaller than GPT-4 (size undisclosed). Leveraging an intermediate teacher with reduced gap in capabilities, in this case ChatGPT, has been shown to improve imitation learning performance for smaller students in knowledge distillation [15]. This can be viewed as a form of progressive learning or curriculum learning, where the student first learns from easier examples, followed by harder ones: with the assumption that longer responses are difficult to mimic than shorter ones, along with improved reasoning and step-by-step explanation from a larger teacher . . .

From what I could gather, the training for dolphin-2.1 was performed on a mix of data in no particular order, which suggests the model learns from that mixed-quality target dataset all the way to training completion. If so, future training could benefit from a more ordered learning process, for instance: 3 epochs of GPT-3.5 completions (with shuffling on each epoch and a suitable LR scheduler) -> 5 epochs of GPT-4 completions, into which the airoboros data (also GPT-4 generated) could be mixed.

Separating training into these stages, with each stage running its own schedule that ends in an LR cool-down, can also act much like cosine annealing: it refreshes the training and provides a chance to escape local minima (and thereby generalize better). It does add more hyperparameters to tune, since the number of epochs and the peak LR for each stage become variable.
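
To make this concrete, here is a rough sketch of the two-stage setup I have in mind, assuming a generic Hugging Face Trainer; the dataset handles, epoch counts, and learning rates are placeholders rather than dolphin's actual configuration:

```python
# Two-stage curriculum sketch: train on GPT-3.5 data first, then on GPT-4 data.
# `model`, `gpt35_dataset`, and `gpt4_plus_airoboros_dataset` are assumed to be
# defined elsewhere (and already tokenized); numbers are illustrative only.
from transformers import Trainer, TrainingArguments

stages = [
    # (train_dataset, epochs, peak LR)
    (gpt35_dataset, 3, 2e-5),                # stage 1: GPT-3.5 completions
    (gpt4_plus_airoboros_dataset, 5, 1e-5),  # stage 2: GPT-4 + airoboros mix
]

for i, (dataset, epochs, peak_lr) in enumerate(stages):
    args = TrainingArguments(
        output_dir=f"out/stage_{i}",
        num_train_epochs=epochs,         # shuffling per epoch is the Trainer default
        learning_rate=peak_lr,
        lr_scheduler_type="cosine",      # each stage gets its own LR cool-down
        warmup_ratio=0.03,
    )
    Trainer(model=model, args=args, train_dataset=dataset).train()
```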

The other question is about the filtering of the two sets of GPT completions: was the deduplication step performed per set, or across the GPT-4 and GPT-3.5 sets together? Given the curriculum-learning intent, deduplication should be performed per set, with the GPT-3.5 and GPT-4 completions treated as completely separate datasets.
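
To spell out what I mean by per-set deduplication, a minimal sketch (the "instruction" field name and exact-match hashing are just illustrative):

```python
# Per-set deduplication: each teacher's completions are deduplicated on their
# own, never against the other set.
import hashlib

def dedupe(samples, key="instruction"):
    seen, kept = set(), []
    for sample in samples:
        digest = hashlib.sha256(sample[key].strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept

gpt35_clean = dedupe(gpt35_samples)  # deduped within the GPT-3.5 set only
gpt4_clean = dedupe(gpt4_samples)    # deduped within the GPT-4 set only
```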

Last is a small question about the formatting. It seems the model outputs for the assistant role always begin with a token containing a space character (logged from llama.cpp, notice token 6880):

'':32000, '':32001, 'ass':489, 'istant':11143, '':13, ' Another':6880, ' lesser':26767, '-':28733, 'known':4717, ' but':562, ' fascinating':23069, ' quote':13658, ' from':477, ' Charles':6427

Was this an intentional decision (as with the default Llama tokenizer behavior of adding a BOS and having the very first token include a space), and should the user input also include a space?
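
One way to check where that space comes from is to tokenize the assistant turn with and without a leading space and compare the pieces (the model id below is an assumption):

```python
# Inspect how the assistant turn tokenizes with and without an explicit space.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ehartford/dolphin-2.1-mistral-7b")  # assumed repo id

for text in ["assistant\nAnother lesser-known", "assistant\n Another lesser-known"]:
    ids = tok(text, add_special_tokens=False).input_ids
    print(repr(text), list(zip(ids, tok.convert_ids_to_tokens(ids))))
# If only the variant with the explicit space yields the "▁Another" piece
# (id 6880 in the log above), the space was most likely present in the
# training targets / prompt template rather than added at generation time.
```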

Cognitive Computations org

Thanks for the insightful comment!

Yeah, this is definitely not Orca anymore; that's why I call it "inspired by Orca".

My attempt to replicate Orca failed: dolphin-1.0's performance did not even come close to the numbers claimed by Microsoft.

Whether that means I did something wrong, or the Microsoft paper was in error, I'm not sure we will ever know.

So I have gotten creative, using dataset strategies that seem to work for others and my own mix.

What I saw in my experiments is that the 2-phase approach didn't make any difference. The outcome is the same whether I mix the data and train it all together or if I train it in two phases.

So I don't bother with that.

I'm using only gpt4 generated data and omitting the gpt3.5 generated data.

I'm also using only about 300k samples of it instead of the whole thing, keeping the longer samples on the assumption that length is a rough proxy for quality.
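
Roughly, the selection amounts to something like this; the file path and column name below are placeholders, not the actual pipeline:

```python
# Keep the ~300k longest responses as a crude quality filter.
from datasets import load_dataset

ds = load_dataset("json", data_files="gpt4_completions.jsonl")["train"]  # placeholder path
ds = ds.map(lambda x: {"resp_len": len(x["response"])})                  # placeholder column name
ds = ds.sort("resp_len", reverse=True).select(range(300_000))            # longest 300k samples
```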

Your note about the space after the assistant role - that's a really good catch. I presume that's a bug in axolotl. I won't retrain the model but I'll update the examples. I'll make sure to bring this to Wing's attention.

Thanks again!

Hmm, interesting. I wonder if the disparity in results could be indicative of undertraining (i.e. the staged optimization didn't provide enough learning for the model to benefit from the curriculum, with the gaps to both GPT-4 and GPT-3.5 remaining too large). Maybe (or hopefully) there's a chance these small models could be better.

I’m surprised to learn that your performant models weren’t using all the data. If the fact that switching up the data mixture/training process didn’t make a significant difference is true, the LR and/or epochs could likely be set higher. Cosine annealing, perhaps with decreasing peak LRs, will definitely fight overfitting at the superficial/textual level (despite the wacky training loss curve, which won’t tell you much) while allowing the slower semantic/generalizable learning to take place for longer.

Thanks for going through my comments, hope your future experiments turn out great as well!

Cognitive Computations org

I'm seeking a partner on the dolphin project with an academic / deeper ML background than mine. If you're interested in collaborating, DM me on Twitter / Discord.

@ehartford hi, I have seen the training args in the files and have a question about the hyper-parameters: the training dataset is ~300k samples, epochs=4, per_device_train_batch=6, gradient accumulation=4, and GPU cards=4, but the global steps=1204. Are these hyper-parameters correct?
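
The naive math I'm comparing against (assuming no sequence packing) is:

```python
# Step count implied by the reported hyper-parameters, one sample per sequence.
samples = 300_000
effective_batch = 6 * 4 * 4                   # per-device batch * grad accum * GPUs = 96
steps_per_epoch = samples // effective_batch  # 3125
total_steps = steps_per_epoch * 4             # 12500, far from the reported 1204
```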

I am trying to reproduce your results, but I find that the ARC and HellaSwag metrics decrease significantly during training.

Hope to get your reply~
Thanks
