Some suggestions for optimization

#3
by polymer - opened

Just a few ideas:

Some of the datasets here carry information content, such as scientific knowledge or logical progressions, while others lean more toward writing style and the connections between parts of the context. The model should benefit from per-dataset epoch adaptation, e.g. running the information-heavy datasets through the model for more iterations overall.
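Purely as an illustration of that per-dataset epoch idea (this is nothing from axolotl; the dataset names and repeat counts are made up), a minimal sketch using the Hugging Face datasets library, where a knowledge-heavy set is oversampled relative to a style-oriented one:

```python
from datasets import Dataset, concatenate_datasets

def mix_with_repeats(parts, seed=42):
    """parts: list of (dataset, repeat_count) pairs; repeating a dataset oversamples it."""
    expanded = []
    for ds, repeats in parts:
        expanded.extend([ds] * repeats)
    return concatenate_datasets(expanded).shuffle(seed=seed)

# Hypothetical usage: the knowledge-heavy set is effectively seen three times
# per pass over the mixture, the style-oriented set only once.
science = Dataset.from_dict({"text": ["fact 1", "fact 2"]})
style   = Dataset.from_dict({"text": ["story a", "story b"]})
mixed   = mix_with_repeats([(science, 3), (style, 1)])
```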

Judging by the clear dips in the training loss graph, I believe the epochs were run over a dataset that was scrambled once up front. Re-scrambling the dataset after every epoch is worth looking into for a better approximation of the global minimum: otherwise the model is, in theory, biased in a particular direction depending on how far through the current epoch it is, and training suffers as it drifts away from and back towards the optimum. This matters at our fine-tune dataset sizes, since the effect is most pronounced with a relatively small amount of data.
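A minimal sketch of per-epoch reshuffling, assuming a plain PyTorch DataLoader rather than the project's actual pipeline; with shuffle=True the sampler draws a fresh permutation each time the loader is iterated:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))                 # stand-in for the packed training set
loader = DataLoader(dataset, batch_size=2, shuffle=True)  # RandomSampler reshuffles on every pass

for epoch in range(3):
    order = [batch[0].tolist() for batch in loader]       # each new iteration = a new permutation
    print(f"epoch {epoch}: {order}")
```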

I do have some more specific thoughts on getting better performance out of the data we have access to, if that sounds interesting to you or your team, though they would require some more tinkering.

Great work nonetheless!

Open Access AI Collective org

Thanks for the feedback! Re-shuffling after each epoch is a bit labor intensive, but I did incorporate your feedback: I stopped it at 3 of 4 epochs and I'm starting again at a lower peak LR with a reshuffled dataset. I would definitely be interested in whatever other thoughts you have for incremental improvements. Thanks!

Wow, you sure do move things along quite quickly!

After a bit of morning coffee: the loss graph itself may well end up looking similar after the revision, lol. But the rationale of preventing verbatim ordering bias should still hold, so that's not an issue. It's certainly worth reshuffling despite the extra compute.

The other ideas I had were about training more on the underlying information content of these small datasets: extracting it in non-verbatim form to better target the models' internal reasoning processes, which generalize better. Bit tight on time at the moment, so I'll add some details later.

Open Access AI Collective org

thanks! one quick question: after restarting, the eval loss is on the uptick. Is this normal? Wait it out? https://wandb.ai/wing-lian/manticore-13b. Btw, you on discord?

If you are referring to manticore pre-alpha <discarded>, the peak learning rate was set lower, so it's only reasonable to see higher eval loss initially (fewer changes have been made, of course!). A lower learning rate helps later loss because the (longer) training converges more steadily, but only if the previous rate was set too high to converge well in the first place. Diminishing returns are definitely expected: set it too low and you waste a lot of compute on the longer training needed, so there's some room for experimentation. I'm on and off all the time, so no Discord, but I'm happy to discuss here since I see messages when I check my email.
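To make the peak-LR point concrete, here's a small, purely illustrative comparison of two cosine-with-warmup schedules; the step counts and learning rates are arbitrary placeholders, not the run's actual settings:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

def lr_curve(peak_lr, total_steps=1000, warmup_steps=50):
    """Return the learning rate at every step of a warmup + cosine-decay schedule."""
    param = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.AdamW([param], lr=peak_lr)
    sched = get_cosine_schedule_with_warmup(opt, warmup_steps, total_steps)
    lrs = []
    for _ in range(total_steps):
        lrs.append(sched.get_last_lr()[0])
        opt.step()
        sched.step()
    return lrs

high = lr_curve(2e-4)   # hypothetical "too hot" peak: bigger early updates, noisier convergence
low  = lr_curve(5e-5)   # hypothetical lower peak: smaller early changes (and higher early loss)
print(high[100], low[100])
```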

You must have checked this already, but I would also make sure the shuffling doesn't mix different datasets together with no separator in the middle. I mean, you wouldn't want a Vicuna conversation with some MMLU nonsense being said by the user, right?

Open Access AI Collective org

I do shuffle the datasets too. The various examples from different datasets would always be separated by a BOS and EOS token, right? I've always shuffled them (https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/data.py#L141-L143) before packing them into groups of 2048 tokens (similar to what stackllama does https://huggingface.co/blog/stackllama#supervised-fine-tuning)
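For reference, a toy version of that shuffle-then-pack step; this is not the linked axolotl code, just the idea, with token ids 1 and 2 standing in for LLaMA's <s> and </s>:

```python
import random

BOS_ID, EOS_ID = 1, 2   # LLaMA ids for <s> and </s>

def shuffle_and_pack(tokenized_examples, block_size=2048, seed=0):
    """Shuffle example order, wrap each example in BOS/EOS, then cut fixed-size blocks."""
    examples = list(tokenized_examples)
    random.Random(seed).shuffle(examples)
    stream = []
    for ids in examples:
        stream.extend([BOS_ID] + ids + [EOS_ID])
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

# Tiny demo with block_size=4 so the output is visible; training would use 2048.
blocks = shuffle_and_pack([[100, 101], [200, 201, 202], [300]], block_size=4)
print(blocks)
```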

Ah, the EOS-then-BOS pair is exactly the kind of split I was talking about. That should be okay; I can't see any other issues arising in training for now. Although . . . I do wonder how the lack of an EOS before the BOS affects inference in LLaMA implementations. This is what llama.cpp gives to the model for a given prompt (the token with ID 1 is BOS):

     1 -> ''
   319 -> ' A'
 13563 -> ' chat'
  1546 -> ' between'
   263 -> ' a'
 12758 -> ' curious'
  1404 -> ' user'
   322 -> ' and'
   385 -> ' an'
 23116 -> ' artificial'

Didn't see any new runs in wandb, but has training begun already with per-epoch shuffling added?

Open Access AI Collective org
edited May 19, 2023

if you go up to the project level you can see the shuffled one. https://wandb.ai/wing-lian/manticore-13b/runs/tcspiljt?workspace=user-wing-lian

here's a screenshot of what the tokens look like; the red ones are masked out of the labels (ignored), and the values that follow are label, attention_mask, and input iirc

[Screenshot: Screenshot 2023-05-19 at 4.47.18 PM.png]

Open Access AI Collective org

also, I think I'm going to call it quits on that 2nd run. it's not looking like it's going to improve

Hmm. Yeah, it is a bit strange how BOS and EOS are split in LLaMA, whereas GPT (I think) doesn't have a separate BOS to cause a headache.

I have a hunch that dropping the trailing EOS would benefit generation, since it leaves a single, unified separator between examples. As it stands, we have no way of guaranteeing the transformer only attends to the beginning <s> token in the context before generating. During training the boundary is most often </s><s>, followed by the start of the prompt, so for all we know good generation could depend on the </s> being there too. Yet no inference implementation out there prepends a </s> to the prompt . . . Something to think about, I guess.
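A token-level toy comparison of the two packing variants I mean (1 = <s>/BOS, 2 = </s>/EOS in LLaMA's vocabulary); nothing here is axolotl code, just the shape of the boundary:

```python
BOS, EOS = 1, 2

def pack_pair(example_a, example_b, keep_trailing_eos=True):
    if keep_trailing_eos:
        # current scheme: the boundary the model sees in training is ... </s> <s> ...
        return [BOS] + example_a + [EOS] + [BOS] + example_b + [EOS]
    # hypothetical alternative: drop the trailing </s>, leaving a lone <s> as the
    # separator, which matches the single BOS that llama.cpp prepends at inference
    return [BOS] + example_a + [BOS] + example_b

print(pack_pair([319, 13563], [1404, 322]))                          # [1, 319, 13563, 2, 1, 1404, 322, 2]
print(pack_pair([319, 13563], [1404, 322], keep_trailing_eos=False)) # [1, 319, 13563, 1, 1404, 322]
```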

Please read through the entire paragraph:

For the scrambled run, I thought you meant the training was going to restart from the beginning! My bad, damn. The suggestion was meant to combat overfitting behavior and let the more generalized parts of the model catch up, so it might not help much with a model that's already halfway through. But … is eval using the exact same (or a very similar) dataset to train? If so, higher eval loss might actually indicate less overfitting. The loss is only a proxy for how "well" the model is doing, and the lowest loss is not actually the goal. Please do share more details about the eval dataset; the fourth epoch could still be an improvement and very usable if this really is the case!
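A rough sketch of how one might check for verbatim train/eval overlap (exact-match hashing; the example lists are placeholders for the real datasets):

```python
import hashlib

def fingerprints(texts):
    """Hash normalized example text for an exact-match comparison."""
    return {hashlib.sha256(t.strip().encode("utf-8")).hexdigest() for t in texts}

train_texts = ["example 1 ...", "example 2 ..."]   # placeholder: real training examples go here
eval_texts  = ["example 2 ...", "example 3 ..."]   # placeholder: real eval examples go here

overlap = fingerprints(train_texts) & fingerprints(eval_texts)
print(f"{len(overlap)} eval examples appear verbatim in the training data")
```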

It also appears the </s> token is still present in text form somewhere. I think we should check out some of the Vicuna examples packed into the context: LocalLLaMA Reddit

Edit: wording.

The “spiky” loss is then likely the model jumping in and out of a slight overfit. A nice-looking loss curve wouldn't actually indicate good, generalized performance: perfect memorization would give you zero training loss, but that's not what you want …
