Pythia-160m vs Pythia-160m-v0

#3 opened by prshnthrv

Hi,
I evaluated Pythia-160m and Pythia-160m-v0 on the Pile test set (group 0) and see perplexity (PPL) scores of 12.92 (v1) vs. 11.80 (v0).
Also, looking at some statistics of the output logits on a portion of the Pile training data, the mean logit is 0.0909 (v1) vs. 0.0157 (v0).
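
For context, here is a minimal sketch of the kind of comparison I mean (not my exact evaluation script; the `eval_model` helper and the way the Pile shard is loaded are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def eval_model(model_name, texts, device="cpu"):
    """Rough per-token perplexity and mean logit over a list of documents."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    total_nll, total_tokens, logit_sum, logit_count = 0.0, 0, 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt",
                      truncation=True, max_length=2048).input_ids.to(device)
            out = model(ids, labels=ids)
            n = ids.numel() - 1                       # tokens that receive a loss
            total_nll += out.loss.item() * n
            total_tokens += n
            logit_sum += out.logits.float().sum().item()
            logit_count += out.logits.numel()
    ppl = torch.exp(torch.tensor(total_nll / total_tokens)).item()
    return ppl, logit_sum / logit_count

# texts = [...]  # documents from Pile test group 0, loaded separately
# for name in ("EleutherAI/pythia-160m", "EleutherAI/pythia-160m-v0"):
#     print(name, eval_model(name, texts))
```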

I noticed the following in the changelog of the EleutherAI/pythia GitHub repo:

Changelog
[April 3, 2023] We have released a new version of all Pythia models, with the following changes to our training procedure:

  1. All model sizes are now trained with a uniform batch size of 2M tokens. Previously, the models of size 160M, 410M, and 1.4B parameters were trained with batch sizes of 4M tokens.
  2. We added checkpoints at initialization (step 0) and at steps {1, 2, 4, 8, 16, 32, 64, 128, 256, 512}, in addition to every 1000 training steps.
  3. Flash Attention was used in the new retrained suite. Empirically, this seems to have affected the dynamic range of model outputs in some cases, which we are investigating further.
  4. We remedied a minor inconsistency that existed in the original suite: all models of size 2.8B parameters or smaller had a learning rate (LR) schedule which decayed to a minimum LR of 10% of the starting LR, but the 6.9B and 12B models all used an LR schedule which decayed to a minimum LR of 0. In the redone training runs, we rectified this inconsistency: all models were now trained with the LR decaying to a minimum of 0.1× their maximum LR.
  5. The new EleutherAI/pythia-1b is trained in bf16, because in fp16 the model became corrupted due to loss spikes late in training.
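
As an aside, here is a minimal sketch of how I read item (4), ignoring warmup and using purely illustrative values for the maximum LR and step count (not the actual Pythia hyperparameters):

```python
import math

# Cosine decay to a floor of 0.1x the maximum LR, as described in item (4).
# max_steps and max_lr below are illustrative, not the real Pythia settings.
def lr_at(step, max_steps, max_lr, min_ratio=0.1):
    cos = 0.5 * (1.0 + math.cos(math.pi * step / max_steps))
    return max_lr * (min_ratio + (1.0 - min_ratio) * cos)

print(lr_at(0, 100_000, 6e-4))         # max LR at the start
print(lr_at(100_000, 100_000, 6e-4))   # 0.1x max LR at the end (6e-5)
```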

Could item (3) above be the cause of this? Could you please elaborate on the issue mentioned in (3)?
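
For what it's worth, here is a minimal sketch of how one could check whether the attention kernel alone shifts the logit statistics at inference time (assuming a transformers version where GPT-NeoX supports the `sdpa` attention implementation; this only probes numerical differences between kernels and cannot reproduce any training-time effect of Flash Attention):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-160m"
tok = AutoTokenizer.from_pretrained(name)
ids = tok("The Pile is an 825 GiB language modelling dataset.",
          return_tensors="pt").input_ids

# Run the same checkpoint through two attention implementations.
logits = {}
for impl in ("eager", "sdpa"):  # "flash_attention_2" also possible if installed
    model = AutoModelForCausalLM.from_pretrained(
        name, attn_implementation=impl, torch_dtype=torch.float32
    ).eval()
    with torch.no_grad():
        logits[impl] = model(ids).logits

print("mean logit (eager):", logits["eager"].mean().item())
print("mean logit (sdpa): ", logits["sdpa"].mean().item())
print("max |diff| between kernels:",
      (logits["eager"] - logits["sdpa"]).abs().max().item())
```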
