Thank you.
Thanks so much for this. Do you have any plans to adjust the 70B base model?
No problem! Any time! For 70B, we actually applied the same script and did not find as many issues with untrained tokens as in the 8B model.
The 8B model has over 200 special tokens that are untrained, with all zeros in their rows of the embedding matrix.
For the 70B model, these same special tokens do have lower values in the embedding matrix, but they aren't all zeros. The values for these special tokens in the 70B model are between 0 and 1e-4.
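To make the difference between the two cases concrete, here is a rough sketch of how such rows could be spotted. It is not the actual script; it uses a toy list-of-lists in place of the real embedding matrix, and the function name, the toy values, and the use of absolute values are all assumptions for illustration. A threshold of 0 catches only the all-zero rows (the 8B case), while a threshold of 1e-4 also catches the tiny non-zero rows (the 70B case).

```python
def find_untrained_rows(embeddings, threshold=0.0):
    """Return indices of rows whose largest absolute value is <= threshold.

    `embeddings` is a list of rows (one row per token). This is a
    hypothetical helper for illustration, not the script from this thread.
    """
    return [
        i for i, row in enumerate(embeddings)
        if max(abs(v) for v in row) <= threshold
    ]


# Toy 4-token "embedding matrix" with made-up values.
toy = [
    [0.12, -0.30, 0.05],   # looks trained
    [0.0, 0.0, 0.0],       # all zeros, like the 8B special tokens
    [5e-5, -2e-5, 1e-5],   # tiny but non-zero, like the 70B special tokens
    [-0.40, 0.22, 0.10],   # looks trained
]

print(find_untrained_rows(toy))        # [1]
print(find_untrained_rows(toy, 1e-4))  # [1, 2]
```

On a real model the same check would presumably be done with tensor operations on the input embedding weights rather than Python lists.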
Are you running into NaN gradients or exploding gradients during fine-tuning of the 70B model? If so, I can apply the same script, but it may do more harm than good if I get rid of these non-zero values (especially if they don't cause problems during fine-tuning).
(edited for grammar)
Thanks, that’s reassuring to hear. Yes, I believe it’s actually 250 untrained reserved tokens! I’ve only tried training the 8B so far, intending to train a QLoRA on the 70B later this week. I’ll let you know if I encounter issues.
Thanks again for sharing this fix for the 8B.
As requested, I made the 70B model https://huggingface.co/astronomer/Llama-3-70B-Special-Tokens-Adjusted. I don't know how useful this is, though. Instead of looking for all-zero rows, I thresholded on a max value, which yielded a similar set of "undertrained" tokens. I would say that if you don't have a problem fine-tuning the base model directly, then don't use this, since setting the undertrained tokens to the mean still isn't ideal.
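For anyone curious what "threshold on a max value, then set to the mean" might look like, here is a minimal sketch on a toy list-of-lists. It is not the script used for the model above. The function name and the 1e-4 default are hypothetical, and taking the mean over only the remaining (non-undertrained) rows is an assumption, since which mean is used isn't spelled out here.

```python
def replace_undertrained_with_mean(embeddings, threshold=1e-4):
    """Replace rows whose max absolute value is <= threshold with a mean row.

    Assumption: the mean is taken over the rows that are NOT flagged.
    Returns (new_embeddings, flagged_indices); the input is not modified.
    """
    flagged = {
        i for i, row in enumerate(embeddings)
        if max(abs(v) for v in row) <= threshold
    }
    trained = [row for i, row in enumerate(embeddings) if i not in flagged]
    if not flagged or not trained:
        # Nothing to fix, or nothing to average over: return a plain copy.
        return [list(row) for row in embeddings], sorted(flagged)

    dim = len(trained[0])
    mean_row = [sum(row[d] for row in trained) / len(trained) for d in range(dim)]
    fixed = [
        list(mean_row) if i in flagged else list(row)
        for i, row in enumerate(embeddings)
    ]
    return fixed, sorted(flagged)


# Toy example with made-up values.
toy = [
    [1.0, 2.0],    # trained
    [0.0, 0.0],    # all zeros
    [5e-5, 0.0],   # below the threshold, but not all zeros
    [3.0, 4.0],    # trained
]
fixed, flagged = replace_undertrained_with_mean(toy)
print(flagged)   # [1, 2]
print(fixed[1])  # [2.0, 3.0]
```

As noted above, the mean row is only a stand-in for a properly trained embedding, so this is a workaround and not a real fix.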
Thank you. It’ll be interesting to run some comparisons.