bfloat16 vs float16

#45
by Arkea - opened

Hello, I have a technical question. Why is the 176b version of the model in bfloat16 while the other models (3b, 7b) are in float16? Does anyone know the reason for this choice? Also, is bfloat16 really the relevant type here? It keeps the same dynamic range as float32, but at the expense of precision. In general the values involved are small, so float16, which has better precision, seems more appropriate, right?
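For reference, the trade-off I mean shows up directly in the dtype metadata PyTorch exposes (a quick check, assuming a recent PyTorch):

```python
import torch

# bfloat16 keeps roughly the float32 exponent range (huge max) but has fewer
# mantissa bits than float16 (larger eps), i.e. more range, less precision.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16} max={info.max:.3e}  eps={info.eps:.3e}")
```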

BigScience Workshop org

bfloat16 / bf16 is better than float16 / fp16 in terms of performance.
When pretraining the BLOOM models we only had V100s to train the smaller models. V100s do not support bf16, so we trained them with fp16. We used the same dtypes for the BLOOMZ models.
When doing inference with BLOOMZ, I would recommend doing it in bf16 or fp32. You can also load the model in fp16 but it will be worse than bf16.
For the smaller models, you can use fp16 or fp32.
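For example, loading in the recommended dtype looks roughly like this (a sketch, assuming transformers, torch and accelerate are installed; the prompt is just a placeholder, and a smaller checkpoint such as bloomz-7b1 can be substituted):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz"  # the 176b model; needs multiple GPUs / offloading

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# bf16 (or fp32) is recommended for the 176b model; fp16 also loads but is slightly worse.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Translate to English: Je t'aime.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```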

Thank you for your response. I now understand better why some models use float16 while the 176b version uses bfloat16. When you say bfloat16 is more performant, do you mean computational performance or accuracy? For models with normalization layers, I find it counterintuitive that bfloat16 would perform better in terms of accuracy, since intuitively I would have expected precision to matter more than dynamic range.

Sorry, I was wrong: it is indeed bf16 that has the better precision. So, to fine-tune the 7b model for example, would you recommend switching to bf16 or keeping the original fp16?

BigScience Workshop org

Hmm I think both work for fine-tuning. I'm not sure which one is better. I would guess that it depends on the number of additional training steps.
Maybe for <<1000 steps it's better to stay in fp16, and for >>1000 steps it's worth switching to bf16.
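Concretely, switching between the two is just a flag either way, e.g. with the transformers Trainer (a sketch; the output path, step count and batch size are placeholders):

```python
from transformers import TrainingArguments

use_bf16 = True  # placeholder: False keeps the run in fp16, matching the original 7b1 weights

args = TrainingArguments(
    output_dir="bloomz-7b1-finetuned",   # hypothetical output path
    max_steps=1000,                      # placeholder number of additional steps
    per_device_train_batch_size=1,
    bf16=use_bf16,                       # requires a bf16-capable GPU (e.g. A100)
    fp16=not use_bf16,
)
```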

Thank you for all this information!

Arkea changed discussion status to closed

Well, in the end, I tried to fine-tune BLOOMZ 3b on a task using bfloat16, but it seems the precision loss from converting the weights makes training very hard...
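For what it's worth, the rounding from casting fp16 weights to bf16 is easy to see (a small illustration, assuming PyTorch; the value is arbitrary):

```python
import torch

# fp16 keeps 10 mantissa bits, bf16 only 7, so the cast can already change
# a weight in its third significant digit.
w_fp16 = torch.tensor(0.1234, dtype=torch.float16)
w_bf16 = w_fp16.to(torch.bfloat16)
print(w_fp16.item(), w_bf16.item(), abs(w_fp16.item() - w_bf16.item()))
```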

Arkea changed discussion status to open
BigScience Workshop org

Is it better with fp16?

Yes, much better. Strangely, loading the model in float32 and using AMP in bf16 works much better (which I don't really understand, because the model weights must also be cast to bf16 and should degrade the model just as much...). Otherwise, to preserve the dynamic range of the gradients, I had thought of loading the model in float32 and using AMP in float16 (which seems more logical to me), which should give performance roughly equivalent to bf16 (I think). I don't really know... If the AMP bf16 training keeps going this well, why not keep it.
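For reference, a sketch of the setup I mean (the dataloader and hyperparameters are placeholders; presumably autocast keeps the fp32 copy as the master weights and only casts the forward/backward compute to bf16, with the optimizer still updating the fp32 weights, which would explain why it behaves better than loading the whole model in bf16):

```python
import torch
from transformers import AutoModelForCausalLM

# fp32 master weights; only the forward/backward compute is autocast to bf16.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-3b", torch_dtype=torch.float32
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

use_bf16 = True                           # set False to try the fp16-autocast variant
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)  # loss scaling only needed for fp16

for batch in dataloader:                  # `dataloader` is a placeholder yielding tokenized batches with labels
    batch = {k: v.cuda() for k, v in batch.items()}
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(**batch).loss        # compute runs in the low-precision dtype
    scaler.scale(loss).backward()         # scaling is a no-op when bf16 is used
    scaler.step(optimizer)                # the update is applied to the fp32 master weights
    scaler.update()
```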

Arkea changed discussion status to closed
