Reasoning behind bfloat instead of float?

#3
by xzuyn - opened

Saying 'title' doesn't really clarify your question... I'm going to assume you mean why people use it in general for AI.

The short version is that bfloat (brain float) can represent the same range of values as a 32-bit float. It does so with lower precision, of course, but from an AI perspective it offers almost the same prediction accuracy as full 32-bit floats at a much lower memory and processing cost. Standard 16-bit floats weren't really designed for AI: they have better precision than bfloats (which is better for most use-cases), but they cover a smaller range than 32-bit floats (which affects prediction quality). It's also much easier to convert from a 32-bit float to a 16-bit bfloat, since you only have to truncate the lower mantissa bits to get the value down to 16 bits.
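Here's a rough NumPy sketch of that truncation, purely to illustrate the idea (real converters may round to nearest rather than truncate):

```python
import numpy as np

# Minimal sketch: float32 is 1 sign bit, 8 exponent bits and 23 mantissa bits;
# bfloat16 keeps the sign, all 8 exponent bits and only the top 7 mantissa
# bits, so the conversion amounts to keeping the upper 16 bits of the pattern.
def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)  # drop the low 16 mantissa bits

vals = np.array([3.14159265, 1e30, -2.5e-12], dtype=np.float32)
print([hex(b) for b in f32_to_bf16_bits(vals)])  # e.g. 0x4049 for ~3.14
```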

This is all pretty theoretical though; I've trained with both myself, and I honestly can't say I've ever noticed a huge difference, but supposedly the increased range of bfloats is better for AI than traditional 16-bit floats.

This image shows how 16-bit floats and bfloats compare to a 32-bit float in terms of how the bits are used. You'll note that bfloat16 has the same number of exponent bits as a 32-bit float; it's basically a 32-bit float with a smaller set of fraction bits:
[Image: comparison of the float32, bfloat16, and float16 numerical formats]

Saying 'title' doesn't really clarify your question... I'm going to assume you mean why people use it in general for AI.

sorry, I should have added more details.

I'm asking why PygmalionAI decided to use bfloat instead of float like nearly all the other models I've tried. Normally we can quantize to ggml straight from the float, but because it's bfloat we have to convert to float first and then to ggml, and that extra step makes the quality worse.

Pygmalion org

Hi there! So, the reason we were using bfloat16 during our training is due to our usage of FSDP for our model parallelism scheme. Using FP16 with FSDP results in overflows/underflows showing up during training, which obviously leads to problems - this is why we had to use bfloat16. It sucks to hear that the quality drops so much with GGML due to the casting required to FP16 before quantization, though, so we're looking into possible ways we can train in FP16 without the required computation being too much for our hardware to handle. No promises we'll be able to find a solution though, sadly. We'll keep working on it! Thanks for the heads-up about the GGML quantization.
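For context, this is roughly what choosing the precision looks like with FSDP's MixedPrecision policy in PyTorch; just an illustrative sketch, not our actual training code:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Illustrative sketch only (not the real training setup): FSDP picks its
# low-precision dtype via a MixedPrecision policy. Swapping these dtypes to
# torch.float16 is where overflows/underflows tend to show up, since fp16 has
# a much narrower exponent range than bf16/fp32.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # parameters held and computed in bf16
    reduce_dtype=torch.bfloat16,  # gradient all-reduce in bf16
    buffer_dtype=torch.bfloat16,  # buffers (e.g. norm stats) in bf16
)

# model = FSDP(model, mixed_precision=bf16_policy)  # needs torch.distributed to be initialized
```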

To me it sounds more like an issue with the ggml conversion tools not supporting bfloat. From a technical standpoint, bfloat is better for AI models. It's not just something some random dude came up with; it actually offers very similar performance to 32-bit AI models at half the cost.

It's also very easy to work with, which leads me to my next point: It's possible to do conversion from bfloat back to 32bit float with zero loss in precision as you're just padding each 16bit bfloat value with zeroes in the mantissa to convert it to 32-bits. So if the issue is that the ggml code can't do its work in the native 16bit bfloat format, then converting it to 32bit floats should lead to zero loss in precision.
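To make the padding point concrete, here's a small NumPy sketch of my own (not the actual ggml code); the bit pattern is just an example value:

```python
import numpy as np

# Widening bf16 -> fp32 just shifts the 16-bit pattern into the upper half of
# a 32-bit word and reinterprets it as float32. No information is lost: the
# low 16 mantissa bits of the resulting float32 are simply zero.
def bf16_bits_to_f32(bits16: np.ndarray) -> np.ndarray:
    return (bits16.astype(np.uint32) << 16).view(np.float32)

bf16_pattern = np.array([0x4049], dtype=np.uint16)  # bf16 encoding of ~3.14
print(bf16_bits_to_f32(bf16_pattern))               # -> [3.140625]
```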

Of course, we're ignoring the fact that you're dropping the model all the way down to 4bit precision anyway, so at that point it's highly unlikely that you're going to see any loss related to the conversion from bfloat to float...

It sucks to hear that the quality drops so much with GGML due to the casting required to FP16 before quantization

I'm unsure of how much the quality degrades. I've just read in passing from concedo (of KoboldCPP) that the bfloat to float step can cause degradation. So right now it's more of an assumption that there is quality loss; I don't think anybody has actually tested for it yet.
[Image: screenshot of the quote from concedo]

It would be nice to see someone test how the perplexity of the original bfloat16 compares with ggml q5_1 or q8_0, and do the same for a range of other models with their original float16 vs ggml q5_1 or q8_0, to see how much using bfloat matters for quality loss when you need to convert to float. To make that last part a little clearer (because I'm bad at being clear): I'm not saying compare the quality of different models, but compare how much their perplexities change when quantizing. Maybe even just a perplexity test of the bfloat weights vs the float weights converted from the bfloat, to see how much that step changes things, would be good enough.
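Something like this is all I mean, comparing the change rather than the absolute numbers (every value here is made up, just to show the shape of the comparison):

```python
import math

# Perplexity is exp(mean negative log-likelihood over the eval text); the
# interesting figure is how much it moves after quantization, not where it
# sits for any particular model. All numbers below are invented placeholders.
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))

nlls_bf16_original = [1.74, 1.81, 1.69, 1.86]  # made-up per-token NLLs
nlls_ggml_q5_1     = [1.76, 1.83, 1.70, 1.88]

delta = perplexity(nlls_ggml_q5_1) - perplexity(nlls_bf16_original)
print(f"perplexity change from quantizing: {delta:+.3f}")
```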

What they should be doing is converting from bf16 to fp32, then quantising from there. As mentioned earlier, the conversion from bf16 to fp32 is lossless since you're literally just padding the mantissa with zeroes. You'd probably get better results since you'd retain the full floating point range that bfloat offers prior to the quantisation. But honestly, at the point that you're squeezing the results down to 4-bits, I highly doubt it would make a difference.

What they should be doing is converting from bf16 to fp32, then quantising from there. You'd probably get better results since you'd retain the full floating point range that bfloat offers. But honestly, at the point that you're squeezing the results down to 4-bits, I highly doubt it would even make a difference.

Yeah, it seems that they are talking about this right now. The above messages were from yesterday, right after the model dropped and we were trying to get it to convert to ggml. Here's a quote of a quote from 10 minutes ago by 0x000011b: "It's possible to do conversion from bfloat back to 32bit float with zero loss in precision as you're just padding each 16bit bfloat value with zeroes in the mantissa to convert it to 32-bits."

So doing a quantized ggml (until llama.cpp supports starting from bfloat) should be done like this: bfloat16 -> float32 -> float32 ggml -> quantized ggml.
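The first arrow can be done with plain PyTorch before touching the llama.cpp tooling. Rough sketch, assuming a single consolidated pytorch_model.bin rather than a sharded checkpoint:

```python
import torch

# Upcast every floating-point tensor in the checkpoint from bfloat16 to
# float32 (lossless), so the usual convert/quantize steps can start from fp32.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
state_dict = {name: t.float() if t.is_floating_point() else t
              for name, t in state_dict.items()}
torch.save(state_dict, "pytorch_model_fp32.bin")
```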

Pygmalion org

Here's a quote of a quote from 10 minutes ago by 0x000011b: "It's possible to do conversion from bfloat back to 32bit float with zero loss in precision as you're just padding each 16bit bfloat value with zeroes in the mantissa to convert it to 32-bits."

Haha, that's a quote straight from @RazielAU above, actually. Indeed, concedo and I were talking about how bf16 -> fp32 -> ggml would likely be better than bf16 -> fp16 -> ggml. Either way, the TL;DR for now is:

  • fp16 was giving me problems with FSDP, but seems fine with DeepSpeed
  • Since there's so much code out there that isn't even aware bf16 exists, I'm probably better off training in (or at least releasing a version in) fp16.
    • ...unless bf16 -> fp32 -> desired format turns out better than fp16 -> desired format. Lots of people are trying different quantization and conversion methods and running perplexity tests, so at the moment I'm just waiting for that info to bubble back up to me before making a proper final decision.

Hahaha, I don't mind. Where's this discussion happening? Just curious to see how it goes.

Pygmalion org

Hahaha, I don't mind. Where's this discussion happening? Just curious to see how it goes.

On the Kobold Discord server. Most of the discussions we've been having about model quality and behavior have been on the koboldcpp channel and on the Pygmalion model thread there - I believe if you Google for "Kobold discord" you should find an invite link.

I'm already on there, just wasn't aware of the discussion...

Found this discussion accidentally. I have added a PR for converting the pytorch model to GGML: https://github.com/ggerganov/llama.cpp/pull/1309

It converts to FP32 by default, so it should be precise ;D

$ ./cmake-build-release/bin/main -m models/metharme-7b/ggml-model-q4_0.bin --threads 8 --prompt $'Manager\'s Persona: Manager I work with in my company.\n<START>\nYou: Sorry, I was late... Don\'t fire me...\n' --ctx_size 1024 --tfs 0.98 --mirostat 2 --mirostat_ent 4 --n_predict 40

main: seed = 1683167819
...
sampling: repeat_last_n = -1, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 0.980000, top_p = 1.000000, typical_p = 1.000000, temp = 1.000000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 4.000000
generate: n_ctx = 1024, n_batch = 512, n_predict = 40, n_keep = 0

 Manager's Persona: Manager I work with in my company.
<START>
You: Sorry, I was late... Don't fire me...
Manager: (shaking his head) What else would I do with you?
You: I mean, I didn't plan to be late. It just happened... I'm sorry...

Nice, thanks for dropping in and letting us know!
