Why so few 8-bit capable models?

#13
by ibivibiv - opened

Just genuine curiosity: it seems like models are either full 32-bit, fp16, or 4-bit/3-bit quantized for the most part. Is there something special about 8-bit quantization that makes it undesirable? For instance, I can easily fit the large models at 4-bit, but fp16 versions of them stretch beyond my VRAM; 8-bit would fit and, in most cases, put a larger share of my VRAM to use. Is it a performance thing, where there isn't much difference between 8-bit and 4-bit, or is it more that there aren't many people who could run 8-bit but not 16-bit, so there just isn't any demand? It just seemed really odd to me that 8-bit isn't very prevalent in the community at all.
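As a rough illustration of the footprint gap described above, here is a back-of-the-envelope sketch, assuming the weights dominate memory use and ignoring the KV cache, activations, and per-group quantization metadata; the 70B parameter count is just an illustrative figure, not tied to any specific model.

```python
# Approximate weight footprint of a hypothetical 70B-parameter model
# at different bit widths (weights only; KV cache etc. ignored).
PARAMS = 70e9  # illustrative parameter count

for name, bits in [("fp32", 32), ("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 2**30  # bytes -> GiB
    print(f"{name:>5}: ~{gib:.0f} GiB of weights")

# Output with these assumptions:
#  fp32: ~261 GiB of weights
#  fp16: ~130 GiB of weights
# 8-bit: ~65 GiB of weights
# 4-bit: ~33 GiB of weights
```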

Very simple: 8-bit is slower than 4-bit because of memory bandwidth. There are simply more gigabytes to copy around.

On CPU, however (GGML/llama.cpp), TheBloke does often provide an 8-bit option.
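To put the bandwidth point in numbers, a minimal sketch, assuming generation is memory-bandwidth-bound and every weight byte has to be streamed once per generated token; the 70B parameter count and the 900 GB/s bandwidth figure are illustrative assumptions, not measurements of any particular GPU.

```python
# Rough upper bound on decode speed when generation is limited by how fast
# the weights can be streamed from memory (one full pass per token).
PARAMS = 70e9                    # illustrative parameter count
BANDWIDTH_BYTES_PER_S = 900e9    # assumed memory bandwidth (illustrative)

for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    bytes_per_token = PARAMS * bits / 8
    print(f"{name:>5}: <= {BANDWIDTH_BYTES_PER_S / bytes_per_token:.1f} tokens/s")

# With these assumptions, halving the bit width roughly doubles the ceiling:
#  fp16: <= 6.4 tokens/s
# 8-bit: <= 12.9 tokens/s
# 4-bit: <= 25.7 tokens/s
```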

I agree that 4-bit quantization is no good. 5-bit should be a minimum.
