Very bad results with model quant and KV cache quant, only BF16 works well

#34

by qenme - opened about 1 month ago

qenme

about 1 month ago

edited about 1 month ago

Hello,

First of all, this model series is great. It's concise and precise when thinking, and it's great at BF16. However, some community members have been showing that Gemma 4 is very sensitive to weight quantization and KV cache quantization. I have found similar results.

Do you have any recommendations on how people can quantize the model and KV cache and still get more favorable results than what I show below?

It's very possible that my testing is flawed, so I will provide detail. I am using llama.cpp's llama-perplexity to do some perplexity and KL divergence testing, and some results are below. The dataset is a smaller subset of wikitext-2 (half the size); different datasets produce similar divergence. The model is gemma-4-26B-A4B-it-BF16.gguf running in llama.cpp.

Note that day 1 results were much worse because attention rotation wasn't working. These results are from the latest llama.cpp as of 4/26/26, where attention rotation works with q8_0 and below.
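For readers unfamiliar with the two headline metrics, here is a minimal sketch of what "KL divergence" and "same top p" measure at a single token position. It is an illustration only: the function names and the toy logits are made up, and llama-perplexity's internal implementation may differ in its details.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(base_logits, test_logits):
    """KL(base || test) in nats for one token position."""
    p = softmax(base_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def same_top(base_logits, test_logits):
    """True when both models rank the same token highest."""
    return base_logits.index(max(base_logits)) == test_logits.index(max(test_logits))

# Toy 4-token vocabulary; these numbers are invented for illustration.
base = [2.0, 1.0, 0.5, -1.0]
close = [2.1, 0.9, 0.5, -1.0]  # small perturbation, top token unchanged
far = [0.5, 2.5, 0.5, -1.0]    # top token flips

print(kl_divergence(base, base))
print(kl_divergence(base, close), same_top(base, close))
print(kl_divergence(base, far), same_top(base, far))
```

Averaging the first quantity over all evaluated tokens gives a mean KLD, and the fraction of positions where `same_top` is true corresponds to a "same top p" percentage.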

I am also uploading screenshots since I can't upload a csv. The screenshots are of these results in a table format in case the pasted text below is hard to read.

[Screenshot: same top p comparison of model and KV cache in various formats]

[Screenshot: KL divergence comparison of model and KV cache in various formats]

EDIT 1: It seems that going from BF16 to ANYTHING tanks the stats. Going from F16 to anything lower shows more standard results. For example, F16 KV cache to Q8_0 KV cache is a 1% difference in same top p.
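As background on why BF16 and F16 are not interchangeable, here is a small sketch of how the two 16-bit formats round the same values. BF16 keeps float32's 8 exponent bits but only 7 mantissa bits, while F16 has 5 exponent bits and 10 mantissa bits. This only illustrates the number formats; it is not a diagnosis of what happens inside Gemma 4 or llama.cpp, and the helper names are hypothetical.

```python
import math
import struct

def to_f16(x):
    """Round a float to IEEE half precision and back; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def to_bf16(x):
    """Round a float to bfloat16 and back (finite inputs only).

    bfloat16 is the top 16 bits of a float32, so we round the float32
    bit pattern to nearest-even at bit 16 and zero the low half.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

# Example values chosen to show one precision case and two range cases.
for x in (1.001, 70000.0, 1e-8):
    print(x, "f16 ->", to_f16(x), "bf16 ->", to_bf16(x))
```

In this sketch, 1.001 keeps more precision in F16 than in BF16, while 70000.0 overflows F16 and 1e-8 underflows it to zero even though BF16 can represent both approximately.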

Full table:

Here is the model in BF16 with KV cache at BF16:
perplexity: calculating perplexity over 210 chunks, n_ctx=512, batch_size=2048, n_seq=4
Final estimate: PPL = 18485.4149 +/- 614.44801

Here's the model in BF16 with KV cache at F16 (note the same top p score):
====== Perplexity statistics ======
Mean PPL(Q) : 18152.525879 ± 603.267310
Mean PPL(base) : 7058.138346 ± 189.564955
Cor(ln(PPL(Q)), ln(PPL(base))): 93.80%
Mean ln(PPL(Q)/PPL(base)) : 0.944628 ± 0.012298
Mean PPL(Q)/PPL(base) : 2.571857 ± 0.031629
Mean PPL(Q)-PPL(base) : 11094.387533 ± 430.486841

====== KL divergence statistics ======
Mean KLD: 0.485541 ± 0.006790
Maximum KLD: 29.637039
99.9% KLD: 16.962774
99.0% KLD: 8.318924
95.0% KLD: 2.527373
90.0% KLD: 1.109702
Median KLD: 0.025509
10.0% KLD: 0.000095
5.0% KLD: 0.000010
1.0% KLD: 0.000000
0.1% KLD: -0.000001
Minimum KLD: -0.000238

====== Token probability statistics ======
Mean Δp: 0.065 ± 0.037 %
Maximum Δp: 99.995%
99.9% Δp: 90.877%
99.0% Δp: 26.504%
95.0% Δp: 3.657%
90.0% Δp: 0.588%
75.0% Δp: 0.002%
Median Δp: -0.000%
25.0% Δp: -0.002%
10.0% Δp: -0.575%
5.0% Δp: -3.570%
1.0% Δp: -25.089%
0.1% Δp: -88.887%
Minimum Δp: -100.000%
RMS Δp : 8.515 ± 0.148 %
Same top p: 81.675 ± 0.167 %
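As a quick sanity check on how the perplexity statistics in the block above relate to one another, the following sketch recomputes the ratio and difference lines from the reported means. The values are copied from the output above; the interpretation that the reported ratio equals the exponential of the mean log ratio is an observation about these numbers, not a statement about llama-perplexity's source code.

```python
import math

# Values copied from the BF16 model / F16 KV cache block above.
ppl_q = 18152.525879
ppl_base = 7058.138346
mean_ln_ratio = 0.944628

difference = ppl_q - ppl_base
ratio = ppl_q / ppl_base
ratio_from_log = math.exp(mean_ln_ratio)

print(f"PPL(Q) - PPL(base)      = {difference:.6f}")
print(f"PPL(Q) / PPL(base)      = {ratio:.6f}")
print(f"exp(mean ln ratio)      = {ratio_from_log:.6f}")
```

Both ratio computations land on the reported 2.571857 to within rounding, and the difference matches the reported 11094.387533, so the block is internally consistent.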

Here's the model at BF16 with KV cache at Q8_0.

====== Perplexity statistics ======
Mean PPL(Q) : 18279.907685 ± 607.276902
Mean PPL(base) : 7058.138346 ± 189.564955
Cor(ln(PPL(Q)), ln(PPL(base))): 93.59%
Mean ln(PPL(Q)/PPL(base)) : 0.951621 ± 0.012447
Mean PPL(Q)/PPL(base) : 2.589905 ± 0.032237
Mean PPL(Q)-PPL(base) : 11221.769338 ± 435.026501

====== KL divergence statistics ======
Mean KLD: 0.544280 ± 0.007331
Maximum KLD: 33.094250
99.9% KLD: 18.138201
99.0% KLD: 9.032127
95.0% KLD: 2.870639
90.0% KLD: 1.278298
Median KLD: 0.032150
10.0% KLD: 0.000118
5.0% KLD: 0.000013
1.0% KLD: 0.000000
0.1% KLD: -0.000001
Minimum KLD: -0.000309

====== Token probability statistics ======
Mean Δp: -0.016 ± 0.038 %
Maximum Δp: 99.997%
99.9% Δp: 90.437%
99.0% Δp: 25.569%
95.0% Δp: 3.978%
90.0% Δp: 0.649%
75.0% Δp: 0.002%
Median Δp: -0.000%
25.0% Δp: -0.002%
10.0% Δp: -0.596%
5.0% Δp: -3.728%
1.0% Δp: -27.902%
0.1% Δp: -88.848%
Minimum Δp: -99.998%
RMS Δp : 8.851 ± 0.150 %
Same top p: 80.196 ± 0.172 %

Next, here's the model at Q8_0 with KV cache at BF16.

====== Perplexity statistics ======
Mean PPL(Q) : 18851.154285 ± 627.063932
Mean PPL(base) : 7058.138346 ± 189.564955
Cor(ln(PPL(Q)), ln(PPL(base))): 92.92%
Mean ln(PPL(Q)/PPL(base)) : 0.982393 ± 0.012943
Mean PPL(Q)/PPL(base) : 2.670839 ± 0.034568
Mean PPL(Q)-PPL(base) : 11793.015939 ± 456.326011

====== KL divergence statistics ======
Mean KLD: 0.731529 ± 0.008635
Maximum KLD: 38.478340
99.9% KLD: 19.972658
99.0% KLD: 10.503417
95.0% KLD: 3.859527
90.0% KLD: 1.866741
Median KLD: 0.077997
10.0% KLD: 0.000327
5.0% KLD: 0.000036
1.0% KLD: 0.000001
0.1% KLD: -0.000001
Minimum KLD: -0.000042

====== Token probability statistics ======
Mean Δp: -0.110 ± 0.045 %
Maximum Δp: 100.000%
99.9% Δp: 97.303%
99.0% Δp: 33.279%
95.0% Δp: 5.334%
90.0% Δp: 0.846%
75.0% Δp: 0.002%
Median Δp: -0.000%
25.0% Δp: -0.003%
10.0% Δp: -1.022%
5.0% Δp: -6.211%
1.0% Δp: -36.593%
0.1% Δp: -96.947%
Minimum Δp: -99.995%
RMS Δp : 10.494 ± 0.154 %
Same top p: 75.501 ± 0.186 %

And finally, here's the model at Q8_0 and KV cache at Q8_0.

====== Perplexity statistics ======
Mean PPL(Q) : 19075.672627 ± 634.444853
Mean PPL(base) : 7058.138346 ± 189.564955
Cor(ln(PPL(Q)), ln(PPL(base))): 92.78%
Mean ln(PPL(Q)/PPL(base)) : 0.994233 ± 0.013040
Mean PPL(Q)/PPL(base) : 2.702649 ± 0.035244
Mean PPL(Q)-PPL(base) : 12017.534280 ± 464.000151

====== KL divergence statistics ======
Mean KLD: 0.756754 ± 0.008831
Maximum KLD: 42.110668
99.9% KLD: 19.646534
99.0% KLD: 10.752383
95.0% KLD: 4.045067
90.0% KLD: 1.934487
Median KLD: 0.082384
10.0% KLD: 0.000336
5.0% KLD: 0.000039
1.0% KLD: 0.000001
0.1% KLD: -0.000001
Minimum KLD: -0.000015

====== Token probability statistics ======
Mean Δp: -0.178 ± 0.045 %
Maximum Δp: 100.000%
99.9% Δp: 96.734%
99.0% Δp: 32.610%
95.0% Δp: 5.019%
90.0% Δp: 0.722%
75.0% Δp: 0.002%
Median Δp: -0.000%
25.0% Δp: -0.004%
10.0% Δp: -1.149%
5.0% Δp: -6.318%
1.0% Δp: -36.662%
0.1% Δp: -96.659%
Minimum Δp: -99.998%
RMS Δp : 10.388 ± 0.152 %
Same top p: 75.261 ± 0.186 %

joaquinrfs

18 days ago

I ran some quick tests on Gemma 4 E2B, Llama.cpp commit 8701bf075, but I couldn't replicate your results.

1. Convert the Safetensors model to GGUF in both BF16 and FP16 formats.
2. Calculate the logits generated by the BF16 and FP16 models against a corpus:
   - Wikipedia's full Wolf article copied and pasted into a text file
   - Ran at 4096 tokens of context
3. Quantize Q8_0 versions from both the BF16 and FP16 models.
4. Calculate the KL divergence of the logits generated by these models compared to the logits generated previously by the source models.

Source	KV	KLD 99	KLD 99.9
BF16	FP16	0.000037	0.000052
BF16	Q8_0	0.019366	0.075579
FP16	FP16	0.000038	0.000052
FP16	Q8_0	0.009556	0.045717
Diff¹	Q8_0	0.015760	0.088068
Diff¹	FP16	0.011851	0.055280

¹: These rows compare the BF16 model to the FP16 logits, essentially showing the divergence between both formats.

Both BF16 and FP16 seem to be just as good at retaining quality when a Q8_0 is quantized from them and run at KV FP16, but FP16 does show a minuscule advantage when its Q8_0 is run at KV Q8_0. The difference is so minimal that I wouldn't even consider it. Remember that these are KLD99 and KLD99.9 metrics, so they showcase worst-case scenarios.
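To make the "worst-case" point concrete, here is a minimal sketch of how a 99th or 99.9th percentile of per-token KLD values behaves compared to the median. The data is synthetic and the linear-interpolation percentile is an assumption; the tool used for the table above may compute percentiles slightly differently.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between neighbors, pct in [0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

# Synthetic per-token KLD values: mostly tiny, with a few large outliers.
klds = [0.001] * 990 + [0.5] * 9 + [5.0]

print("median  :", percentile(klds, 50))
print("KLD 99  :", percentile(klds, 99))
print("KLD 99.9:", percentile(klds, 99.9))
```

With data shaped like this, the median stays tiny while the high percentiles are dominated by the handful of badly diverging tokens, which is why small differences at KLD 99 and KLD 99.9 say little about typical tokens.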

I'd love to test using the 26B A4B IT model but I don't have the memory to calculate the logits from the 16-bit GGUFs... either I use swap and take forever, or I generate logits from the Q8_0 models resulting from the BF16 and FP16 GGUFs and compare the two. If the behavior you describe is consistent, the KLD divergences from the logits of both Q8_0 versions should be astronomical!

joaquinrfs

17 days ago

I'd love to test using the 26B A4B IT model but I don't have the memory to calculate the logits from the 16-bit GGUFs... either I use swap and take forever, or I generate logits from the Q8_0 models resulting from the BF16 and FP16 GGUFs and compare the two. If the behavior you describe is consistent, the KLD divergences from the logits of both Q8_0 versions should be astronomical!

Ended up doing exactly that with the 26B A4B model:

1. Convert the Safetensors model to GGUF in both BF16 and FP16 formats.
2. Quantize Q8_0 versions from both the BF16 and FP16 models.
3. Calculate the logits generated by the FP16 -> Q8_0 model against a corpus.
4. Calculate the KL divergence of the logits generated by the BF16 -> Q8_0 model against the previously calculated logits.

This should demonstrate the direct difference between both versions of the Q8_0 models, one version being quantized from the BF16 and the other being quantized from the FP16.

Source	KV	KLD 99	KLD 99.9
Diff	FP16	0.088909	0.592613
Diff	Q8_0	8.338683	17.427561

Even though memory constraints don't allow me to test to the extent you did, the difference in logits between the Q8_0 GGUFs quantized from BF16 and from FP16 is massive at KV FP16 and gargantuan at KV Q8_0.

I can confirm there is something VERY WRONG going on here.

qenme

17 days ago

I had a reminder to re-test and respond today, so I'm glad you saved me some time by confirming the issue on 26B. Something is very strange here. After doing some more research, it seems that architecture choices in Gemma are causing this. This post on Reddit describes someone who wrote an inference engine and ran into many precision issues with Gemma 4: https://www.reddit.com/r/LocalLLaMA/comments/1sebwz2/got_gemma_4_running_locally_on_cuda_both_float/ Qwen3.5 and 3.6 don't suffer from these same issues: https://techstat.net/qwen3-5-27b-q8-kv-cache-benchmarks-bf16-vs-f16-vs-q8_0/

qenme

8 days ago

I would love to see a response from the Google team here. I spent quite a bit of time making my original post and curating the data. I am just looking for some insight.
