Comparing sub-50GB Llama 4 Scout quants (KLD/Top P)

Community Article Published April 9, 2025

Big fat disclaimer: KLD is not everything, PPL is even less so, and Top P is... somewhat useful.

Also, huge thanks to Artus at the BeaverAI Club (Discord server: https://discord.gg/kfhWt9XeSB) for helping run the KLD for the full BF16 model; it would probably have taken me days otherwise :D

Before working on Maverick, I decided to blow some compute on calculating the PPL/KLD/Top P of several small Scout quants: the ones I published, the same setup minus my PR changes (so what main would produce), and even a couple of Unsloth's quants.

This is an effort to see if the PR changes I made are overall beneficial or detrimental. I don't love how much larger the quants get; we're losing some of the meaning of "IQ1_M" (which is supposed to average 1.75 BPW..) and such, but nevertheless I figured it was worth finding out whether these changes are worth pursuing and applying to Maverick.

Raw data (I'm so sorry mobile users):

| Measurement | IQ1_M (mine) | IQ1_M (main) | IQ2_XXS (mine) | IQ2_XXS (main) | IQ2_S (mine) | UD-IQ1_M (unsloth) | Q2_K_L (mine) | Q2_K_L (main) | UD-Q2_K_XL (unsloth) | IQ3_XXS (mine) | IQ3_XXS (main) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Size (GB) | 26.32 | 24.57 | 30.17 | 28.56 | 34.34 | 35.4 | 44 | 40.57 | 42.6 | 44.96 | 41.66 |
| Mean PPL | 11.81 | 13.79 | 10.55 | 11.66 | 9.85 | 10.30 | 9.02 | 9.88 | 9.31 | 9.266434 | 9.76184 |
| **KLD** | | | | | | | | | | | |
| Mean | 0.691 | 0.933 | 0.464 | 0.664 | 0.361 | 0.376 | 0.217 | 0.332 | 0.185 | 0.164 | 0.244 |
| Max | 17.819 | 23.806 | 26.647 | 26.761 | 17.597 | 21.264 | 24.180 | 17.556 | 23.286 | 28.166 | 25.849 |
| 99.9% | 9.912 | 10.822 | 7.897 | 10.029 | 6.693 | 6.995 | 11.729 | 12.766 | 4.213 | 4.232 | 4.964 |
| 99% | 5.463 | 6.250 | 4.084 | 5.094 | 3.237 | 3.560 | 2.108 | 2.966 | 1.844 | 1.600 | 2.178 |
| Median | 0.315 | 0.503 | 0.187 | 0.336 | 0.141 | 0.131 | 0.067 | 0.125 | 0.060 | 0.056 | 0.099 |
| 10% | 0.0053 | 0.0099 | 0.002 | 0.004 | 0.0012 | 0.0012 | 0.0005 | 0.0009 | 0.0004 | 0.0004 | 0.0005 |
| 5% | 0.00097 | 0.00179 | 0.0003 | 0.00064 | 0.00019 | 0.00018 | 0.00008 | 0.00013 | 0.00005 | 0.00005 | 0.00007 |
| 1% | 0.000046 | 0.000073 | 0.000011 | 0.000030 | 0.000007 | 0.000007 | 0.000003 | 0.000004 | 0.000001 | 0.000001 | 0.000002 |
| **Delta probs** | | | | | | | | | | | |
| Mean | -8.03% | -10.30% | -4.62% | -6.70% | -3.38% | -3.46% | -2.14% | -2.37% | -1.38% | -1.13% | -1.57% |
| Max | 99.67% | 98.73% | 99.81% | 99.81% | 99.13% | 98.90% | 99.88% | 99.81% | 99.83% | 99.91% | 99.89% |
| 99.9% | 77.40% | 79.77% | 76.36% | 79.42% | 75.03% | 76.59% | 69.34% | 75.65% | 69.69% | 65.60% | 71.73% |
| 99% | 42.37% | 47.40% | 41.62% | 47.11% | 40.06% | 40.50% | 32.34% | 41.88% | 33.46% | 31.38% | 37.88% |
| 95% | 15.79% | 18.51% | 16.32% | 19.86% | 16.05% | 15.56% | 12.41% | 17.30% | 12.83% | 12.71% | 16.04% |
| 90% | 6.59% | 7.56% | 7.69% | 9.05% | 7.62% | 7.33% | 5.92% | 8.86% | 6.43% | 6.50% | 8.23% |
| 75% | 0.16% | 0.13% | 0.44% | 0.35% | 0.54% | 0.51% | 0.53% | 0.89% | 0.70% | 0.70% | 0.86% |
| Median | -0.78% | -1.21% | -0.18% | -0.42% | -0.09% | -0.09% | -0.03% | -0.02% | -0.01% | -0.01% | -0.01% |
| 25% | -11.66% | -15.85% | -6.11% | -9.93% | -4.65% | -4.56% | -2.86% | -3.40% | -2.11% | -1.96% | -2.66% |
| 10% | -35.57% | -46.38% | -23.74% | -34.08% | -19.19% | -18.97% | -12.61% | -16.60% | -10.76% | -10.12% | -13.68% |
| 5% | -56.91% | -68.67% | -40.94% | -53.40% | -33.86% | -34.31% | -23.01% | -30.06% | -20.07% | -18.53% | -24.41% |
| 1% | -91.25% | -95.39% | -80.42% | -87.98% | -70.51% | -73.12% | -55.83% | -67.16% | -49.11% | -44.35% | -53.65% |
| 0.1% | -99.61% | -99.87% | -98.74% | -99.76% | -95.85% | -95.98% | -99.92% | -99.92% | -82.64% | -78.71% | -86.82% |
| Minimum | -100.00% | -100.00% | -100.00% | -100.00% | -99.95% | -99.99% | -100.00% | -100.00% | -99.90% | -100.00% | -100.00% |
| RMS Δp | 23.63% | 27.63% | 19.13% | 23.06% | 16.88% | 17.16% | 13.55% | 16.31% | 12.16% | 11.30% | 13.69% |
| Same top | 68.58% | 62.65% | 74.02% | 67.77% | 76.74% | 77.00% | 82.92% | 77.85% | 83.42% | 84.28% | 80.08% |

Image of the above:


https://i.imgur.com/35GAKe5.png

EDIT: I messed up some of the lower calculations (that's why I included the raw data, haha). Here's an updated image:

https://i.imgur.com/hFkza66.png

I also applied a logit to Top P before dividing by size (and multiplied by 100 afterwards to make it clearer), since I think this paints a clearer picture for Top P. Obviously if the model is extremely tiny but sometimes gives the right answer, it'll get a super high Top P/GB, but as Top P gets closer to 100%, that's where the differences matter more. The logit calculation gives a better picture of those differences IMO.

I added some "metrics" at the bottom, like 1/PPL/MB (per GB would have been a tiny number).

For all of these, bigger is better (I inverted PPL, KLD, and RMS Δp to get meaningful results, since "smaller per GB" is a weird metric to look at).
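
To make that concrete, here's a minimal Python sketch of how these "per size" metrics can be computed. The exact scaling I used in the spreadsheet isn't spelled out above, so treat the formulas (and the example values, taken from the IQ3_XXS (mine) column) as illustrative rather than a reproduction of the sheet:

```python
import math

def per_size_metrics(size_gb, ppl, kld_mean, rms_dp, top_p):
    """Illustrative 'bigger is better' metrics like the ones described above.

    rms_dp and top_p are fractions (e.g. 0.8428 for 84.28%). The exact scaling
    in my spreadsheet may differ; this only shows the shape of the calculation."""
    size_mb = size_gb * 1024
    inv_ppl_per_mb = (1.0 / ppl) / size_mb        # 1/PPL/MB
    inv_kld_per_gb = (1.0 / kld_mean) / size_gb   # inverted mean KLD per GB
    inv_rms_per_gb = (1.0 / rms_dp) / size_gb     # inverted RMS Δp per GB
    # logit stretches the top end, so the gap between e.g. 83% and 84% "same top"
    # counts for more than a raw Top P/GB ratio would suggest
    logit_top_p_per_gb = 100.0 * math.log(top_p / (1.0 - top_p)) / size_gb
    return inv_ppl_per_mb, inv_kld_per_gb, inv_rms_per_gb, logit_top_p_per_gb

# Example: the IQ3_XXS (mine) column from the table above
print(per_size_metrics(44.96, 9.266434, 0.164, 0.1130, 0.8428))
```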

I added some colour to highlight a few things, but DON'T read too much into it; it's purely informational. I can't REALLY say which values are more important (though I will say PPL itself seems pretty useless when even the full BF16 model got over 8).

KLD, RMS, and Top P are all relevant regardless of the PPL, simply because they tell you how similarly a quantization performs to the full model weights. This doesn't mean that one that's closer is strictly better, just more similar
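
For anyone unfamiliar with how these similarity stats are derived, here's a minimal sketch of the usual per-token definitions (llama.cpp computes them against saved BF16 logits; its exact implementation may differ in details, this is only meant to show what the table rows measure):

```python
import numpy as np

def kld_report_stats(p_full, p_quant):
    """Sketch of the per-token stats in the table above.

    p_full, p_quant: (n_tokens, vocab_size) arrays of softmax probabilities from
    the BF16 model and the quantized model at the same positions."""
    eps = 1e-12
    # KL divergence of the quant distribution from the full-precision one, per token
    kld = np.sum(p_full * (np.log(p_full + eps) - np.log(p_quant + eps)), axis=-1)

    # "Delta probs": change in probability assigned to the BF16 model's top token
    top = p_full.argmax(axis=-1)
    rows = np.arange(len(top))
    delta_p = p_quant[rows, top] - p_full[rows, top]

    # "Same top": how often both models agree on the most likely token
    same_top = float((p_quant.argmax(axis=-1) == top).mean())

    return {
        "kld_mean": float(kld.mean()),
        "kld_median": float(np.median(kld)),
        "delta_p_mean": float(delta_p.mean()),
        "rms_delta_p": float(np.sqrt((delta_p ** 2).mean())),
        "same_top": same_top,
    }
```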

And I share the full information because there are distinct sections where each quant performs admirably

In terms of performance per GB, my IQ3_XXS seems to come out on top, but it has by far the worst max KLD value. That's not super concerning since its 99.9% value is very reasonable, but it's worth noting that no quant is best across the board.

More than anything, it looks like my IQ3_XXS and Unsloth's UD-Q2_K_XL are the kings of sub-50GB, trading blows throughout the chart, with mine being 2.36 GB bigger.

And if you need even less weight, my IQ2_S and Unsloth's UD-IQ1_M are similar, with Unsloth's being 1.05 GB bigger.

Anyways, hope someone finds something interesting in the charts!

Community

I took the liberty of trying to add more visualizations of the data. You should be able to view them here:

[Lite]
https://page.genspark.site/page/toolu_013iF9H8mPpqYyNHfi3sc8KA/performance_metrics_analysis.html

[Advanced]
https://page.genspark.site/page/toolu_018LCdFVi8w6aXChyMqERMdx/revised_quantization_performance_analysis.html

[Notes]
The edit includes an updated revision with enhanced visuals; the lite version is kept for those who still want it. Enhanced mode is not optimized for mobile, though landscape mode seems to fix the issue, and desktop looks fine.

Article author

Awesome thanks!


Holy smokes this is the kind of content I signed up for! Thanks so much bartowski for writing up this deep dive into synthetic benchmarks to compare quant quality for a given model architecture.

I've been keeping an eye on your llama.cpp PR and the good progress with your newer V2 models with lower perplexity.

I've only been testing perplexity, but KLD and Top P look interesting as well. Really appreciate the color-coded charts @Matrix-Array, including that final "metric per GiB", which shows the differences most clearly to my eye. I first heard about those from @nicoboss a week ago:

> I highly recommend you measure KL-divergence, top token probability and same token probability instead of perplexity to get much better data. - team mradermacher

It's probably worth merging your PR assuming it is not affecting non-MoE models. Interesting too that GG suggested llama.cpp PR#12511, which is essentially the same thing I've been using on @ikawrakow's ik_llama.cpp with --custom-q.

From what I can tell, the mainline PR also supports tensor- and row-level custom quants using a similar regex style to what you found. The unsloth team (@danielhanchen) has, for multiple models now, shared advice about measuring per-layer activation and weight error to guide quantization, given that some layers seem more sensitive, e.g. this gemma-3-27b example.

Anyway, sorry for all the pings, just appreciative of y'all, and this is all great news for the average AI enthusiast as the leading quant cookers are all improving their methods.

Cheers!

Article author

> I've only been testing perplexity, but KLD and Top P look interesting as well. Really appreciate the color-coded charts @Matrix-Array, including that final "metric per GiB", which shows the differences most clearly to my eye. I first heard about those from @nicoboss a week ago:

Yeah perplexity is definitely an interesting stat, but as you can see it doesn't tell the whole story

Llama 4 has absurdly high perplexity, but that just means it's not likely to repeat wikitext verbatim. If perplexity improves from quantizing, that's actually more a sign that quantization has fundamentally changed something in the model.

Did it get better? Certainly possible! But I'd rather it remain true to the source and release it as that, rather than trying to game the PPL system (not that I've seen anyone do that recently, which is nice).

KLD and Top P can show the similarity to the full weights, which is extremely valuable, and I much prefer them overall! And you can even see why in my chart.

My updated Q2_K_L has the best PPL of all the quants; however, it diverges more from BF16 than my IQ3_XXS! Does that mean Q2_K_L is better? Almost certainly not!

Nice post, thanks for sharing. What is the test corpus for calculating PPL/KLD/etc.?

Article author

The first 300 chunks of wikitext. I figured the choice of dataset was fairly arbitrary from a KLD perspective; PPL may of course change from one dataset to another (it's also weird that Scout has almost 9 PPL on wikitext at full precision...).
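
For context, perplexity here is just the exponential of the mean negative log-likelihood per token over that corpus, which is why the dataset choice matters a lot for PPL but much less for KLD/Top P (those only compare the quant against BF16 on the same tokens). A tiny sketch with made-up token probabilities:

```python
import numpy as np

def perplexity(token_probs):
    """exp of the mean negative log-likelihood per token over the eval text."""
    return float(np.exp(-np.mean(np.log(token_probs))))

# made-up probabilities the model assigned to each actual next token
print(perplexity([0.10, 0.35, 0.02, 0.50, 0.08]))  # ≈ 8.1
```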

Here are two graphs showing mean KLD and top token probability as a function of ln(PPL). The data is taken from the above table. The correlation coefficients when fitting the data points with a straight line (the red lines in the plots) are 0.988 for KLD and -0.978 for top token probability. Statistically, this means that if I computed (or observed) one of these three, I wouldn't need to compute (or observe) the other two, as I would learn nothing new. I took the liberty of adding error bars on the top token probability, which gives a more realistic view of departures from the expectation.

ppl_vs_kld.png

ppl_vs_top1.png
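
For anyone who wants to check that fit against the table values themselves, here is a quick sketch (assuming "top token probability" refers to the Same top row, i.e. what the article calls Top P); the coefficients should come out close to the ones quoted:

```python
import numpy as np

# Values copied from the table above, one entry per quant (same column order)
ppl      = np.array([11.81, 13.79, 10.55, 11.66, 9.85, 10.30, 9.02, 9.88, 9.31, 9.266434, 9.76184])
kld_mean = np.array([0.691, 0.933, 0.464, 0.664, 0.361, 0.376, 0.217, 0.332, 0.185, 0.164, 0.244])
same_top = np.array([68.58, 62.65, 74.02, 67.77, 76.74, 77.00, 82.92, 77.85, 83.42, 84.28, 80.08])

ln_ppl = np.log(ppl)
print("r(ln PPL, mean KLD):  ", np.corrcoef(ln_ppl, kld_mean)[0, 1])
print("r(ln PPL, same top %):", np.corrcoef(ln_ppl, same_top)[0, 1])
```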

Article author

I can definitely see what you mean. I guess my concern is that it doesn't strictly have to go that way; it usually does, of course, but even though the trend is a nice straight line, the individual data points don't sit on it quite as neatly.

I guess for me, even though PPL is a (surprisingly) useful approximation, when we're getting down to the SUPER nitty gritty (as I seem to be here), it's easier to paint the whole picture by using KLD and Top P

In case this is of interest, here are some notes on quantizing L4-Scout with ik_llama.cpp

