Comparing sub-50GB Llama 4 Scout quants (KLD/Top P)
Big fat disclaimer: KLD is not everything, PPL even less so, and Top P is... somewhat useful.
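For context, the KLD numbers below are (as I understand llama.cpp's measurement) the per-token Kullback–Leibler divergence between the full BF16 model's next-token distribution P and the quant's distribution Q:

```latex
D_{\mathrm{KL}}(P \parallel Q) = \sum_{i} p_i \log \frac{p_i}{q_i}
```

0 means the quant reproduces the BF16 distribution exactly; the table reports the mean and various percentiles of this per-token value.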
Also, huge thanks to Artus at BeaverAI Club (Discord server: https://discord.gg/kfhWt9XeSB) for helping run the KLD for the full BF16 model; it would probably have taken me days otherwise :D
Before working on Maverick, I decided to blow some compute on calculating the PPL/KLD/Top P of several small Scout quants: the ones I published, the same setup minus my PR changes (i.e. what main would produce), and a couple of Unsloth's quants thrown in.
This is an effort to see whether the PR changes I made are beneficial overall or detrimental. I don't love how much larger the quants get; we lose some of the meaning of "IQ1_M" (which is supposed to average 1.75 BPW..) and the like. Nevertheless, I figured it was worth finding out whether these changes are worth pursuing and applying to Maverick.
Raw data (I'm so sorry, mobile users):
Measurement | IQ1_M (mine) | IQ1_M (main) | IQ2_XXS (mine) | IQ2_XXS (main) | IQ2_S (mine) | UD-IQ1_M (unsloth) | Q2_K_L (mine) | Q2_K_L (main) | UD-Q2_K_XL (unsloth) | IQ3_XXS (mine) | IQ3_XXS (main) |
---|---|---|---|---|---|---|---|---|---|---|---|
Size (GB) | 26.32 | 24.57 | 30.17 | 28.56 | 34.34 | 35.40 | 44.00 | 40.57 | 42.60 | 44.96 | 41.66 |
Mean PPL | 11.81 | 13.79 | 10.55 | 11.66 | 9.85 | 10.30 | 9.02 | 9.88 | 9.31 | 9.27 | 9.76 |
KLD | | | | | | | | | | | |
Mean | 0.691 | 0.933 | 0.464 | 0.664 | 0.361 | 0.376 | 0.217 | 0.332 | 0.185 | 0.164 | 0.244 |
Max | 17.819 | 23.806 | 26.647 | 26.761 | 17.597 | 21.264 | 24.180 | 17.556 | 23.286 | 28.166 | 25.849 |
99.9% | 9.912 | 10.822 | 7.897 | 10.029 | 6.693 | 6.995 | 11.729 | 12.766 | 4.213 | 4.232 | 4.964 |
99% | 5.463 | 6.250 | 4.084 | 5.094 | 3.237 | 3.560 | 2.108 | 2.966 | 1.844 | 1.600 | 2.178 |
Median | 0.315 | 0.503 | 0.187 | 0.336 | 0.141 | 0.131 | 0.067 | 0.125 | 0.060 | 0.056 | 0.099 |
10% | 0.0053 | 0.0099 | 0.002 | 0.004 | 0.0012 | 0.0012 | 0.0005 | 0.0009 | 0.0004 | 0.0004 | 0.0005 |
5% | 0.00097 | 0.00179 | 0.0003 | 0.00064 | 0.00019 | 0.00018 | 0.00008 | 0.00013 | 0.00005 | 0.00005 | 0.00007 |
1% | 0.000046 | 0.000073 | 0.000011 | 0.000030 | 0.000007 | 0.000007 | 0.000003 | 0.000004 | 0.000001 | 0.000001 | 0.000002 |
Delta probs | | | | | | | | | | | |
Mean | -8.03% | -10.30% | -4.62% | -6.70% | -3.38% | -3.46% | -2.14% | -2.37% | -1.38% | -1.13% | -1.57% |
Max | 99.67% | 98.73% | 99.81% | 99.81% | 99.13% | 98.90% | 99.88% | 99.81% | 99.83% | 99.91% | 99.89% |
99.9% | 77.40% | 79.77% | 76.36% | 79.42% | 75.03% | 76.59% | 69.34% | 75.65% | 69.69% | 65.60% | 71.73% |
99% | 42.37% | 47.40% | 41.62% | 47.11% | 40.06% | 40.50% | 32.34% | 41.88% | 33.46% | 31.38% | 37.88% |
95% | 15.79% | 18.51% | 16.32% | 19.86% | 16.05% | 15.56% | 12.41% | 17.30% | 12.83% | 12.71% | 16.04% |
90% | 6.59% | 7.56% | 7.69% | 9.05% | 7.62% | 7.33% | 5.92% | 8.86% | 6.43% | 6.50% | 8.23% |
75% | 0.16% | 0.13% | 0.44% | 0.35% | 0.54% | 0.51% | 0.53% | 0.89% | 0.70% | 0.70% | 0.86% |
Median | -0.78% | -1.21% | -0.18% | -0.42% | -0.09% | -0.09% | -0.03% | -0.02% | -0.01% | -0.01% | -0.01% |
25% | -11.66% | -15.85% | -6.11% | -9.93% | -4.65% | -4.56% | -2.86% | -3.40% | -2.11% | -1.96% | -2.66% |
10% | -35.57% | -46.38% | -23.74% | -34.08% | -19.19% | -18.97% | -12.61% | -16.60% | -10.76% | -10.12% | -13.68% |
5% | -56.91% | -68.67% | -40.94% | -53.40% | -33.86% | -34.31% | -23.01% | -30.06% | -20.07% | -18.53% | -24.41% |
1% | -91.25% | -95.39% | -80.42% | -87.98% | -70.51% | -73.12% | -55.83% | -67.16% | -49.11% | -44.35% | -53.65% |
0.1% | -99.61% | -99.87% | -98.74% | -99.76% | -95.85% | -95.98% | -99.92% | -99.92% | -82.64% | -78.71% | -86.82% |
Minimum | -100.00% | -100.00% | -100.00% | -100.00% | -99.95% | -99.99% | -100.00% | -100.00% | -99.90% | -100.00% | -100.00% |
RMS Δp | 23.63% | 27.63% | 19.13% | 23.06% | 16.88% | 17.16% | 13.55% | 16.31% | 12.16% | 11.30% | 13.69% |
Same top | 68.58% | 62.65% | 74.02% | 67.77% | 76.74% | 77.00% | 82.92% | 77.85% | 83.42% | 84.28% | 80.08% |
Image of the above:
https://i.imgur.com/35GAKe5.png
EDIT: I messed up some of the lower calculations! (That's why I included the raw data, haha..) Here's an updated image:
https://i.imgur.com/hFkza66.png
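For anyone wondering where rows like "99.9%" or "RMS Δp" come from: they're summary statistics over per-token measurements. Here's a minimal sketch of that aggregation (not the actual llama.cpp code; the arrays and dump files are hypothetical stand-ins for its per-token output):

```python
# Summary statistics over per-token measurements, as in the table above.
# Assumes you already have, for every evaluated token:
#   kld[i]      - KL divergence between the BF16 and quant distributions
#   delta_p[i]  - quant probability minus BF16 probability for the actual next token
#   same_top[i] - True if both models agree on the most likely next token
import numpy as np

kld = np.load("kld.npy")            # hypothetical per-token dumps
delta_p = np.load("delta_p.npy")
same_top = np.load("same_top.npy")

print(f"Mean KLD: {kld.mean():.3f}")
print(f"Max KLD:  {kld.max():.3f}")
for q in (99.9, 99, 50, 10, 5, 1):  # 50 is the table's Median row
    print(f"KLD {q}%: {np.percentile(kld, q):.6f}")

print(f"Mean Δp:  {delta_p.mean() * 100:.2f}%")
print(f"RMS Δp:   {np.sqrt((delta_p ** 2).mean()) * 100:.2f}%")
print(f"Same top: {same_top.mean() * 100:.2f}%")
```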
I also added a logit of the Top P against the size (scaled by 100 afterwards for readability), since I think this paints a clearer picture for Top P. Obviously, if a model is extremely tiny but sometimes gives the right answer, it'll get a super high Top P/GB; but as Top P gets closer to 100%, that's where the differences matter more. The logit calculation gives a better picture of those differences, IMO.
I added some "metrics" at the bottom, like (1/PPL)/MB (I used MB because the per-GB numbers came out tiny).
For all of these, bigger is better; I inverted PPL, KLD, and RMS Δp to keep that true, since "smaller per GB" is a weird metric to look at. A short sketch of these derived metrics follows below.
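In case the derivations aren't obvious, here's roughly what I mean (the helper names are mine, and the 1 GB = 1024 MB conversion is an assumption):

```python
import math

def logit_top_p_per_gb(same_top_pct: float, size_gb: float) -> float:
    """Logit of the top-token agreement rate, x100 for readability, per GB."""
    p = same_top_pct / 100.0
    return 100.0 * math.log(p / (1.0 - p)) / size_gb

def inv_ppl_per_mb(ppl: float, size_gb: float) -> float:
    """(1/PPL)/MB -- PPL inverted so that bigger is better."""
    return 1.0 / ppl / (size_gb * 1024.0)

# Example with the IQ3_XXS (mine) column: 44.96 GB, PPL 9.27, Same top 84.28%
print(f"{logit_top_p_per_gb(84.28, 44.96):.2f}")   # ~3.73
print(f"{inv_ppl_per_mb(9.27, 44.96):.2e}")        # ~2.34e-06
```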
I added some colour to highlight a few things, but DON'T read too much into it; it's purely informational. I can't REALLY say which values are more important (though I will say PPL itself seems pretty useless when even the full BF16 model scored over 8).
KLD, RMS Δp, and Top P are all relevant regardless of PPL, simply because they tell you how similarly a quant performs to the full model weights. That doesn't mean a closer quant is strictly better, just more similar.
And I'm sharing the full information because there are distinct sections where each quant performs admirably.
In terms of performance per GB, my IQ3_XXS seems to come out on top, but it also has by far the worst max KLD value. That's not super concerning, since its 99.9% value is very reasonable, but it's worth noting that no quant is best across the board.
More than anything, it looks like my IQ3_XXS and Unsloth's UD-Q2_K_XL are the kings of sub-50GB, trading blows throughout the chart, with mine being 2.36 GB bigger.
And if you need even less weight, my IQ2_S and Unsloth's UD-IQ1_M are similar, with Unsloth's being 1.05 GB bigger.
Anyways, hope someone finds something interesting in the charts!