|
# Initial Testing 2024-04-25
|
|
|
Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.
|
|
|
Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead of converting to fp32 as an intermediate step. I think that's bollocks. There is no logical or mathematical justification for how that could possibly matter.
|
|
|
So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways (command sketch below):

- fp16 specifically with `--outtype f16`
- fp32 specifically with `--outtype f32`
- "Auto" with no outtype specified
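
For reference, the conversions use llama.cpp's convert script. A minimal sketch of the three invocations, assuming a local clone of the HF repo; the exact script name (`convert.py` on older checkouts, `convert_hf_to_gguf.py` on newer ones) and the output filenames here are placeholders:

````bash
# Sketch of the three conversions. Script name depends on your llama.cpp
# checkout; output filenames are placeholders.
MODEL_DIR=./Meta-Llama-3-8B-Instruct-hf   # local clone of Undi95/Meta-Llama-3-8B-Instruct-hf

python convert.py "$MODEL_DIR" --outtype f16 --outfile llama3-8b-instruct-f16.gguf
python convert.py "$MODEL_DIR" --outtype f32 --outfile llama3-8b-instruct-f32.gguf
python convert.py "$MODEL_DIR" --outfile llama3-8b-instruct-auto.gguf   # no --outtype: the converter decides
````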
|
|
|
I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw text file. The results:
|
|
|
````
FP16 specified: size 14.9GB  PPL @ fp16 9.5158 +/- 0.15418  PPL @ Q4km 9.6414 +/- 0.15494
FP32 specified: size 29.9GB  PPL @ fp32 9.5158 +/- 0.15418  PPL @ Q4km 9.6278 +/- 0.15466
None specified: size 29.9GB  PPL @ ???? 9.5158 +/- 0.15418  PPL @ Q4km 9.6278 +/- 0.15466
````
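
For anyone who wants to reproduce the numbers, the quantize and perplexity steps were along these lines. This is a sketch with placeholder filenames; the binaries are called `quantize` and `perplexity` in older llama.cpp builds, `llama-quantize` and `llama-perplexity` in newer ones:

````bash
# Quantize each full-precision conversion down to Q4_K_M
./quantize llama3-8b-instruct-f16.gguf llama3-8b-instruct-f16-Q4_K_M.gguf Q4_K_M
./quantize llama3-8b-instruct-f32.gguf llama3-8b-instruct-f32-Q4_K_M.gguf Q4_K_M

# Perplexity over the abbreviated wiki text, one run per GGUF
./perplexity -m llama3-8b-instruct-f16-Q4_K_M.gguf -f wiki.short.raw
````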
|
|
|
|
|
As you can see, converting to fp32 has no meaningful effect on PPL compared to converting to fp16. PPL is identical at full weight, and the minuscule loss shown at Q4km is well within the margin of error. There will no doubt be some people who will claim "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to use those files to do more extensive testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.
|
|
|
|
|
# Continued Experiments 2024-05-11
|
|
|
As an imatrix enjoyer, it has been bugging me whether the precision of the quant used to generate the imatrix actually matters. Scuttlebutt says "yes, but only a little". Logically, I don't think it should matter to a meaningful extent. PPL scales, so a value that is relatively important at fp16 should also register as relatively important at Q8 or even Q4.
|
|
|
To test this theory properly, I took failspy/Llama-3-8B-Instruct-abliterated and converted it to GGUF in both fp16 and fp32 formats. I then quantized each of those GGUFs to both Q8_0 and Q4_0. I then generated imatrices for each of those six GGUFs. Then I created eight GGUFs quantized at Q4_K_M (see the command sketch after this list):
|
|
|
- fp32 GGUF, fp32 imatrix
- fp16 GGUF, fp16 imatrix
- fp32 GGUF, fp32->Q8 imatrix
- fp16 GGUF, fp16->Q8 imatrix
- fp32 GGUF, fp32->Q4 imatrix
- fp16 GGUF, fp16->Q4 imatrix
- fp32 GGUF, no imatrix
- fp16 GGUF, no imatrix
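
As a sketch of how each of these was produced (placeholder filenames, same caveat about binary names as above): the imatrix is generated with the `imatrix` tool against the chosen intermediate quant and the calibration text, then passed to `quantize` via `--imatrix` when making the final Q4_K_M. For example, the "fp16 GGUF, fp16->Q4 imatrix" variant:

````bash
# 1. Intermediate quant that the imatrix will be measured on (here fp16 -> Q4_0)
./quantize llama3-abliterated-f16.gguf llama3-abliterated-f16-Q4_0.gguf Q4_0

# 2. Generate the imatrix from that quant using the calibration text
./imatrix -m llama3-abliterated-f16-Q4_0.gguf -f groups_merged.txt -o imatrix-f16-q4.dat

# 3. Quantize the full-precision GGUF to Q4_K_M, guided by that imatrix
./quantize --imatrix imatrix-f16-q4.dat \
    llama3-abliterated-f16.gguf llama3-abliterated-f16-Q4_K_M-q4imat.gguf Q4_K_M
````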
|
|
|
I ran PPL against all 8 quants, as well as the full fp16 and fp32 GGUFs. All imatrices were created using Kalomaze's groups_merged.txt. All PPL calcs were run using wiki.short.raw. Results:
|
|
|
````
GGUF                       PPL
FP16                       11.5923
FP32                       11.5923
Q4km FP16 + FP16 imat      11.9326
Q4km FP32 + FP32 imat      11.9314
Q4km FP16 + Q8 imat        11.9369
Q4km FP32 + Q8 imat        11.9500
Q4km FP16 + Q4 imat        11.9355
Q4km FP32 + Q4 imat        11.9356
Q4km FP16 no imat          12.3612
Q4km FP32 no imat          12.3643
````
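
The PPL runs themselves are just a loop over the files, something like this (placeholder glob, same binary-name caveat as before):

````bash
# Perplexity for every GGUF in the test set, full precision and quantized alike
for gguf in llama3-abliterated-*.gguf; do
    echo "=== $gguf ==="
    ./perplexity -m "$gguf" -f wiki.short.raw
done
````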
|
|
|
Conclusion:
|
|
|
The importance of the quant size used to generate the imatrix is borderline non-existent. Sort of. While the Q4km quant made with the fp32 GGUF and the fp32-generated imatrix was best, it was by such a minuscule margin that it is implausible that any difference between that (11.9314) and the Q4km made from the fp16 GGUF with the Q4_0-generated imatrix (11.9355) could be detected under normal usage. The only counterintuitive result here is that the Q4_0-imat quants outperformed the Q8_0-imat quants. I cannot think of a reason why this should be the case. But as it seemingly *is* the case, I will be using Q4_0 as my intermediate step for generating imatrices in the future when the full fp16 model is too big for my measly 72GB of VRAM.
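
Put together, the workflow I plan to use when the fp16 model is too big to run directly looks roughly like this; a sketch with placeholder filenames, where `-ngl 99` offloads all layers to the GPU:

````bash
# Model too big to run at fp16? Use a Q4_0 intermediate for the imatrix.
./quantize big-model-f16.gguf big-model-Q4_0.gguf Q4_0

# Generate the imatrix on the Q4_0 quant, fully offloaded to GPU
./imatrix -m big-model-Q4_0.gguf -f groups_merged.txt -o big-model-imatrix.dat -ngl 99

# Feed that imatrix back into the final quants of the original fp16 GGUF
./quantize --imatrix big-model-imatrix.dat big-model-f16.gguf big-model-Q4_K_M.gguf Q4_K_M
````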