Compilade
compilade's activity
KLD measures the difference between two probability distributions, typically between a "ground truth" distribution and a model's predictions.
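As a minimal sketch of that definition (hypothetical probability values, not taken from any real model):

```python
import math

def kl_divergence(p, q):
    """KL divergence D(p || q) between two discrete distributions,
    e.g. p = "ground truth" (full-precision) token probabilities,
    q = quantized-model token probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical full-precision probabilities
q = [0.6, 0.3, 0.1]  # hypothetical quantized probabilities
print(kl_divergence(p, q))  # small positive value; 0 only when p == q
```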
Yes, and ln(PPL(Q)/PPL(base)), from my understanding, measures the difference between the probabilities of the "correct" tokens according to the test dataset (at least for the second half of each chunk, same as for KLD). This means it would be possible to keep perplexity the same or even better while also increasing KLD, by changing the probabilities of the non-"correct" tokens.
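A toy demonstration of that point (made-up distributions, with the "correct" token at index 0): perplexity only depends on the probability assigned to the correct token, so shuffling the rest of the mass leaves PPL unchanged while KLD grows.

```python
import math

def nll(dist, correct):
    """Per-token negative log-likelihood; PPL = exp(mean NLL over tokens)."""
    return -math.log(dist[correct])

def kld(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

correct = 0
base  = [0.5, 0.3, 0.2]  # hypothetical full-precision distribution
quant = [0.5, 0.2, 0.3]  # same mass on the correct token, shuffled elsewhere

# Perplexity contribution is identical...
assert nll(base, correct) == nll(quant, correct)
# ...but KLD is nonzero because the other tokens' probabilities moved.
assert kld(base, quant) > 0
```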
This makes me wonder: do all of the token probabilities have to match closely for a quantized model to still be good?
I guess it depends on whether the goal is to make a faithful quantization, or an equally good model through quantization-aware fine-tuning.
The way imatrix works, it can't really "fine-tune" a model towards a lower perplexity; it can only prioritize reducing quantization error in the weight columns that have more impact on the activations. So I would say that faithfulness to the full-precision model is the goal of the quantization in this case, and thus KLD feels more appropriate.
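To illustrate the idea (this is a toy sketch of importance-weighted quantization, not llama.cpp's actual imatrix code): each weight's quantization error is weighted by a per-column activation statistic, so columns that influence the activations more are quantized more faithfully.

```python
def weighted_quant_error(weights, importance, scale):
    """Importance-weighted squared error of round-to-nearest quantization.
    `importance` is a hypothetical per-column activation statistic."""
    return sum(imp * (w - round(w / scale) * scale) ** 2
               for w, imp in zip(weights, importance))

weights = [0.13, -0.27, 0.81]
importance = [4.0, 0.5, 1.0]  # made-up activation statistics

# Pick the candidate scale that minimizes the weighted error:
best_scale = min((0.05 * k for k in range(1, 21)),
                 key=lambda s: weighted_quant_error(weights, importance, s))
```

Note that this objective only reduces error relative to the full-precision weights; nothing in it pushes the model towards lower perplexity on any dataset.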
Of course, I might be wrong; I don't really have a full understanding of the statistics going on in perplexity and KL-divergence calculations.
However, for quantization-aware fine-tuning, ln(PPL(Q)/PPL(base)) is likely a better indicator of quantization quality than KLD, unless the goal of the fine-tuning was actually to minimize KLD.