Can't say I am thrilled to download and convert these big models just for one quant, but you definitely earned it :) They should be there already.
Usually Q4_1 has lower perplexity than Q4_0 across the board.
Well, we've been telling you: Q4_1 is well known to be very unstable. That was one of the reasons it was abandoned: it is often larger and worse than Q4_0. The reason I added it was that you convinced me it is useful in certain situations (speed with Metal). This is just another data point that the problems were not fixed in recent versions.
The Q4_1 quant was done with the same imatrix, but using the current version of llama.cpp.
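For anyone who wants to reproduce that step, here is a minimal sketch of the invocation, assuming the current llama-quantize binary name and placeholder file names (the real paths differ):

```python
# Rough sketch of the re-quantization step described above, driving llama.cpp's
# llama-quantize binary from Python. All file names are placeholders, not the
# actual paths used for this model.
import subprocess

subprocess.run(
    [
        "llama-quantize",            # quantize tool from a current llama.cpp build
        "--imatrix", "imatrix.dat",  # reuse the previously computed importance matrix
        "model-f16.gguf",            # unquantized source GGUF
        "model-Q4_1.gguf",           # output file
        "Q4_1",                      # target quantization type
    ],
    check=True,                      # fail loudly if quantization errors out
)
```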
We will have to accept that Q4_1 quants can turn out worse than Q4_0 quants. For static quants, my measurements so far even indicate that this is usually the case. While for weighted/imatrix quants Q4_1 is usually much better than Q4_0, you cannot rely on imatrix training working that well for every model. There are some models/architectures that see less improvement from weighted/imatrix quants, in which case Q4_0 might still beat Q4_1. However, I would expect such outliers to be quite rare, and I would expect them to occur mainly for non-English models, because our imatrix dataset is English-focused.
In this specific case I wouldn't count too much on perplexity measurements. They are one of the worst measurements llama.cpp gives you, especially when comparing quants of almost the same quality. Instead, use KL-divergence and token-probability measurements and see whether they lead you to the same conclusion.
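To make that concrete, here is a rough sketch of what those two measurements compare, assuming you already have per-token probability distributions from the reference (unquantized) model and from the quant; the array names, shapes, and toy data are made up purely for illustration:

```python
# Minimal sketch of the suggested metrics, computed from per-token probability
# distributions over the vocabulary. Two quants can have nearly identical
# perplexity while KL-divergence against the reference still separates them.
import numpy as np

def kl_divergence(p_ref: np.ndarray, p_quant: np.ndarray, eps: float = 1e-10) -> float:
    """Mean KL(p_ref || p_quant) over all evaluated token positions."""
    p = np.clip(p_ref, eps, 1.0)
    q = np.clip(p_quant, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def top_token_agreement(p_ref: np.ndarray, p_quant: np.ndarray) -> float:
    """Fraction of positions where the quant picks the same top token as the reference."""
    return float(np.mean(p_ref.argmax(axis=-1) == p_quant.argmax(axis=-1)))

# Toy data: 128 token positions over a 32000-token vocabulary, with the "quant"
# being a slightly noisy copy of the reference distribution.
rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 32000))
p_ref = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
noisy = logits + rng.normal(scale=0.05, size=logits.shape)
p_q = np.exp(noisy) / np.exp(noisy).sum(axis=-1, keepdims=True)
print(kl_divergence(p_ref, p_q), top_token_agreement(p_ref, p_q))
```

You don't have to script this yourself: if I remember the flags right, llama.cpp's perplexity tool can report these numbers directly once you save the reference model's logits (--kl-divergence-base) and then evaluate the quant against that file with --kl-divergence.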