metrics

#1
by KnutJaegersberg - opened

it's interesting to see how the performance of this large model degrades across quantization levels. Intuitively, I'd guess that a bigger model like this one is more robust to lower precision than smaller models, i.e. that a 2 bit 180b is better than a comparable 2 bit llama2-70b.
where is the parity point, i.e. perplexity-wise and perhaps even on benchmarks?
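
(For reference, perplexity is just the exponential of the mean negative log-likelihood over a test set, so once per-token log-probabilities have been dumped for each quant, the comparison is trivial. A minimal sketch; the file names and dump format here are made up:)

```python
import math

def load_logprobs(path):
    # Hypothetical dump format: one natural-log token probability per line.
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def perplexity(token_logprobs):
    # Perplexity = exp of the mean negative log-likelihood over the test tokens.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Compare the same test text across two quantized runs (file names are invented).
print("180b @ 2 bit:", perplexity(load_logprobs("falcon-180b-iq2_xs.logprobs")))
print("70b  @ 2 bit:", perplexity(load_logprobs("llama2-70b-iq2_xs.logprobs")))
```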

oh, I'm actually asking about the weighted version here, not the raw one

qualitatively, 120b merges of 70b models seem more robust than the original 70b models for 2 bit quants

the actual question is another one: where are the performance sweet spots for a given amount of vram?
so far, I see that 8 bit 30b models or 4 bit 70b models seem to be the best option on my machine. it seems obvious that 2 bit destroys a lot, even with weighting, yet I wonder if there is an unknown option further up the parameter count ladder.
I feel a 3 bit merge in the 90b-100b range could be another good performance tradeoff, with the newest methods.
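
(As a rough sanity check of those sweet spots: the weight memory alone is about parameters × bits-per-weight / 8. The effective bits-per-weight values below are approximate and the estimate ignores KV cache and runtime overhead, so treat it as a lower bound:)

```python
def weights_gib(params_b: float, bits_per_weight: float) -> float:
    # Rough weight-only footprint; real GGUF quants carry extra scale/block
    # overhead and the KV cache comes on top of this.
    bytes_total = params_b * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

for label, params_b, bpw in [
    ("30b @ 8 bit",  30,  8.5),  # Q8_0 is roughly 8.5 effective bits per weight
    ("70b @ 4 bit",  70,  4.5),  # Q4_0 is roughly 4.5, Q4_K_M closer to 4.8
    ("100b @ 3 bit", 100, 3.5),  # IQ3_XXS/Q3_K sit roughly between 3.1 and 3.9
    ("180b @ 2 bit", 180, 2.4),  # IQ2_XS is roughly 2.3-2.4
]:
    print(f"{label:14s} ~{weights_gib(params_b, bpw):6.1f} GiB of weights")
```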

many good questions with no answers from me :) While I originally had plans for a series of improvements backed by metrics, I got swamped with my new goal of quantizing models that others don't.

I can only offer anecdotal data: indeed, word on the street is that larger models survive lower quantizations better, which I personally think is likely because current larger models are not as optimized as many of the smaller ones, i.e. they are redundant. And quantization degrades only one of possibly several qualities.

OTOH, the 2 bit quantizations of large models do not convince me most of the time (I did play with goliath a bit in Q2_K, and it was... kinda usable), and no IQ1_S has ever convinced me - they were all more or less unusable. That's true even for the larger models, such as grok or even the largest ones. It probably also makes a difference whether the model is a moe or not.

I also have yet to see a metric that adequately matches human experience with a model - all current metrics are pretty much only good for showing that "something degrades", but not by how much in terms of actual fidelity - people are fine with brain-damaged and severely altered models as long as they sound interesting and coherent. And the latter is what seems to survive better in larger models at low quants.

Anyway, this is all anecdotal.

I don't even know how large the influence of the imatrix training data is. Or the imatrix itself. For example, Quant-Cartell uses precomputed imatrix data for different models (e.g. the base mixtral imatrix for a mixtral-derived model), and according to KL divergence, they get much better results than computing their own imatrix. OTOH, they recently compared one of theirs with one of mine, and the KL statistics were pretty much identical, so maybe they just somehow bungle their own imatrix training? But it also means that you can take an imatrix computed for one model and successfully apply it to another, if they are similar enough, at least according to KL divergence, although I find that fishy. But the wrong imatrix seems clearly better than the wrong imatrix generation method.
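
(For context, the KL divergence statistic used here compares the quantized model's next-token distribution against the full-precision model's, token by token, over the same evaluation text. A minimal sketch of the math, not of any particular tool's implementation:)

```python
import numpy as np

def mean_kl_divergence(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """Mean KL(P || Q) per token, where P is the reference (e.g. fp16) model's
    next-token distribution and Q the quantized model's, both given as logits
    of shape (num_tokens, vocab_size)."""
    p_log = p_logits - np.logaddexp.reduce(p_logits, axis=-1, keepdims=True)
    q_log = q_logits - np.logaddexp.reduce(q_logits, axis=-1, keepdims=True)
    p = np.exp(p_log)
    return float(np.mean(np.sum(p * (p_log - q_log), axis=-1)))

# Toy example with a tiny vocabulary; in practice the logits would come from
# running both models over the same evaluation text.
ref   = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.0]])
quant = np.array([[1.8, 0.7, -0.9], [0.0, 0.3, 0.1]])
print("mean KL per token:", mean_kl_divergence(ref, quant))
```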

So many questions. And I am too busy keeping the machine running - I can't even try out any new models, I am still stuck with playing around with QuartetAnemoi when I have time :)

Anyway, with regards to quants - the lowest quants I personally touch (if I can't avoid it) are IQ3_XXS. Any IQ2*, Q2* or IQ1* quants are almost always noticeably worse, even when I try them out on really large models such as the clown-truck (I don't really have the means to effectively use large models myself).

Also, kronos-670b incoming. I'll probably do an imatrix from the Q2 there, for lack of cpu, disk space and memory :) exciting times.
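
(Conceptually, an imatrix just accumulates activation statistics per weight column over some calibration text; doing it "from the Q2" only means those activations come from an already-quantized run. A very simplified sketch of the idea, not the actual llama.cpp imatrix code:)

```python
import numpy as np

def update_imatrix(imatrix: np.ndarray, activations: np.ndarray) -> np.ndarray:
    # Conceptual sketch: accumulate the mean squared activation seen by each
    # input channel of a weight matrix across calibration tokens. Columns with
    # large values matter more and should be quantized more carefully.
    return imatrix + np.mean(activations.astype(np.float64) ** 2, axis=0)

# Toy calibration loop: 'activations' stands in for the inputs a given linear
# layer sees for each token of the calibration text.
hidden_dim = 8
imatrix = np.zeros(hidden_dim)
for _ in range(4):  # pretend batches of calibration tokens
    activations = np.random.randn(16, hidden_dim).astype(np.float32)
    imatrix = update_imatrix(imatrix, activations)
print("per-column importance:", np.round(imatrix, 3))
```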

KnutJaegersberg changed discussion status to closed
