K-quantisation

#1
by ProphetOfBostrom - opened

What's the word on k-quants? I never found any mention of an issue on the llama.cpp github (or koboldcpp for that matter) - and you're publishing them (Q4_K_M) now.
What was the story there? Bug with llama.cpp's inference or bug with the GGUF quantisation in the first place?
...was there a bug at all?
Just being picky because my download speeds are not fast. I love the work you guys do and I don't want some scuffed nibbles* tainting your delightful transformers.
P.S. Did you keep the original output tensors? If you did, I might get the Q8_0 (isn't this also technically a k-quant now?) and re-quantize it down to Q4_K_S or Q3_K_L. 24 GB sweet spot, right?
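
Roughly what I mean, as a sketch (file names are placeholders and I haven't run this against these exact quants - llama.cpp's quantize tool needs --allow-requantize when the input is already quantized, and re-quantizing from Q8_0 rather than FP16 adds a little extra error, though Q8_0 is close enough to lossless that it should be negligible):

```python
# Sketch only: drive llama.cpp's quantize binary from Python to re-quantize an
# existing Q8_0 GGUF down to Q4_K_S. All file names here are placeholders.
import subprocess

subprocess.run(
    [
        "./quantize",                    # llama.cpp quantize binary (name/path depends on your build)
        "--allow-requantize",            # needed because the input is already quantized (Q8_0)
        "noromaid-mixtral-q8_0.gguf",    # hypothetical input file
        "noromaid-mixtral-q4_k_s.gguf",  # hypothetical output file
        "Q4_K_S",                        # target quant type; Q3_K_L works the same way
    ],
    check=True,
)
```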

Maybe I should just bite the bullet and see if I can QUIP# or HQQ one of these? I'm pretty confident those would be an improvement on Q2_K at least. Contrastive search with the native Transformers support hugely improves RP output quality, but Exllama2 and llama.cpp both seem uninterested in supporting this sampling mode, which means that AWQ and GPTQ (and BnB) now actually offer better-quality generation than the cool-kid quants (IMO - but try contrastive search if you haven't!)
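
If you haven't tried it: contrastive search in Transformers is literally two generate() kwargs. A minimal sketch, with the model ID and prompt as placeholders:

```python
# Minimal contrastive-search sketch with Hugging Face Transformers.
# Setting penalty_alpha > 0 together with a small top_k is what enables it.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/your-model"  # placeholder - any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The tavern door creaks open and", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    penalty_alpha=0.6,   # degeneration penalty; 0.6 is the commonly used value
    top_k=4,             # small candidate pool for contrastive search
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```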

NeverSleep org

Well, I'm a little lost; some say it's fixed, others say it isn't.
For me, it's just that MoE models take quantization very badly, so if you want to be sure, stay on Q4_0, Q5_0 or Q8_0.

Thanks for getting back to me, Undi - what you're describing has in principle already been documented with HQQ (see the bottom of this reply), and now I'm interested.
I'm gonna go ahead and pull the full HF model (unlimited LTE ain't quick - but it's unlimited!) and fiddle with numerous programs all called "quantize". Curiosity is currently killing this cat.
I'll go ahead and use Noromaid as a surrogate for base mixtral primarily because I am controlled not by a living human soul, but by sin and linear rectification.

Hopefully I'll have something worthwhile to contribute (perhaps a strong 2-bit quant!), but I probably won't - so if you're reading this thinking "Noromaid 2-bit HQQ sounds cool", do it yourself, because I'm probably not gonna deliver.
If I don't, I may just upload a (properly) compressed tar archive of the .safetensors; they should be super compressible and they'll definitely save someone at least some time. I'll provide checksums because I'm not a silly billy.
Needless to say: now I've made many promises - expect nothing. You will never see me again. Goodbye.

I get that this is probably not much use to you, but it's relevant, and maybe someone googling this will find it insightful some day. It's worth following those GitHub links a few levels deep, because there are clever people who've actually performed experiments and recorded their results like good data scientists (we're all data scientists here, right, NeverSleep?).
For the 2-bit HQQ quantization of Mixtral, the initial all-2-bit (I think? maths is hard) quantization was weak, but this change seems to produce a much stronger model:

More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit. This model should perform a lot better compared to the all 2-bit model for a slight increase in model size (18.2GB vs. 18GB).

as seen:
https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ
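
If someone wants to try the same attention-4-bit / experts-2-bit split on a merge like this one, my best guess at what it looks like with the hqq package is below - the layer tags, group sizes and exact API calls are assumptions based on my reading of their Mixtral recipe, so trust the linked model card over me:

```python
# Hedged sketch: mixed-precision HQQ quantization of a Mixtral-style model,
# roughly following the mobiuslabs attn-4bit / moe-2bit recipe linked above.
# Repo name, layer tags and config values are assumptions, not verified here.
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "your/noromaid-mixtral-merge"  # placeholder repo

model = HQQModelForCausalLM.from_pretrained(model_id)

# Attention projections get 4 bits; the much larger expert MLPs get 2 bits.
attn_params = BaseQuantizeConfig(nbits=4, group_size=64)
expert_params = BaseQuantizeConfig(nbits=2, group_size=16)

quant_config = {
    # Mixtral attention projections
    "self_attn.q_proj": attn_params,
    "self_attn.k_proj": attn_params,
    "self_attn.v_proj": attn_params,
    "self_attn.o_proj": attn_params,
    # Mixtral sparse-MoE expert weights
    "block_sparse_moe.experts.w1": expert_params,
    "block_sparse_moe.experts.w2": expert_params,
    "block_sparse_moe.experts.w3": expert_params,
}

model.quantize_model(quant_config=quant_config)
```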

A low-effort-to-test idea which could be the whole problem. Probably not, but it's very quick to test... if you have the quants ready.
What kind of issues are you seeing? Is it just behaving like it's lost too much precision, or is it actually malfunctioning and producing complete gibberish? The latter might be a heisenbug involving either (or both of) the tokeniser and the way quantize(.exe, if you insist) behaves when llama.cpp as a whole is compiled with additional flags (like cuBLAS).

A description of the issue you had (or better, just some example bad output) would be invaluable to whoever's gonna figure this out. It probably won't be me but I'll try.

Have you tried using one of your bad quants with the llamacpp_HF loader in text-generation-webui? Use the tokeniser files from your merge repo (not the stock Llama ones that tgwui provides), then compare it to the llama.cpp loader next to it.
Pure speculation, but I've seen reports of k-quants and busted llama.cpp tokeniser output before, so it's worth a few minutes of side-by-side testing, right? Set a fixed seed on load (17 - always use 17 for fixed seeds; don't ask, do Google) for the same .gguf, and my understanding is that they shouldn't diverge - certainly not in a way that makes you think "broken". If they do, you've found a real bug! (I've seen some reports that a small divergence can happen with the fast tokeniser even for a full-fat transformer, but it's not a case of one being more or less correct than the other.)
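
Something like this is the tokeniser half of that check - llama-cpp-python against the HF tokeniser, with all paths as placeholders (I haven't run it against your files):

```python
# Quick tokeniser sanity check: compare llama.cpp's tokenisation of a GGUF with
# the Hugging Face tokeniser from the merge repo. Paths/repo are placeholders.
from llama_cpp import Llama
from transformers import AutoTokenizer

gguf_path = "noromaid-q4_k_m.gguf"    # placeholder GGUF
hf_repo = "your/noromaid-merge-repo"  # placeholder - repo holding the real tokenizer files

prompt = "This is only a test of the tokeniser, nothing more."

llm = Llama(model_path=gguf_path, vocab_only=True)  # vocab_only: no need to load weights for this
gguf_tokens = llm.tokenize(prompt.encode("utf-8"), add_bos=True)

hf_tok = AutoTokenizer.from_pretrained(hf_repo)
hf_tokens = hf_tok.encode(prompt)  # adds BOS by default for Llama-style tokenisers

print("gguf:", gguf_tokens)
print("hf:  ", hf_tokens)
print("match:", gguf_tokens == hf_tokens)
```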
I'd do it myself, but I've committed the next 90 GB of my life to downloading 16-bit files I intend to delete 14 bits from.

ProphetOfBostrom changed discussion status to closed
