4080 / 16 GB?
Hi ProphetOfBostrom, thanks, very interesting quant! The sad news is that I only have 16 GB on my 4080. Do you think it would be possible to create an even smaller version to test things out?
It probably won't be very good then, I guess -- but I'm curious nevertheless.
Hello, I've been ignoring my notifications. Sorry if this is a little incoherent; just tell me it's nonsense and I'll try again after a nap. The answer is... almost yes. I know it's been a while and you might not care anymore, so if you've found something else to amuse you, go nuts. I gave up on the HQQ uploads I had promised (not really) in order to play VTOL VR.
But you should know:
A sub-15 GB Mixtral HQQ is new and is fine, and I can help you get GGUF IQ2_S or IQ3_XXS (or whatever) Mixtrals going. My chief limit here is VERY LOW UPLOAD SPEED. So advice/guidance/command lines (I'm bored), small files, and piles of GPU time for imatrix calibration or exl2 measurements (maybe, if it's also bored) are all fine. Dealing with piles of >15 GB files which never actually finish uploading... maybe.
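For reference, the imatrix calibration step I'm offering GPU time for is just llama.cpp's imatrix tool run over some calibration text. A rough sketch, with placeholder file names, and flags that may differ slightly between llama.cpp builds (the binary is ./imatrix in older builds, llama-imatrix in newer ones; check --help):

```
# sketch: generate an importance matrix from a high-precision GGUF
# file names are placeholders; -ngl just offloads what fits to speed things up
./imatrix -m mixtral-8x7b.Q8_0.gguf \
          -f calibration-data.txt \
          -o mixtral-8x7b.imatrix \
          -ngl 16
```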
I would first point you to exl2 quants, just because they're currently the market leader for this. I don't use exl2 at that low precision, but it was fine with HQQ, so 2.4-bit exl2 might be okay? You can use the FP8 KV cache to halve its already small size. There's very little compute overhead with exl2, so the KV cache and the model weights are all you need to budget for.
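If you ever wanted to roll your own 2.4-bit exl2 instead of grabbing an existing one, the conversion is a single script in the exllamav2 repo. A sketch only, assuming a cloned exllamav2 and an unquantized HF model on disk (paths are placeholders; check convert.py --help for the current flags):

```
# sketch: quantize an HF Mixtral to ~2.4 bpw exl2
# -i: source model dir, -o: scratch/working dir, -cf: final output dir, -b: target bpw
python convert.py -i /models/Mixtral-8x7B-Instruct-v0.1 \
                  -o /tmp/exl2-work \
                  -cf /models/mixtral-2.4bpw-exl2 \
                  -b 2.4
# an existing measurement.json can be reused with -m to skip the measurement pass
```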
Could such a thing be made? Yes!
There is now a way to create an HQQ Mixtral which uses 14 GB of VRAM. It's new; here are the official ones.
This doesn't involve any quality loss; it's just an architectural change that keeps the 'metadata' (whatever that means) off the GPU. The file is the same size, but you use some system RAM instead. So yes, this repo is technically out of date now, I think.
**Is it possible to run a high quality Mixtral within 16 GB of VRAM? Yes!**
llama.cpp. Even before imatrix quants, Mixtral was faster than a dense model would be with partial offloading. Honestly, try koboldcpp and GGUF and see how you go; pick by file size. Mi(s/x)tral has a small KV buffer, and you should keep it in main RAM if there's any shortage of VRAM (-nkvo / --no-kv-offload on llama.cpp, plain "no KV offload" in koboldcpp).
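As a concrete starting point, a llama.cpp invocation with partial offload and the KV cache kept in system RAM looks roughly like the sketch below. The model name is a placeholder and the layer count is something you tune to your VRAM; depending on your build the binary is ./main or llama-cli:

```
# sketch: partial offload, KV cache left in system RAM
# -ngl: number of layers on the GPU (tune it), -nkvo: no KV offload, -c: context size
./main -m mixtral-iq3_xxs.gguf -ngl 20 -nkvo -c 4096 -p "Hello"
```

Koboldcpp exposes the same knobs in its launcher: the GPU layers count and the "no KV offload" toggle.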
Try these:
https://huggingface.co/ycros/BagelMIsteryTour-v2-8x7B-GGUF/tree/main
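If you'd rather grab files from the command line than the browser, huggingface-cli can pull individual GGUFs. The exact filename below is illustrative; pick a real one from the repo's file list:

```
# sketch: download a single GGUF from the repo above
# (the filename is a placeholder; check the repo for what's actually there)
huggingface-cli download ycros/BagelMIsteryTour-v2-8x7B-GGUF \
    BagelMIsteryTour-v2-8x7B.Q3_K_M.gguf --local-dir .
```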
Note that some of the newest types, IQ2_S and IQ2_M, are missing from that repo's list. However, you can make them yourself, because the repo gives you a Q8_0 (try running that too; Q8 is okay on CPU) and the imatrix file! That's the tuning done for you in advance, which is the part that needs a lot of VRAM. Now you can just re-quantize the Q8_0 with the --imatrix <file.imatrix> switch to type 28, 29, or even 23 (see how much GPU offload you survive; Mixtral is good about this, but imatrix quants supposedly aren't).
Noting that Mixtral is 46.7B parameters (weights), check out the quant types below that aren't in the repo but that I'd recommend. You have everything you need; quantize just runs on the CPU in a few minutes. (See the example command after the list.)
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
21 or Q2_K_S : 2.16G, +9.0634 ppl @ LLaMA-v1-7B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
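Putting that together, the re-quantization step I mean is a single quantize call pointed at the Q8_0 plus the provided imatrix. A sketch with placeholder filenames (the binary is ./quantize in older llama.cpp builds, llama-quantize in newer ones; the type can be given by name or by the number from the list above):

```
# sketch: make an IQ2_S (type 28) from the repo's Q8_0 using its imatrix
./quantize --imatrix BagelMIsteryTour-v2-8x7B.imatrix \
    BagelMIsteryTour-v2-8x7B.Q8_0.gguf \
    BagelMIsteryTour-v2-8x7B.IQ2_S.gguf IQ2_S
```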
IQ2_S is strong; it's one of the newer ones. It should be about 15 GB. With a small context you can fit a compute buffer (if you're smart enough to be running headless, it could be a big one).
If you dropped just a few more layers (the first is the most important and least impactful -- but if any others are marked nonlinear, they're fast enough on the CPU too), that would let you have a long context (KV offloaded and kept in RAM along with the nonlinear output). You've now offloaded so much memory that your prompt processing will still be fast, because you can use a 512 or even 1024 (don't quote me) cuBLAS batch buffer, which speeds up prefill IMMENSELY without KV offload: the fewer chunks the GPU has to fetch the KV buffer in, the better. It's communication latency that kills speed here, not the capacity to move 5 gigabytes of KV cache to the GPU in the blink of an eye.
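In llama.cpp terms that setup is just a slightly lower layer count, -nkvo, and a bigger prompt-processing batch. Again a sketch, with placeholder names and numbers you'd tune yourself:

```
# sketch: long context with the KV cache in RAM and a large prompt-processing batch
# -b: batch size used for prefill; bigger batches mean fewer KV round-trips
./main -m mixtral-iq2_s.gguf -ngl 30 -nkvo -c 8192 -b 1024 -p "Hello"
```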
Am I going to make one of these for you?... maybe?
If I knew people wanted more HQQ models (I'd certainly do the KoboldAI/Erebus Mixtral if I did any others), I'd go for it. They're supposed to be much faster than when I used them. However, if you have no preference over GGUF: I stopped because llama.cpp and exl2 became good choices on a 3090.
Oh, and trying to understand what the f*ck is going on in the HQQ repo made me cry actual tears. There is NO DOCUMENTATION. MOST OF THE COMMENTED CODE DOESN'T RUN. The changelogs are cryptic, and I don't know how many people this project actually involves. I don't know what they want (at one point it came across like they wanted the HQQ method to be able to quantize to existing formats? It'd be killer in exl2).
And God knows I don't know what "scale_zeros" means.
Imatrix quants (there's even a 1.73 BPW one) are very good, and kind of a rarity right now. I suggest downloading a few different-sized Mixtrals of this type and giving them a go.
Here's me throwing a temper tantrum yesterday because I couldn't upload quants of a model which, it turns out, had been compiled by someone who can't read the README.md of their training suite and had a broken Yi tokenizer anyway. Why do these people have supercomputers? I spent a lot of time trying to upload those (still perfectly functional) models, with nothing to show for it but, again, the imatrix and some notes. I might try again; thicker people than me manage with HF. That said, I'd also be very happy just to help you do it yourself, or to provide the imatrix (which is the bit that benefits from the 3090's memory).
I'll keep an eye out for a reply now, in case you still want help getting this model or something similar running. Better than ramming my head against git-lfs for no reason. This is getting too long. Peace.