The unsloth dynamic quants of DeepSeek-R1 made some waves recently. https://www.reddit.com/r/LocalLLaMA/comments/1ibbloy/158bit_deepseek_r1_131gb_dynamic_gguf/
There has been some interest in giving other models the same treatment, but weeks later I haven't seen much done about it. Maybe I've been looking in the wrong places? Or maybe nobody has bothered because DeepSeek-R1 is particularly amenable to this treatment and there's little real payoff for other models?
Regardless, looking at which other MoE models might benefit, one very easy answer is the DeepSeek v2 model series, mainly because Unsloth's llama.cpp fork requires fairly little effort to adapt to it.
So, what the hell, why not. Five quants posted (iq1_s, iq1_m, iq2_xxs, iq2_s, and iq3_m), ranging from ~49GB to ~97GB. Not that those designations line up as neatly as one might expect, given the fiddling done.
imatrix data klepped from bartowski. Thanks!
The quantization strategy is pretty simple-minded: basically, just don't let the attention/output layers drop below q4_k. Is this optimal? LOL. It should still perform better than standard llama.cpp low-bit quants.
I leveraged the llama.cpp fork from Unsloth, which was apparently used to create the GGUF versions of their dynamic quants of DeepSeek-R1: https://github.com/unslothai/llama.cpp All of the changes were in llama-quant.cpp, specifically the choices about which quant level to apply to each tensor set, depending on what that set is. A few tweaks were needed because v2.5 and R1 are similar but not exactly the same architecture, and, as mentioned above, my strategy for selecting how far to compress each layer is, ahem, less nuanced than what the fine folks at Unsloth did with R1.
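For illustration, here's a minimal sketch of the kind of per-tensor floor described above, written in the style of the tensor-type selection logic in llama-quant.cpp. The function name and the name-matching rules are hypothetical and simplified, not the actual code from the fork:

```cpp
#include <string>
#include "ggml.h"

// Hypothetical sketch: clamp the quant type for attention/output tensors to
// q4_K while leaving the chosen low-bit type in place for everything else
// (the bulk of the MoE expert weights). Names and matching are illustrative only.
static ggml_type pick_tensor_type(const std::string & name, ggml_type low_bit_type) {
    const bool is_attn_or_output =
        name.find("attn") != std::string::npos ||          // attention tensors
        name.find("output.weight") != std::string::npos;   // final output projection

    if (is_attn_or_output) {
        switch (low_bit_type) {
            case GGML_TYPE_IQ1_S:
            case GGML_TYPE_IQ1_M:
            case GGML_TYPE_IQ2_XXS:
            case GGML_TYPE_IQ2_S:
            case GGML_TYPE_IQ3_S:
                return GGML_TYPE_Q4_K;   // never drop attention/output below q4_K
            default:
                break;
        }
    }
    return low_bit_type;                 // expert FFN tensors keep the low-bit type
}
```

The idea is that the extreme compression lands on the expert FFN weights, which dominate the parameter count in an MoE model, while the comparatively small attention and output tensors keep enough precision to hold things together.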
Thanks all!