Smaller IQs performance?

#1 opened by neph1

Glad to see these quants, nice job!

Has anyone tried the IQ3s or IQ2s? How do they compare to the bigger quants?
I went with the IQ4_XS (because it was the only one available at the time). It seems quite a bit faster than Mixtral Q4_K_M. I had not expected that.

Hey thanks for reaching out, glad you enjoyed them! I've also found myself using IQ4_XS more lately for the nice tradeoff of quality vs. speed.

For IQ3_S specifically I've noticed a performance hit if the entire quant doesn't fit on the GPU, so I've been going a size up or down. Although there were some performance improvements to CPU offloading after this was merged, there's still a noticeable drop in t/s compared to the traditional K quants if you're offloading to CPU at all.
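In case it helps anyone reproduce this, here's a minimal sketch of controlling the GPU/CPU split with llama-cpp-python; the model path and layer count below are placeholders, not the exact files from my tests:

```python
from llama_cpp import Llama

# Placeholder GGUF path; point this at whichever quant you're testing.
llm = Llama(
    model_path="./model-IQ4_XS.gguf",
    n_gpu_layers=18,  # number of layers offloaded to the GPU (-1 = all)
    n_threads=7,      # CPU threads used for any layers left on the CPU
)
```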

Here's a quick t/s comparison between IQ2_S (fully offloaded to GPU), IQ3_XXS, and IQ4_XS:

  • IQ2_S - 33/33 GPU Layers - 7/8 Threads - 38.94 t/s

  • IQ3_XXS - 22/33 GPU Layers - 7/8 Threads - 7.02 t/s

  • IQ4_XS - 18/33 GPU Layers - 7/8 Threads - 6.1 t/s

For reference, I'm using a 4080 SUPER 16GB with an older CPU/RAM combo (i7-9700K / 64GB DDR4). Your exact performance may vary depending on your specs.
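If you want to benchmark your own setup, a rough timing loop like this (reusing the `llm` instance from the sketch above, with a made-up prompt) will give you a ballpark t/s figure. Note it lumps prompt processing in with generation, so it slightly understates pure generation speed:

```python
import time

prompt = "Write a short story about a lighthouse keeper."

start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# llama-cpp-python returns OpenAI-style usage stats with the completion.
generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.2f} t/s ({generated} tokens in {elapsed:.1f}s)")
```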

As for quality, coherence was passable even at IQ2_S (testing shows that going any smaller yields completely incoherent output - at least with current quantization methods).

I'm always curious to hear others' experiences with longer-form chats, but hopefully this gives you a rough idea of how the smaller sizes fare. Your comment was also a good reminder for me to update the model card with the latest sizes!

Thanks for the extensive reply!

For reference, I get about 4.5 t/s with IQ4_XS, 11 layers offloaded, DDR4 RAM, and a short context.
