exl2 vs GGUF at 8-bit

by Samvanity

hello,

I see that you have both GGUF and exl2 quants available for this model... in your opinion, what differences do you observe between these two formats? exl2 seems quite a bit faster at 8-bit, but in theory the quality of their generations should be similar, right? I tried both and couldn't really tell them apart.

Is it ok to assume that, as long as you can fit the entire 8-bit model plus the context into VRAM, you should choose exl2 over GGUF, but if you can't and have to drop to lower-bit quants, then GGUF with imatrix should work better?

I tried quite a few of your quants and found them to be excellent... thank you for all the hard work and contributions to the open source community!

That's pretty much exactly right, good understanding.

If you can fit the whole model entirely in VRAM, go with exl2.

If you want to push past your VRAM capacity by offloading some of the layers to system RAM, go with GGUF.
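For example, here's a minimal sketch of partial offload with llama-cpp-python (the model filename, layer count, and context size are placeholders, not a recommendation):

```python
# Sketch: partial GPU offload of a GGUF quant with llama-cpp-python
# (pip install llama-cpp-python, built with GPU support).
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",  # any GGUF quant file (placeholder name)
    n_gpu_layers=24,                 # layers offloaded to VRAM; the rest run from system RAM
    n_ctx=8192,                      # context window
)

out = llm("Write a haiku about quantization.", max_tokens=64)
print(out["choices"][0]["text"])
```

Setting `n_gpu_layers=-1` offloads every layer to the GPU, so you only dial it down as far as your VRAM forces you to.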

Additionally, exl2 has the nice advantage of offering Q4 quantization of the context cache, letting you stretch the VRAM you devote to context much further (not really useful with only 8k context, but worth keeping in mind as a notable difference).
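If you're curious what that looks like in code, here's a minimal sketch using the exllamav2 Python API (the model directory is a placeholder, and the Q4 cache class assumes a reasonably recent exllamav2 release):

```python
# Sketch: loading an exl2 quant entirely into VRAM with a Q4-quantized KV cache
# (pip install exllamav2).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/model-exl2-8.0bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)  # Q4 cache: roughly 4x less VRAM for context than FP16
model.load_autosplit(cache)                  # spread the weights across available GPUs, all in VRAM

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
print(generator.generate_simple("Write a haiku about quantization.", settings, num_tokens=64))
```

Swapping `ExLlamaV2Cache_Q4` back to the default `ExLlamaV2Cache` is the only change needed to return to an FP16 cache.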
