Differences between GGUF and EXL2 quants?

#3
by smhf72 - opened

I see the GGUF quants do not use the specialized parquet which is used to create the exl2 quants, but what does this mean in terms of actual behavioral differences?

Because of my limited VRAM, I'm only able to run 34B parameter models through the partial GPU offloading that GGUF quants allow (or can exl2 quants do this as well?). If the exl2 quants behave significantly better due to the use of this parquet, though, then perhaps it would be worthwhile to look into CPU inference (my CPU is significantly newer, with 64GB of DDR5).

I haven't used exl2 quants much. Is there a specific backend/API that is now recommended for speed, or is Ooba still the standard go-to?

Stability: Less prone to making unnecessarily long (or sometimes infinite) replies when the scene doesn't call for it, and less likely to spew junk data such as code, gibberish, random website tangents, user/character profiles word-for-word, etc.

I can only say for certain that the parquet I use is superior, at least on my models, when it comes to exl2 vs other exl2/gguf quants, since I don't know how to make ggufs myself. Even the ggufs I hosted on this model page that don't use my parquet still have these issues in testing (long/infinite replies, junk such as "\n\nEND" being posted) when using the same settings from the quants using Kalomaze's semi-random groups_merged.txt. And just for reference, I've recently been running a scenario with over 200 swipes to test out other things such as system prompting, and none of them have failed or gone off the deep end the way the quants using Kalomaze's data did in only 10 test swipes.

@MarinaraSpaghetti has said they will make ggufs with my parquet sometime soon to help confirm whether it's stable across different quant practices, but I have no idea when those gguf quants will arrive (maybe this week, idk).

As for backend/API, people seem to enjoy TabbyAPI, but I still use Ooba. Also, exl2 only uses GPU VRAM and can't be offloaded to CPU.
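
On the partial offloading side, here is a rough back-of-the-envelope sketch for guessing how many GGUF layers might fit on the GPU. It is only an illustration: the function name `estimate_gpu_layers`, the assumption that every layer is the same size, the `reserve_gb` headroom for context cache and overhead, and the example numbers are all made up here and are not taken from llama.cpp or any backend.

```python
def estimate_gpu_layers(model_file_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Rough guess at how many layers of a GGUF quant fit in VRAM.

    Assumptions (hypothetical, for illustration only):
    - every layer takes an equal share of the file size
    - reserve_gb is enough headroom for context cache and other overhead
    """
    if model_file_gb <= 0 or n_layers <= 0:
        raise ValueError("model_file_gb and n_layers must be positive")

    per_layer_gb = model_file_gb / n_layers       # average size of one layer
    usable_gb = max(vram_gb - reserve_gb, 0.0)    # VRAM left after headroom
    fit = int(usable_gb // per_layer_gb)          # whole layers that fit
    return min(n_layers, fit)                     # never more than the model has


# Hypothetical example: a 20 GB quant with 40 layers on a 12 GB card.
print(estimate_gpu_layers(20.0, 40, 12.0))
```

The result is only a starting point for the GPU layers setting (`--n-gpu-layers` in llama.cpp); in practice you would nudge it up or down until the model loads without running out of VRAM. Whatever doesn't fit stays in system RAM, which is the part exl2 can't do.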
