blocky blocky blocky

#1
by mclassHF2023 - opened

This is probably not the GGUF's or anyone's fault, but I'm running into this "blocky blocky blocky" issue (the model just repeats "blocky" endlessly) in oobabooga, and I can't test the unquantized model.
It seems to run fine in LM Studio, so I assume oobabooga just needs to update something. I just wanted to know if others are running into this too; if so, I can suggest LM Studio for now.

Probably a missing update, but I also think you need to avoid CUDA offloading for now.

I didn't do any offloading to CPU, if that's what you mean?
I tested an exl2 quantization at 4bpw and that worked perfectly, so I think it's probably something in Text Generation WebUI and a missing update of a library (llama.cpp or similar).

No, you want to do no offloading to the GPU, i.e. leave it all on your CPU.

The bug can appear with exl2 as well, though I'm not sure why it doesn't show up every time.

You can also enable flash attention for llama.cpp, which should work around the issue.
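For anyone loading the GGUF through llama-cpp-python directly, here is a minimal sketch of the two settings suggested above. This assumes a recent llama-cpp-python build whose `Llama` constructor accepts `n_gpu_layers` and `flash_attn`; the model path is a placeholder.

```python
# Hedged sketch: settings for llama-cpp-python, assuming its Llama
# constructor supports these keyword arguments (recent builds do).
settings = {
    "model_path": "model.gguf",  # placeholder path, substitute your own
    "n_gpu_layers": 0,           # no GPU offloading: keep all layers on the CPU
    "flash_attn": True,          # flash attention, the reported workaround
}

# Uncomment once llama-cpp-python is installed and the path is real:
# from llama_cpp import Llama
# llm = Llama(**settings)
```

The equivalent llama.cpp CLI flags would be `-ngl 0` and `--flash-attn`, if you run the server binary instead.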
