GGUF (Q4_K_M only) outputs gibberish

#3
by sergkisel3v - opened

GGUF outputs random characters in koboldcpp 1.67 or the latest oobabooga.
Testing with a split between a Tesla P40 and RAM, using Q4_K_M.
Enabling flash attention or disabling MMQ doesn't help.

Probably a Qwen2 support problem in general?

related:
https://github.com/LostRuins/koboldcpp/issues/909
https://github.com/ggerganov/llama.cpp/issues/7939

> koboldcpp 1.72

Are you from the future? The latest version is 1.67... I converted it first to bf16 and then to Q6_K, and it works in kobold 1.65.
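
For reference, the two-step conversion I mean looks roughly like this (paths and filenames are placeholders; depending on your llama.cpp checkout the script may be named `convert-hf-to-gguf.py` and the quantizer just `quantize`):

```shell
# Step 1: convert the HF checkpoint to a bf16 GGUF
python convert_hf_to_gguf.py ./model-dir --outtype bf16 --outfile model-bf16.gguf

# Step 2: quantize the bf16 GGUF down to Q6_K with llama.cpp's quantize tool
./llama-quantize model-bf16.gguf model-Q6_K.gguf Q6_K
```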

ah yeah, 1.67 - typo.

I checked with the latest pure llama.cpp and it doesn't work either.

> The latest version is 1.67... I converted it first to bf16 and then to Q6_K, and it works in kobold 1.65.

Are you splitting between GPU and CPU? Without splitting it appears to work. Also, maybe it works only on RTX cards.

0 layers on GPU, CuBLAS enabled.

Probably it's a problem with split or GPU-only inference.
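
For anyone wanting to reproduce this configuration, a launch along these lines is what I mean (the model filename is just an example; `--usecublas` and `--gpulayers` are koboldcpp's flags for CuBLAS and GPU layer count):

```shell
# CuBLAS enabled but all layers kept on CPU/RAM (0 offloaded to GPU)
python koboldcpp.py --model model-Q4_K_M.gguf --usecublas --gpulayers 0
```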

Also, I downloaded the GGUF from the alpindale repo.


I tried with flash attention both enabled and disabled.

It seems yet another person has this problem here: https://github.com/LostRuins/koboldcpp/issues/909

He said the model outputs gibberish on Q4_K_M with or without offloading.

Maybe the problem is in the Q4_K_M quants. I can't test Q6_K.


Q4_K_M on a different PC, RAM only.

Doesn't work either.

IIRC, that ^ in textgen is a sampling-settings issue. I remember getting it last year.

FWIW, the EXL2 quants work; the <3.0 quants sometimes output random Chinese text. I made myself an EXL2 5BPW and it's great. Feels like a dumber version of claude3-opus.

> IIRC, that ^ in textgen is a sampling-settings issue. I remember getting it last year.

I tried different sampler settings on different backends. It doesn't work no matter what.

Btw, I switched to IQ4_XS and it works great. The model is really good too, better than the Llama finetunes, I think.

Cool, hopefully that helps others using GGUF ^

Yeah it's a great model. I don't like any of the llama3 models so far.

sergkisel3v changed discussion title from GGUF outputs gibberish to GGUF (Q4_K_M only) outputs gibberish

I get similar issues with EXL2 4.5BPW... Not complete gibberish, but it will often switch up PoV. For example, it will sometimes speak in the first person ("I look at you"), sometimes in the third person ("She speaks to him"), and sometimes just doesn't speak correctly ("adjusts hands. wrinkles skirt."). Then it will randomly spew math problems in the response, e.g. "I look at you and smile 3+2=5 squares finished 'Hi there'".

Some weird stuff going on haha

Oh, I haven't had that issue. I assume you're using ChatML like the model card says?

> I get similar issues with EXL2 4.5BPW... Not complete gibberish, but it will often switch up PoV, sometimes speaking in the first person ("I look at you"), sometimes in the third person ("She speaks to him"), and sometimes just not speaking correctly ("adjusts hands. wrinkles skirt."). Then it will randomly spew math problems in the response.
>
> Some weird stuff going on haha

I was having the exact same issues. I read elsewhere about changing up sampler settings, and what @alpindale posted in a separate thread is exactly what fixed it for me.

He recommends using only Min-P ~0.06 and temperature ~1.1.

What I did was neutralize all samplers, then set Min-P to 0.1, dynamic temperature 0.8-1.6 with exponent 1.45, smoothing factor 0.25 with smoothing curve 1.85, and temperature last.
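
For anyone unsure what the Min-P filter actually does: it drops every token whose probability is below `min_p` times the probability of the most likely token, then renormalizes and samples. A plain-Python sketch (the function name and structure are mine, not any backend's code):

```python
import math
import random

def min_p_sample(logits, min_p=0.1, temperature=1.0):
    """Sketch of min-p sampling over raw logits."""
    # Softmax with temperature applied first
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Min-p filter: keep tokens with prob >= min_p * (top token's prob)
    cutoff = min_p * max(probs)
    filtered = [p if p >= cutoff else 0.0 for p in probs]
    s = sum(filtered)
    weights = [p / s for p in filtered]
    # Sample from the surviving tokens
    return random.choices(range(len(probs)), weights=weights)[0]
```

With a spiky distribution almost everything gets filtered out, which is why a low Min-P plus moderate temperature tends to cut the gibberish tail without flattening the model's choices.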

The model immediately became unbelievably good, whereas before, with my normal Llama 3 sampler settings, it just... felt very off, despite still seeing how smart it could be.
