Update: Quantization is Good. Errors were due to something else.

#2
by deleted - opened · edited May 8

I'm using the Llama 3 8b Instruct provided by GPT4All, which is only Q4_0 (link below), and at temp 0 it reliably hallucinates far less than all the Q4_K_M, and even Q5_K_M (update: even Q8_0), quants I've tested with various builds of llama.cpp, for example when asking for a list of the main characters, and the actors who portrayed them, in various TV shows and movies.

https://gpt4all.io/models/gguf/Meta-Llama-3-8B-Instruct.Q4_0.gguf

I've been trying to figure out why, because with all past LLMs, including Llama 2, Mistral, and Mixtral, the Q4_K_M versions at temp 0 always hallucinated notably less than any Q4_0 version. This is the first time a Q4_0 version has hallucinated not just less than any Q4_K_M at the fringes of an LLM's pop culture knowledge, but much less, even compared to Q5_K_M versions. Since it happens even with Q8_0, it's probably not due to the use of imatrix.

A lot of people are complaining about the hallucinations of Llama 3 8b Instruct, such as on Reddit, but I don't think Llama 3 8b itself is to blame. There has to be something wrong with how llama.cpp is handling Llama 3 8b Instruct, because the tiny Q4_0 above hallucinates far less than Mistral 7b, Gemma 7b, or any other similarly sized LLM.

https://www.reddit.com/r/LocalLLaMA/comments/1cdmjg1/llama3_is_probably_has_the_most_hallucinations_of/

deleted changed discussion status to closed

Can I ask what your frontend is? An issue was recently uncovered where some frontends would double up the BOS token, and that would completely destroy the reliability of the output.
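To illustrate the kind of bug I mean, here's a minimal sketch of how a frontend can double the BOS token. The template string is a simplified stand-in, not the exact Llama 3 chat template, and the function names are invented for illustration:

```python
# Hypothetical sketch: how a frontend can end up with a doubled BOS token.
# Llama 3's chat template already begins with <|begin_of_text|>; if the
# frontend ALSO prepends BOS before tokenizing, the model sees it twice,
# which badly degrades output quality.

BOS = "<|begin_of_text|>"

def apply_chat_template(user_msg: str) -> str:
    # Simplified Llama 3 Instruct template (illustrative, not exact).
    return (
        f"{BOS}<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def frontend_prompt(user_msg: str, add_bos: bool) -> str:
    templated = apply_chat_template(user_msg)
    # Buggy frontends prepend BOS unconditionally, even though the
    # template already contains it.
    return (BOS + templated) if add_bos else templated

buggy = frontend_prompt("List the main characters of Corner Gas.", add_bos=True)
fixed = frontend_prompt("List the main characters of Corner Gas.", add_bos=False)

print(buggy.count(BOS))  # 2 -> doubled BOS, degraded output
print(fixed.count(BOS))  # 1 -> correct
```

A quick sanity check like counting `<|begin_of_text|>` occurrences in the final prompt string is an easy way to catch this in any frontend.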

That is very interesting that it's hallucinating so much more.

I do know that this model specifically seems to have some awkward issues that haven't fully been uncovered, I assume it's not purely related to this Gradient 1048k one?

bartowski changed discussion status to open
deleted

Shit. It was the frontend. I waited for Koboldcpp to update to the latest llama.cpp, thinking it was just a bug in GPT4All, and I thought I had set the settings correctly (temp 0), but for some reason when I re-checked they were still at the defaults (which are really bad for hallucinations at the fringe of knowledge, which is what I was trying to test).
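A minimal sketch of why greedy (temperature-0) decoding matters for this kind of factual-recall test. The token names and logit values are invented; real samplers also apply top-k/top-p, which this omits:

```python
# Toy next-token sampler: temperature 0 means greedy (argmax), which is
# deterministic; default temperatures let lower-probability tokens through.
import math
import random

def sample(logits: dict, temperature: float, seed: int = 0) -> str:
    if temperature == 0:
        # Greedy decoding: always pick the highest-logit token.
        return max(logits, key=logits.get)
    rng = random.Random(seed)
    scaled = {tok: lg / temperature for tok, lg in logits.items()}
    z = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(lg - z) for tok, lg in scaled.items()}
    r = rng.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok

# Suppose "Burrows" is the model's top choice for Lacey's surname; at a
# default temperature (e.g. 0.7) the tail tokens can still be picked,
# producing names like "Burdette". At temp 0 the top choice always wins.
logits = {"Burrows": 3.0, "Burdette": 2.2, "Tarantino": 1.5}
print(sample(logits, temperature=0))  # always "Burrows"
```

This is why leaving the sampler at its defaults made the fringe-of-knowledge answers look so much worse than they really were.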

I applied the right settings in Koboldcpp, restarted, then double-checked that they were properly set. Then I tested Q8_0 and it performed as expected. The following is an example. It's not a perfect character list for Corner Gas, but it's much better than before (the earlier output is pasted below it). It seems llama.cpp has worked out some bugs.

Notice Lacey Burrows vs Burdette (Burrows is correct), Emma Leroy vs Tarantino (Leroy is correct), and so on. Thank goodness. These hallucinations have been driving me crazy for the last two weeks. I'm glad to see it's not Llama 3 8b itself that was causing them.

Brent Leroy (played by Brent Butt)
Oscar Leroy (played by Eric Peterson)
Lacey Burrows (played by Gabrielle Miller)
Emma Leroy (played by Tara Spencer-Nairn)
Hank Yarbo (played by Fred Ewanuick)
Wanda Dollard (played by Nancy Robertson)

  1. Brent Leroy (played by Brent Butt)
  2. Lacey Burdette (played by Gabrielle Miller)
  3. Hank Yarbo (played by Peter Stebbings)
  4. Emma Tarantino (played by Nancy Robertson)
  5. Davis Quinton (played by Eric Peterson)
  6. Wanda Burdette (played by Erica Cerra)

Absolutely brilliant choice of shows to test it on, love Corner Gas :D

Glad it's working better for you now!

deleted changed discussion title from Potential Quantization Issue to Update: Quantization is Good. Errors were due to something else.
deleted changed discussion status to closed
