SanjiWatsuki/Silicon-Maid-7B high perplexity

#70
by yttria - opened

I would like to report an issue with the perplexity of mradermacher/Silicon-Maid-7B-GGUF. After noticing that it produces subjectively worse results than RichardErkhov's and TheBloke's versions, I decided to test its perplexity to investigate.

mradermacher/Silicon-Maid-7B.Q8_0.gguf > PPL = 15.3851 +/- 0.13562
RichardErkhov/Silicon-Maid-7B.Q8_0.gguf > PPL = 6.9562 +/- 0.04647

It is clear that there is a significant difference in performance. I would appreciate it if you could look into the cause of this issue and determine if other models are also affected, to ensure the quality and reliability of your quants moving forward.
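For reference, numbers like these come from llama.cpp's perplexity tool on the wikitext-2 test set; the comparison looks roughly like this (file names here are placeholders, and newer builds call the binary llama-perplexity):

./perplexity -m mradermacher_Silicon-Maid-7B.Q8_0.gguf -f wiki.test.raw
./perplexity -m RichardErkhov_Silicon-Maid-7B.Q8_0.gguf -f wiki.test.raw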


Static quants should be pretty much identical, other than depending on the version of llama.cpp used. If there are significant differences, this is a llama.cpp upstream issue and must be reported there.

mradermacher changed discussion status to closed

A cursory look shows that the tokenizers are quite different. Unless the model changed over time, this would indicate an issue with llama.cpp's conversion script. Given that the llama.cpp developers think what we do here is useless, good luck getting that fixed.
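One way to see this for yourself: the tokenizer metadata of two GGUF files can be diffed with the dump script that ships in llama.cpp's gguf-py (script name and location vary between versions, and the file names below are placeholders, so treat this as a sketch):

python3 gguf-py/scripts/gguf_dump.py mradermacher_Silicon-Maid-7B.Q8_0.gguf | grep tokenizer > a.txt
python3 gguf-py/scripts/gguf_dump.py RichardErkhov_Silicon-Maid-7B.Q8_0.gguf | grep tokenizer > b.txt
diff a.txt b.txt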

Galunid showed there is no issue with llama.cpp or the conversion script. https://github.com/ggerganov/llama.cpp/issues/7550#issuecomment-2132547918

Good news then, it's fixed.

The issue with your quants is not fixed, which is what is being reported here. A significant number of your other quants could be affected.

There is no issue with my quants - they were made with an older version of llama.cpp that generated worse results than the current one. The same is true for tens of thousands of quants on huggingface. You can request a requant with the current version if you wish, and I will consider it, but fuzzy whataboutism is not going to help. I wish I could just remake the petabyte of quants every time llama.cpp gets a bugfix or improvement, but not being able to do that doesn't invalidate older quants.

And if you want to go hunting and make a list of affected models, be my guest - I can try to requant them as well, hoping they get better.

I remade a q8_0 quant with the version of llama.cpp from the day you made your quants, and it has no perplexity issues.
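(In case anyone wants to reproduce this: checking out llama.cpp as of a given date and requanting looks roughly like the following - the date and paths are placeholders, and older trees use convert.py while newer ones use convert_hf_to_gguf.py.)

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
# placeholder date - pick the day the original quants were made
git checkout "$(git rev-list -n 1 --before='2024-05-01' master)"
python3 convert.py /path/to/SanjiWatsuki/Silicon-Maid-7B --outtype q8_0 --outfile Silicon-Maid-7B.Q8_0.gguf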

Then we have a good window for when it must have been fixed - I usually update llama.cpp at least once per week, so the build I used could have been up to a week older than the one from that day. Or it's still buggy but the trigger conditions are more complex - there are essentially no user-accessible knobs in the process.

I have requeued this model, to see if the current llama.cpp generates the same tokenizer output with the set-up from then. Should be done in a few hours.

The tokenizer in the new quant looks like the one in the newer quants from richard. Since my quantizer script didn't change, and I used the same settings (which are recorded in the model card), this shows it's a bug in the older version of llama.cpp in use at the time. If you want to track down more models, I would suspect that other mistral or mixtral models might be good candidates. It's unlikely that a lot of models are affected, as at the time the converter scripts were severely reworked for the llama 3 tokenizer issues.

(I did not measure the perplexity)

Found that mradermacher/Mixtral-8x22B-Instruct-v0.1-i1-GGUF has more than double the perplexity of miqudev/miqu-1-70b even though they're supposed to be similar in performance. I didn't have time to run the whole perplexity test, but the first few values are usually enough for comparison. Not sure if it's the quant or if Mixtral-8x22B has naturally high perplexity. Can you please check the perplexity of the unquantized version? Just the first few values would be enough.

mradermacher/Mixtral-8x22B-Instruct-v0.1.i1-Q4_K_M.gguf
[1]6.9369,[2]12.4542,[3]12.9459,[4]9.7428,[5]7.9167,[6]7.1384,[7]6.8369,[8]6.6349,[9]6.5357,[10]7.1619,[11]7.7357,[12]7.9175
Average: 8.3298

miqudev/miqu-1-70b.q4_k_m.gguf
[1]3.0451,[2]3.3347,[3]3.8709,[4]3.6039,[5]3.6805,[6]3.6812,[7]3.7841,[8]3.8242,[9]3.9701,[10]4.0454,[11]4.1749,[12]4.2203
Average: 3.7696

./perplexity -m <model.gguf> -f wiki.test.raw
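(For what it's worth, the averages above are just the mean of the per-chunk values printed by the perplexity tool, e.g.:)

echo '[1]6.9369,[2]12.4542,[3]12.9459,[4]9.7428,[5]7.9167,[6]7.1384,[7]6.8369,[8]6.6349,[9]6.5357,[10]7.1619,[11]7.7357,[12]7.9175' \
  | tr ',' '\n' | sed 's/^\[[0-9]*\]//' | awk '{ s += $1 } END { printf "Average: %.4f\n", s/NR }'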

You can't use perplexity like that. You need to compare exactly the same model - absolute perplexity depends on the base model and its tokenizer, so comparing two different models says nothing about the quality of their quants.

I found a quant with higher perplexity and a larger file size than an alternative. What could be the cause?

mradermacher/maid-yuzu-v8.Q4_K_M.gguf
[1]3.4803,[2]4.1019,[3]4.7694,[4]4.9542,[5]4.9439,[6]4.9415,[7]5.0973,[8]5.0721,[9]5.2269,[10]5.4655,[11]5.7022,[12]5.6706
Average PPL: 4.9522
File size: 28.6 GB

InferenceIllusionist/maid-yuzu-v8-Q4_K_M.gguf
[1]3.4700,[2]4.0915,[3]4.7481,[4]4.9293,[5]4.9214,[6]4.9241,[7]5.0803,[8]5.0560,[9]5.2099,[10]5.4496,[11]5.6941,[12]5.6544
Average PPL: 4.9357
File size: 28.4 GB

My version was probably done without quantising the output tensor, so it likely has slightly higher quality and a slightly larger file size. All my older quants kept the output tensor unquantised.

Addendum: also, InferenceIllusionist quantised the source twice, first to f16, then to q4_k_m (according to his model card), causing extra quality loss, while I only quantised once, preserving more fidelity to the original model.

Neither should make much of a difference in practice.
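To illustrate the output-tensor point (this is not necessarily the exact pipeline used for these quants, and the flag depends on the llama.cpp version - newer builds call the binary llama-quantize):

# output.weight left unquantised - slightly larger file, slightly closer to the original weights
./quantize --leave-output-tensor model-f16.gguf model.Q4_K_M.gguf Q4_K_M
# default - output.weight quantised along with everything else
./quantize model-f16.gguf model.Q4_K_M.gguf Q4_K_M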

Inferenceillusionist's quant has lower perplexity, which indicates his version may have slightly higher quality than yours.

You seem to confuse perplexity with quality. They are not the same. It's possible that the version of llama.cpp I used would create lower quality quantisations, but the facts are that Inf-I. quantized twice (which loses fidelity to the original model) and my version did not quantize the output tensor (which also guarantees higher fidelity to the original model). That explains the size differences and can also explain the insignificant perplexity differences, because the quants are not identical. This answers your question.
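If you want to measure fidelity to the original model directly instead of proxying it through perplexity, llama.cpp's perplexity tool can compute the KL divergence of a quant against the unquantised model's logits - roughly like this (file names are placeholders and the flags may differ between versions):

# save the unquantised model's logits
./perplexity -m maid-yuzu-v8-f16.gguf -f wiki.test.raw --kl-divergence-base maid-yuzu-v8.kld
# compare a quant against them
./perplexity -m maid-yuzu-v8.Q4_K_M.gguf --kl-divergence-base maid-yuzu-v8.kld --kl-divergence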
