Thank you again + Request

#1
by Yuma42 - opened

I tried your IQ4_XS and it works 🎉
I also linked to your versions in my top models so that they are easier to discover.

Since you said that you do imatrix quants only upon request, I want to make a request for a single model: imatrix Q4_K_S for RawRuby. It's the model I'm using myself, so having it improved by imatrix would be awesome.

Sure (let's just generate them all - IQ4_XS is also a good target :)

mradermacher changed discussion status to closed

Oh yes I know that there seem to be a few people who use RawRuby because it got a few hearts. Thanks for linking that πŸ‘πŸ»
Unfortunately, at least the variant I tested makes typos in some specific cases. The client I'm using has a bugged prompt template; I'm waiting for that to be fixed before testing again, but my Q4_K_S version does not have that problem despite the little mistake in the prompt template.

I'm going to test mradermacher/Nous-Hermes-2-Mixtral-8x7B-DPO-i1-GGUF on my phone with the most extreme quant you did, wish me luck 😂 I don't expect it to be better than Q4 RawRuby, I just want to see if I can even run it.
Update: it didn't work, and I'm not surprised at all πŸ˜„

Wrong prompt templates should not normally cause typos. Low quantizations do, though, especially the very low quants (<= Q2). So can tokenizer mismatches.

If I had to guess: if the topic is about stuff which the model doesn't know well, then the weighted imatrix variant is weaker than the basic variant and starts making typos. I will try the weighted IQ4_XS later; in theory it should be weaker than the i1 Q4_K_S, but maybe it turns out not to have the typo problem.

Fascinating, especially if it turns out to be true.

Well, I was just about to test it more, but I can't reproduce the typos with the i1 Q4_K_S. Not sure if I should be happy about it or not haha. If I find something out I'll let you know; if you don't hear from me, the typos probably stopped appearing.

OK, I could reproduce it. It is very specific: the chat history is important, it decides whether the model can write a specific band name correctly or not. I kept the chat history and tried it with 3 models, 3 times each. The name of the band consists of English words which the model can write correctly in other contexts.

Yuma42 Q4_K_S:
Correct information about the band: 2 out of 3 times.
Typo in the band name: 0 out of 3 times.

i1 Q4_K_S:
Correct information about the band: 0 out of 3 times.
Typo in the band name: 3 out of 3 times.

i1 IQ4_XS:
Correct information about the band: 0 out of 3 times.
Typo in the band name: 3 out of 3 times, and butchered it more than the i1 Q4_K_S.

Next I'm downloading your static Q4_K_S to see if it behaves like mine or not.

Update:

With your static Q4_K_S quant:
Correct information about the band: 0 out of 3 times.
Typo in the band name: 3 out of 3 times.

So it seems that it's not the imatrix that causes this problem. I can send you the link to the notebook which I used to make the GGUF later; maybe it helps you, but that's above my knowledge.

While there is some variation in imatrix quants (because of different training data), the static quants would be identical regardless of who creates them. So if you see differences between the same static quants, it's likely due to inherent randomness or because different settings are used. It could also be due to a broken model/tokenizer, or different llama.cpp versions. But unless your llama.cpp version is very old, that also shouldn't make a difference (nothing has changed for Q4_K_S in a long time).

I.e., if you see differences between static Q4_K_S quants from different sources, it's unlikely to be due to the quant, as the quantisation is identical.

The first step I would take is to make absolutely sure the same settings/seed etc. are used for comparisons.
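
(A minimal sketch of what pinning those settings could look like, using llama-cpp-python; the model path, prompt, and seed are placeholders, not taken from this thread.)

```python
# Sketch: run the same prompt against a quant with pinned sampling settings,
# so two quants can be compared without sampling randomness.
# Model path, seed, and prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="KangalKhan-RawRuby-7B.Q4_K_S.gguf", seed=42, verbose=False)
out = llm(
    "... same chat history / prompt for every quant ...",
    temperature=0.0,   # greedy decoding removes sampling randomness
    max_tokens=128,
)
print(out["choices"][0]["text"])
```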

If that is ensured, one should compare the tokenizers - if a model supports multiple tokenizers (e.g. BPE and SPM), one could be broken and the other working, which could explain differences.

I can really exclude randomness as the reason, because my Q4_K_S never produces the misspelling, and with the same chat history I got it on the first try with your Q4_K_S (I changed the model and did a regenerate).
It should be possible to compare hash sums between your static Q4_K_S and mine if they are supposed to be identical, right? We could try that.

If my memory is right, this is what I used to make the GGUF of that model:
https://colab.research.google.com/drive/1P646NEg33BZy4BfLDNpTz0V0lwIU3CHu?usp=sharing

All I changed was the model ID and the quantization method (to Q4_K_S).

The quantisation is identical, not the files. The files contain timestamps and will not match exactly.
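
(To illustrate that claim, a sketch assuming the gguf Python package: compare per-tensor metadata and raw tensor data instead of whole-file hashes; the file names are placeholders.)

```python
# Sketch: compare the actual quantized tensors of two GGUF files, ignoring
# header metadata that may legitimately differ between uploads.
# Requires `pip install gguf`; file names are placeholders.
import hashlib
from gguf import GGUFReader

def tensor_digest(path):
    reader = GGUFReader(path)
    return {
        t.name: (str(t.tensor_type),                            # quantization type
                 tuple(int(d) for d in t.shape),                # tensor shape
                 hashlib.sha256(t.data.tobytes()).hexdigest())  # raw quantized weights
        for t in reader.tensors
    }

# If static quantisation is deterministic, these should compare equal.
print(tensor_digest("my.Q4_K_S.gguf") == tensor_digest("their.Q4_K_S.gguf"))
```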

This is not a question of proving or believing; there is no randomness introduced when doing static quantisation, it's deterministic.

As for autogguf, it is indeed different as it does two quantisations (first to f16, then to the final one), while I don't quantise models during conversion but keep them in the original precision. I also don't know what version of llama.cpp it uses - if it is really old, then there will have been changes (mostly bugfixes).

Newer llama.cpp also autodetects the tokenizer, so if there are multiple tokenizers, mine might have picked a bad one.

But those are basically the only differences there can be.

I see, I misunderstood you; I thought by randomness you were talking about things like the effect of temperature during my test run.

I'll try to look at the headers to see what tokenizer model was used. Maybe that explains any difference.

Ah, wait, you haven't uploaded your Q4_K_S anywhere, right? In that case, have a look at the KV header values to see if you spot a difference. Too bad convert.py does not store the vocabulary type.
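
(A sketch of one way to dump those KV fields with the gguf Python package; exact field access may differ between gguf-py versions, and the file name is a placeholder. The package also ships a gguf_dump script that prints the same information.)

```python
# Sketch: print the tokenizer-related KV header fields of a GGUF file,
# e.g. tokenizer.ggml.model, which names the vocabulary type that was baked in.
# Requires `pip install gguf`; the file name is a placeholder.
from gguf import GGUFReader, GGUFValueType

reader = GGUFReader("KangalKhan-RawRuby-7B.Q4_K_S.gguf")
for name, field in reader.fields.items():
    if not name.startswith("tokenizer."):
        continue
    if field.types and field.types[0] == GGUFValueType.STRING:
        # String values are stored as raw bytes in the last part of the field.
        print(name, "=", bytes(field.parts[-1]).decode("utf-8"))
    else:
        print(name, "(array or non-string field, skipped)")
```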

You mean this? I have it linked on my model page:
https://huggingface.co/Yuma42/KangalKhan-RawRuby-7B-GGUF

I mean that, yes :)

The tokenizers are definitely different. What vocabulary/tokenizer does the model officially use?

My model should inherit these things from argilla/CapybaraHermes-2.5-Mistral-7B, which is based on teknium/OpenHermes-2.5-Mistral-7B, which adds the ChatML tokens to mistralai/Mistral-7B-v0.1, as far as I know. So I would expect my model to work as a drop-in replacement for OpenHermes 2.5; that was my goal. I didn't declare any of these things manually, but instead controlled it by choosing those specific models as my base. But I was under the impression that the JSON files have the complete information and autogguf got what it needed from them.
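
(A quick, hypothetical way to double-check that inheritance on the Hugging Face side, assuming the non-GGUF repo is Yuma42/KangalKhan-RawRuby-7B: look up the ChatML tokens in the tokenizer.)

```python
# Sketch: verify the ChatML tokens added by OpenHermes exist as single tokens
# in the inherited tokenizer. The repo name is an assumption (the GGUF repo
# name minus the -GGUF suffix).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Yuma42/KangalKhan-RawRuby-7B")
for special in ("<|im_start|>", "<|im_end|>"):
    # A properly added token tokenizes to a single piece with its own id;
    # otherwise it gets split into several sub-tokens.
    print(special, tok.convert_tokens_to_ids(special), tok.tokenize(special))
```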

Wouldn't a wrong tokenizer lead to many more mistakes, like garbage output?

The information is unfortunately not available in any JSON file, and convert.py will just guess. If the guess is wrong, the resulting model might crash or malfunction. autogguf does not look at the information at all, nor does it have any logic. It simply calls convert (which is wrong for most models, but is the correct one for yours). Unless it uses a very old version, it would choose the same vocabulary type though (in fact, old versions always try to use SPM, if I remember correctly; only newer versions try to detect the vocabulary).

So, by all indicators, what you see is likely one of these: a fluke; your settings are not identical after all; a bug in newer llama.cpp versions; or a bugfix in newer llama.cpp versions that actually causes this. There have been reports about the latter two, where perplexity actually went up, but nothing substantiated.

Ah, and a wrong tokenizer leads to whatever the wrong tokenizer leads to. The wrong one might work better, or worse. Usually to crashes, or really bad results :)
