Imatrix MMLU-Pro differences

#1
by saishf - opened

Out of curiosity I decided to compare your non-imatrix and imatrix quants of llama3-8B-Instruct using this tool with LM-Studio-0.2.27, running both through the MMLU-Pro test.
Non I-matrix @ Q5_K_M

| biology | business | chemistry | computer science | economics | engineering | health | history | law | math | other | philosophy | physics | psychology | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 57.32 | 24.71 | 14.93 | 29.27 | 43.84 | 19.40 | 33.62 | 25.20 | 15.71 | 13.92 | 35.82 | 37.07 | 14.32 | 49.12 | 27.25 |

I-Matrix @ Q5_K_M

| biology | business | chemistry | computer science | economics | engineering | health | history | law | math | other | philosophy | physics | psychology | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54.81 | 22.94 | 13.43 | 18.78 | 35.78 | 19.92 | 32.76 | 20.47 | 15.89 | 14.43 | 40.48 | 36.07 | 11.93 | 46.87 | 25.74 |

That works out to a ~5.7% worse total score for the I-matrix Q5_K_M, with a much larger gap in computer science, roughly 43% worse.

Fascinating, I wonder if that means you can really tune out brain parts using just an imatrix - the imatrix training data contains very little science... but also no law or math, yet a whole lot of "other" :) That is especially significant as Q5_K_M should not show much difference at all between imatrix and non-imatrix quants.

PS: I re-did this model a few days ago - I assume you tested the current one?

Looking at the quants, there really shouldn't be much of a difference - what's the expected variance on MMLU-Pro? Is 5.7% significant?

This is what I can find mentioning variance in the paper:

we evaluated models using 24 different but reasonable prompts. Figure 5 showcases the score range for different models under varying prompts. On the MMLU benchmark, the influence of these prompts generally ranges between 4-5%, with peaks up to 10.98%. In contrast, on the MMLU-Pro benchmark, the impact of prompt changes is generally around 2%, with a maximum of 3.74%. This reduced variability highlights an improvement in consistency and reliability over the original MMLU benchmark, ensuring more reliable assessments of language models’ capabilities.

And from the GitHub:

Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro.

I have asked for another quant here.
I'll run that through the same test, and that should give a better idea about the variation.

So, what is the conclusion? Or is there none? I still use LM Studio for running OS models, and I test most of mradermacher's. Currently I'm testing L3 Umbral Mind RP v3, among others. The best so far in my personal results, which blew my mind, are Llama 3 Lexi Smaug Uncensored (excellent, excellent) and L3 Uncen Merger Omelette RP v0.1 (this really blew me out of the water 😎).

Bartowski's quants are readily available, too (and from around the original release date).

Interesting observation, Sai. I personally don't believe the imatrix calibration data can butcher/align the original weights that significantly -- though we've seen that more extreme cases can certainly overfit and make the model go crazy, so maybe it's not as unheard of as I thought -- but I'm still hopeful it might be the case. It could be something else to look into, and comparing similar quants to make sure is a good idea. I should upload a Q5_K_M for that in a little bit, made with a very standard imatrix data sample.


Update:
File is up.

Well, if the same tool is used for measuring, further comparisons against other models would keep showing the same pattern, so we need to investigate whether it's real, because statistics are a way to diagnose and discover things we can't see just by looking. Anyway, if this is true, rather than being bad news, this post may be a way to discover how to measure these quants better and improve them. It's good when we discover mistakes. It's a way to improve. 🙏👍

I'm going to quickly run both new Q5's over the computer science test to get some idea; full tests take a while (roughly 5 hours each x^x)

Cya tomorrow, most likely.

The interesting thing is that the biggest difference an imatrix should be able to introduce on a Q5_K_M quant is the difference between Q5_K and Q6_K (the imatrix pretty much just distributes q5/q6 differently for them). The effect should be much more pronounced on quants <Q4.
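If someone wants to double-check that claim on the files themselves, the per-tensor quant types can be dumped and diffed; a minimal sketch, assuming the `gguf` Python package from the llama.cpp repo and placeholder file names:

```python
# Sketch: diff the per-tensor quant types of a static vs. an imatrix Q5_K_M.
# Assumes `pip install gguf` (llama.cpp's gguf-py package); paths are placeholders.
from gguf import GGUFReader

def tensor_types(path):
    reader = GGUFReader(path)
    # tensor_type is a GGMLQuantizationType enum, e.g. Q5_K or Q6_K
    return {t.name: t.tensor_type.name for t in reader.tensors}

static = tensor_types("Meta-Llama-3-8B-Instruct.Q5_K_M.gguf")     # placeholder path
imat   = tensor_types("Meta-Llama-3-8B-Instruct.i1-Q5_K_M.gguf")  # placeholder path

for name in sorted(static):
    if static[name] != imat.get(name):
        print(f"{name}: static={static[name]}  imatrix={imat.get(name)}")
```

If the only differences are a handful of tensors flipping between Q5_K and Q6_K, that would match the "the imatrix only changes the q5/q6 distribution" picture.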

The quick computer science testing:

Bartowski's Q5

Result for computer science:
Total, 110/410, 26.83%
Random Guess Attempts, 145/410, 35.37%
Correct Random Guesses, 15/145, 10.34%
Adjusted Score Without Random Guesses, 95/265, 35.85%
Markdown Table:

| computer science | total |
|---|---|
| 26.83 | 26.83 |
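As an aside, the "Adjusted Score" line appears to just exclude the questions where the harness had to fall back to a random guess; a quick sketch of that arithmetic using the numbers above (variable names are mine, not the tool's):

```python
# Reproduce the "Adjusted Score" from the result block above:
# 110/410 answers correct overall, 145 answers were random-guess fallbacks,
# and 15 of those guesses happened to be correct.
total_correct, total_questions = 110, 410
guess_attempts, guess_correct = 145, 15

adjusted_correct = total_correct - guess_correct   # 95
adjusted_total = total_questions - guess_attempts  # 265
print(f"{adjusted_correct}/{adjusted_total} = {adjusted_correct / adjusted_total:.2%}")  # 35.85%
```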

Lewdiculous's Q5

Result for computer science:
Total, 103/410, 25.12%
Random Guess Attempts, 125/410, 30.49%
Correct Random Guesses, 9/125, 7.20%
Adjusted Score Without Random Guesses, 94/285, 32.98%
Markdown Table:

| computer science | total |
|---|---|
| 25.12 | 25.12 |

Mradermacher's Q5-Imat run #2
Result for computer science:
Total, 82/410, 20.00%
Random Guess Attempts, 129/410, 31.46%
Correct Random Guesses, 11/129, 8.53%
Adjusted Score Without Random Guesses, 71/281, 25.27%
Markdown Table:

| computer science | total |
|---|---|
| 20.00 | 20.00 |

Mradermacher's Q5-Imat run #3
Result for computer science:
Total, 76/410, 18.54%
Random Guess Attempts, 121/410, 29.51%
Correct Random Guesses, 9/121, 7.44%
Adjusted Score Without Random Guesses, 67/289, 23.18%
Markdown Table:

| computer science | total |
|---|---|
| 18.54 | 18.54 |

Full tests will give more to judge from, but so far all the imatrix quants are inferior to the non-imatrix ones in computer science.
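For a rough sense of whether gaps of this size could be pure question-sampling noise over 410 questions, a simple two-proportion z-test can be sketched. This is a back-of-the-envelope check under a normal approximation, not part of the MMLU-Pro tooling, and it ignores run-to-run variance in the harness itself:

```python
# Back-of-the-envelope two-proportion z-test for two accuracy counts over
# the same set of 410 computer science questions (normal approximation).
from math import sqrt, erf

def two_prop_z(correct_a, correct_b, n=410):
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_a - p_b) / se
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_two_sided

# e.g. Bartowski's Q5 (110/410) vs. the imatrix run #3 (76/410)
print(two_prop_z(110, 76))
```

Plugging in Bartowski's 110/410 against the imatrix run #3's 76/410 comes out around z ≈ 2.8 (p < 0.01), so that particular gap looks hard to explain by question sampling alone, though this says nothing about variance introduced by the harness.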

Edit: the full test for Bartowski's quant is running now; I'll start Lewdiculous's afterwards and post the results as soon as I get the chance :3

Excellent. Please, continue to post these results later. 🙏👍❤️

Is there any table of results from this tool for the top OS LLMs here? It would be interesting if, with time, we could have results for every uploaded LLM, so that we could check for the most "intelligent" one, or some other parameter. 👍

I personally wouldn't compare my results to others'; I can't verify that the testing results are comparable or that the testing conditions are the same. (The GitHub repo claims they will be identical, but I can't definitively say they are.)
System prompt, CoT input, temperature, output length and more can all vary. Most importantly, my models are quantized, so they will perform worse than fp/bf16.
And I don't have the compute to test bigger models with my settings for comparison; I can't go bigger than 11B parameters.
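For what it's worth, the generation settings can at least be pinned when testing against LM Studio's local OpenAI-compatible server; a minimal sketch, where the base URL, model id, prompt and sampling values are placeholders and the actual MMLU-Pro harness may set them differently:

```python
# Sketch: one fixed-settings request against LM Studio's local
# OpenAI-compatible endpoint (default http://localhost:1234/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="local-model",  # placeholder model id
    messages=[
        {"role": "system", "content": "Answer with a single option letter."},
        {"role": "user", "content": "<MMLU-Pro question and options go here>"},
    ],
    temperature=0.0,   # pin sampling
    max_tokens=1024,   # pin output length
)
print(resp.choices[0].message.content)
```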

If you're interested in comparing some open models to closed models you can see that here.
MMLU-Pro-Leaderboard

  • This is another account, I lost my access to my auth tool and Microsoft locked my account for 30 days because I proved my identity and added a recovery email 😶‍🌫️

The thing that somehow makes me doubt these results is that both imatrix and non-imatrix Q5_K_M quants are essentially just a mix of q5/q6 tensor quants, and afaik the quantisation process itself is not affected by the imatrix for these quant types - the imatrix is merely used to decide which tensors to quant as q5 and which as q6. So it is very hard to explain a significant difference that is larger than the normal difference between q5 and q6 quants (which shouldn't be as dramatic as these differences). Of course, there could be very nonlinear effects where a q5 quant ends up better than a q6 quant by pure chance, or a certain combination works out differently (again, by chance), but that would be very spurious and should basically be random between models, sometimes making one quant better than the other.

This is in contrast to lower quantisation types, where e.g. bugs could play a major role (because e.g. iq2 quants are not used in static quants at all). But the code that quantises tensors to q5/q6 is the same for both imatrix and non-imatrix quants, so any bug would affect both types.

There is a chance that these results indicate some big issue, such as llama.cpp applying the imatrix "the wrong way" (e.g. quantising the important tensors with fewer bits rather than more), but then it's equally likely that there is a systematic issue with the test application itself.

I think there are enough results here to warrant a deeper investigation, though - it seems unlikely that these results are due to random chance - at least unlikely enough for somebody (not me :) to spend some time getting to the bottom of this.

Once I have full results I'll open a discussion on the llama.cpp GitHub; I can't come to a conclusion myself as I don't have the knowledge. I'd rather leave it to the devs (I imagine they will probably explain it in a few minutes 😸)

MMLU & MMLU-Pro scores have always been interesting, like how Phi-3-mini-128k-instruct outperforms Nous-Hermes-2-Mixtral-8x7B-DPO in the new open-llm-leaderboard's MMLU-Pro section. Knowing that MMLU-Pro is designed to test broad knowledge and reasoning makes it hard to believe that Phi can outperform a model with more than ten times the parameters.

I also decided to count tokens: the last one and a half runs have processed+generated 48.6 million tokens.
The LM Studio log txt is 1.8 GB 😿

I'll have Bartowski's quant results in 15ish minutes :3

| Source | biology | business | chemistry | computer science | economics | engineering | health | history | law | math | other | philosophy | physics | psychology | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non I-matrix @ Q5_K_M | 57.32 | 24.71 | 14.93 | 29.27 | 43.84 | 19.40 | 33.62 | 25.20 | 15.71 | 13.92 | 35.82 | 37.07 | 14.32 | 49.12 | 27.25 |
| I-Matrix @ Q5_K_M | 54.81 (-4.47%) | 22.94 (-7.42%) | 13.43 (-10.57%) | 18.78 (-43.66%) | 35.78 (-20.24%) | 19.92 (+2.64%) | 32.76 (-2.59%) | 20.47 (-20.71%) | 15.89 (+1.13%) | 14.43 (+3.59%) | 40.48 (+12.21%) | 36.07 (-2.73%) | 11.93 (-18.20%) | 46.87 (-4.68%) | 25.74 (-5.69%) |
| Bartowski @ Q5_K_M | 57.88 (+0.97%) | 20.91 (-16.65%) | 14.66 (-1.82%) | 26.34 (-10.53%) | 39.81 (-9.63%) | 20.95 (+7.68%) | 36.06 (+7.00%) | 24.41 (-3.18%) | 17.17 (+8.88%) | 13.19 (-5.38%) | 39.11 (+8.78%) | 34.27 (-7.84%) | 13.09 (-8.97%) | 48.12 (-2.05%) | 26.89 (-1.32%) |

  • "+" & "-" are the percentage differences from the Non I-matrix quant's score.
| Source | biology | business | chemistry | computer science | economics | engineering | health | history | law | math | other | philosophy | physics | psychology | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non I-matrix @ Q5_K_M | 57.32 | 24.71 | 14.93 | 29.27 | 43.84 | 19.40 | 33.62 | 25.20 | 15.71 | 13.92 | 35.82 | 37.07 | 14.32 | 49.12 | 27.25 |
| I-Matrix @ Q5_K_M | 54.81 (-4.47%) | 22.94 (-7.42%) | 13.43 (-10.57%) | 18.78 (-43.66%) | 35.78 (-20.24%) | 19.92 (+2.64%) | 32.76 (-2.59%) | 20.47 (-20.71%) | 15.89 (+1.13%) | 14.43 (+3.59%) | 40.48 (+12.21%) | 36.07 (-2.73%) | 11.93 (-18.20%) | 46.87 (-4.68%) | 25.74 (-5.69%) |
| Bartowski @ Q5_K_M | 57.88 (+0.97%) | 20.91 (-16.65%) | 14.66 (-1.82%) | 26.34 (-10.53%) | 39.81 (-9.63%) | 20.95 (+7.68%) | 36.06 (+7.00%) | 24.41 (-3.18%) | 17.17 (+8.88%) | 13.19 (-5.38%) | 39.11 (+8.78%) | 34.27 (-7.84%) | 13.09 (-8.97%) | 48.12 (-2.05%) | 26.89 (-1.32%) |
| Lewdiculous @ Q5_K_M | 58.58 (+2.17%) | 22.81 (-7.99%) | 14.13 (-5.50%) | 23.17 (-23.26%) | 40.17 (-8.73%) | 20.74 (+6.67%) | 36.31 (+7.69%) | 20.73 (-19.46%) | 17.26 (+9.40%) | 12.14 (-13.66%) | 39.50 (+9.77%) | 35.67 (-3.84%) | 13.16 (-8.44%) | 49.75 (-1.27%) | 26.89 (-1.32%) |

"+" & "-" are the percentage differences from the Non I-matrix quant's score.


I've decided I'm going to run the test on the non-i-matrix variant again to see how much run-to-run variance there is, as there's no mention of run-to-run variance in any of the MMLU-Pro documentation - only of system-prompt-to-system-prompt variance.

| Source | biology | business | chemistry | computer science | economics | engineering | health | history | law | math | other | philosophy | physics | psychology | total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non I-matrix @ Q5_K_M | 57.32 | 24.71 | 14.93 | 29.27 | 43.84 | 19.40 | 33.62 | 25.20 | 15.71 | 13.92 | 35.82 | 37.07 | 14.32 | 49.12 | 27.25 |
| I-Matrix @ Q5_K_M | 54.81 (-4.47%) | 22.94 (-7.42%) | 13.43 (-10.57%) | 18.78 (-43.66%) | 35.78 (-20.24%) | 19.92 (+2.64%) | 32.76 (-2.59%) | 20.47 (-20.71%) | 15.89 (+1.13%) | 14.43 (+3.59%) | 40.48 (+12.21%) | 36.07 (-2.73%) | 11.93 (-18.20%) | 46.87 (-4.68%) | 25.74 (-5.69%) |
| Bartowski @ Q5_K_M | 57.88 (+0.97%) | 20.91 (-16.65%) | 14.66 (-1.82%) | 26.34 (-10.53%) | 39.81 (-9.63%) | 20.95 (+7.68%) | 36.06 (+7.00%) | 24.41 (-3.18%) | 17.17 (+8.88%) | 13.19 (-5.38%) | 39.11 (+8.78%) | 34.27 (-7.84%) | 13.09 (-8.97%) | 48.12 (-2.05%) | 26.89 (-1.32%) |
| Lewdiculous @ Q5_K_M | 58.58 (+2.17%) | 22.81 (-7.99%) | 14.13 (-5.50%) | 23.17 (-23.26%) | 40.17 (-8.73%) | 20.74 (+6.67%) | 36.31 (+7.69%) | 20.73 (-19.46%) | 17.26 (+9.40%) | 12.14 (-13.66%) | 39.50 (+9.77%) | 35.67 (-3.84%) | 13.16 (-8.44%) | 49.75 (-1.27%) | 26.89 (-1.32%) |
| Non I-matrix @ Q5_K_M Run 2 | 58.44 (+1.93%) | 23.70 (-4.17%) | 14.93 (=) | 28.05 (-4.25%) | 43.72 (-0.27%) | 18.68 (-3.78%) | 33.74 (+0.35%) | 23.36 (-7.57%) | 15.71 (=) | 12.88 (-7.76%) | 35.61 (-0.58%) | 36.67 (-1.08%) | 14.32 (=) | 49.75 (+1.27%) | 26.99 (-0.95%) |

"+" & "-" are the percentage differences from the Non I-matrix quant's score.

The largest run-to-run outlier for the non-imat quant is -7.76%, and the overall total only moved by -0.95%.
I still can't figure out what is causing the imats to score weirdly low in some categories.
And running each model a total of 3 times to see the run-to-run variance of all the models would take another 32 hours 😶‍🌫️

I would think this is enough data to open a ticket on llama.cpp, although in the past, I was told that imatrix quants are not considered useful or important.

I opened a discussion; it's not really urgent and they're already flooded with issues (299 x.x)

Just out of curiosity: no 7B model has been able to solve the killer's problem for me yet:

"There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room ?"

Some get close to solving it, but perhaps only a 70B model could. 🙏💥

This is Phi Mini V3.0 (old version) in Q5_K_M - 3.82B

image.png

Llama3-70B VS Phi-3-Small-7B - (LMSYS Chatbot Arena)

image.png

If I had to decide which response is better, I would lean towards Phi-3-Small.

Otherwise, they wouldn't have been able to kill someone

Llama 3 assumes that someone has to already be a killer in order to kill someone else, but someone can kill another person without already being a "killer".

The results are probably biased by the high amount of academic and educational material in the Phi models' training.


PS: Even big models still mess it up

Gemini 1.5 Pro & Flash fail, Gemma2 27B Instruct fails, GPT 3.5 Turbo fails, Command R Plus fails when Command R doesn't??, Qwen 1.5 110B fails
And Zephyr wanted nothing to do with the question

image.png

Yi Large Preview's response is what I'd consider the best.

Yi-Large-Preview
Let's break down the problem step by step:
  1. Initial State: We start with three killers in the room. These individuals are known for their past actions of killing. So, at the start, we have:

    • Killer 1
    • Killer 2
    • Killer 3
  2. Someone Enters: A fourth person enters the room and kills one of the killers. Let's say this person kills Killer 2. Now, the room contains:

    • Killer 1
    • The dead body of Killer 2 (who is no longer considered a living killer)
    • Killer 3
    • The fourth person (who has now committed a killing act and can be considered a killer)
  3. Analysis of the Remaining Killers: The question asks for the number of killers left in the room, not the number of living people or the original killers. Since the fourth person has committed a killing act upon entering the room, they are now classified as a killer according to the context of the riddle. Therefore, we count:

    • Killer 1 (original)
    • Killer 3 (original)
    • The fourth person (newly classified as a killer)
  4. Conclusion: There are three killers left in the room. This includes the two original killers who did not get killed and the new person who entered and killed one of the original killers, thereby becoming a killer themselves. The dead body of Killer 2 does not contribute to the count of living killers.

The correct answer is that there are now 4 killers in the room: the 3 that were there initially, plus the one who entered the room and became a killer after killing one of them. Nobody left the room, and it doesn't matter whether the killers are dead or alive; they are still killers. LLMs often discard the dead killer, as if he disappeared or stopped being a killer simply because he is dead. Hahaha 😂🙏💥

I managed to get Sonnet in before it hit the API limit on LMSYS.

This is a tricky question that requires careful consideration. Let's break it down:

  1. Initially, there are three killers in the room.

  2. Someone enters the room and kills one of the killers.

  3. Nobody leaves the room.

  4. The person who entered is also a killer (since they killed someone).

So, to calculate the number of killers left in the room:

  • Original killers: 3
  • Killers eliminated: -1
  • New killer added: +1

Therefore, the final number of killers in the room is:
3 - 1 + 1 = 3

There are still three killers left in the room.

This question plays with our assumptions, as we might initially think the answer is two (three original killers minus one who was killed). However, it's important to recognize that the person who entered and committed the murder is also a killer, thus maintaining the count at three.

So this question still isn't solved correctly by GPT-4o & Claude 3.5 Sonnet.

Claude still can't get it if you specify how many are left historically
"There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room historically?"

Yes, thank you for bringing the Sonnet answer. I gave it to ChatGPT-4o and the answer was really bad: it said there were 2 killers left. Almost all my 7B models at least maintained 3 killers, not 2, which is even more interesting considering that ChatGPT-4 once solved it for me, answering 4 - perhaps confirming what some people are saying, that ChatGPT-4o is worse at reasoning than version 4. The 7 to 11B models here can understand perfectly if I explain that a killer continues to be a killer after being shot dead, but to my dismay ChatGPT struggled with that for many interactions until it finally understood the point. ChatGPT-4o also did not solve, in a single shot, the grandfather's-sister problem for me:

"The father of my father is called Peter. He has a sister called Nancy. What is the parental relationship between me and Nancy ?"

Answer: it should be great-aunt.

This is a simple question that most LLMs get wrong for me. Some give completely absurd answers indeed, like saying she's the aunt of my mother 😅😅👍💥
