MMLU blog post discussion

#82
opened by thomwolf (HF staff)
Hugging Face H4 org

This is a discussion page for the blog post diving into all the various ways MMLU can be evaluated (in particular for the Falcon and LLaMA models), available at https://huggingface.co/blog/evaluating-mmlu-leaderboard

Is there a script/code to regenerate all the metrics from the blog post? thanks!

Ideally, a good test should be realistic, unambiguous, luck-free, and easy to understand. It is easier to show fairness by the negative:

  1. If a model passes a question, but if you asked it in a chat, it would never give the right answer, then the test is not realistic. So HELM’s rejecting an answer if it is not the highest-probability one is reasonable.
  2. If a model sometimes had a high pass rate and sometimes a low one, its result would be ambiguous. So realism should not go all the way to using normal sampling like nucleus sampling. Yet…
  3. If a model passes a question, but if you asked in a chat the answer would be basically random, then the test result is down to luck. So the test should account for how close the probabilities of the answers are: if they are all near-equal, but the right one is imperceptibly higher, that should be taken into account.
  4. Besides, if a test result makes it unclear just how bad it is, then it is harder to understand. NeoX’s 25% could be mistaken for an OK score, but it is essentially as good as random guessing.

What if we averaged the probability of the right answer across tasks?

  • The result would be on a clear 0-100% scale (0% is bad, 100% is good).
  • Uncertainty between answers (nearby probabilities) would negatively impact the score.
  • It is also clearer, making it less likely that people would implement it differently (apart from the few-shot variations).
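A minimal sketch of that averaged-probability metric, assuming the per-choice log-probabilities are already available (the names and numbers below are purely illustrative):

```python
import math

def softmax(logprobs):
    # Turn per-choice log-probabilities into a normalized distribution.
    m = max(logprobs)
    exps = [math.exp(lp - m) for lp in logprobs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_gold_probability(examples):
    """examples: list of (choice_logprobs, gold_index) pairs.
    Returns the average probability mass placed on the gold answer,
    on a 0-100% scale (25% is the random baseline for 4 choices)."""
    probs = [softmax(logprobs)[gold] for logprobs, gold in examples]
    return 100 * sum(probs) / len(probs)

# Illustrative data: two 4-choice questions.
examples = [
    ([-1.2, -0.3, -2.0, -1.8], 1),     # confident and correct
    ([-1.40, -1.38, -1.39, -1.41], 1), # near-uniform: barely above 25%
]
print(mean_gold_probability(examples))  # about 40: the near-random answer drags the score down
```

Under such a scheme, a model that is right but only barely so gets little credit, which matches points 3 and 4 above.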

I see that models from EleutherAI/gpt-neox-20b are good if evaluated with HELM (Harness), and almost all of the following models show the same trend. This means the models are good at predicting the probabilities of the whole answer rather than the option (from what I understand from the article). Is there any reason for that? I find it quite interesting.

There's a spelling error for the word 'implementation'. Didn't catch anything else. Good article! :)

"MMLU comes in all shapes and sizes: Looking at the prompts
Let’s compare an example of prompt each benchmark sends to the models by each implmentation for the same MMLU dataset example:"

Great article! We experienced something similar while developing InstructEval (https://declare-lab.net/instruct-eval/). Code is here: https://github.com/declare-lab/instruct-eval

In your detailed number ranking with the MMLU original implementation, llama-30B is better than falcon-40B, so on the map it should be #2, not #3.

I now see HELM as a broken evaluation. Indeed, most LLMs tend to have a conversational tone in their responses, so it's strange to expect the first generated token to be a choice label.

Another way to select the answer from an LLM's output would be via kNN: generate text from the LLM and then see which answer choice is closest to it.
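For illustration, here is a rough sketch of that nearest-answer idea using sentence embeddings; the sentence-transformers encoder and the example data are assumptions for the sketch, not anything the leaderboard or the harness actually does:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder encoder; any sentence-embedding model would do.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def pick_closest_choice(generated_text, choices):
    """Return the index of the answer choice whose embedding is nearest
    (highest cosine similarity) to the model's free-form output."""
    emb_gen = encoder.encode(generated_text, convert_to_tensor=True)
    emb_choices = encoder.encode(choices, convert_to_tensor=True)
    sims = util.cos_sim(emb_gen, emb_choices)[0]
    return int(sims.argmax())

# Hypothetical usage with a chatty model output:
choices = ["Paris", "Lyon", "Marseille", "Toulouse"]
print(pick_closest_choice("Sure! The capital of France is Paris.", choices))  # 0
```

This is essentially 1-nearest-neighbor matching in embedding space; how well it works would depend heavily on the chosen encoder.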

I see that you say that you are using EleutherAI's harness for the MMLU benchmark on the Open LLM leaderboard. Is that the (relatively new) original implementation in the harness or is it using EleutherAI's "updated" version? Thanks

Hugging Face H4 org

@Baruch At the moment, the numbers displayed on the leaderboard reflect the January implementation of MMLU in the Harness - we are re-running the whole leaderboard as we speak with the fix introduced, in order to display correct results as soon as possible :)

@clefourrier and team, thank you for your work on the Open LLM Leaderboard. I find the scores helpful when comparing models. Your article clearly outlines the issues regarding scoring MMLU.

I, for one, would find it beneficial:

  • if you also showed the baseline score for each test (25% for MMLU is as good as a guess), and the corresponding human baselines (non-expert and expert).
  • if you added other benchmarks (for example BBH), if possible broken down by what each test is testing
    (e.g. MMLU tests primarily for knowledge recall, not so much reasoning).

As for MMLU scoring, what I find most intuitive is to give a model a score as one would for a human.

The reason is that, for me, multiple choice is a two-step process: if a person knows the answer, he compares it with the list of choices and ticks off the corresponding label (A, B, C, or D). If he does not know the answer, he may well guess what he thinks is the best of the given choices. To discourage the latter, a better test would include "none of the above" as a choice.

Currently, it seems you are choosing between different implementations (original, HELM, or AI Harness). May I suggest one that takes into account your findings, yet is intuitive?

Grant a point:

  1. for a choice label (as HELM does), for a full answer (as the AI Harness does),
    or if a model's answer is semantically the same as the gold answer
    (if the gold answer is 'A. It damaged support for the US model of political economy and capitalism', I would award a point for the answer 'It damaged support for America's political economy and capitalism'). Of course, this makes the evaluation even more complicated to implement.
  2. and always include a facility for the model to answer "none of the above" (same as 'Zygote' in your article),
    perhaps in the instruction or few-shot prompt.

We could call this MMLU-fair (HF implementation). I'd be happy to help however I can.
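To make the proposal concrete, a minimal sketch of what such a scoring rule could look like; the encoder, the 0.85 threshold, and the label normalization are placeholder assumptions, not an existing implementation:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def mmlu_fair_point(model_answer, gold_label, gold_text, threshold=0.85):
    """Grant a point if the model outputs the gold choice label (HELM-style),
    the gold full answer (Harness-style), or something semantically
    equivalent to the gold answer. 'None of the above' would be handled
    in the prompt, so it is not scored here."""
    answer = model_answer.strip()
    # 1) Bare choice label such as "A", "A." or "A)"
    if answer.rstrip(".):").strip().upper() == gold_label.upper():
        return 1
    # 2) Exact full answer
    if answer.lower() == gold_text.strip().lower():
        return 1
    # 3) Semantic equivalence via embedding similarity
    sim = util.cos_sim(encoder.encode(answer, convert_to_tensor=True),
                       encoder.encode(gold_text, convert_to_tensor=True)).item()
    return 1 if sim >= threshold else 0
```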

Hugging Face H4 org

Hi @i-am-neo !
Thank you for your comments about the article and suggestions regarding the leaderboard! 🤗

There is a line called "Baseline" at the bottom of the leaderboard, which includes the results of a random baseline. Adding human baselines is a nice idea, I'm adding it to our future modifications.

Regarding your MMLU-fair, though I like the idea, we will keep on using the Harness implementation at the moment, because it is very important to ensure reproducibility by all users: we want people to be able to checkout our current version of the Harness and get the same results as us. However, if we experiment with such a metric at one point, I'll be sure to post about it!

Thank you for the interesting blog!
I read the following comment on GitHub. HELM seems to evaluate only 5 subjects of MMLU.

but we currently only evaluate 5 of the MMLU subjects, rather than doing a complete evaluation.
https://github.com/stanford-crfm/helm/issues/1698#issuecomment-1615475461

Are the scores of HELM in this blog calculated based on all subjects of MMLU? (My apologies if I have missed something.)

Hugging Face H4 org
edited Jul 7, 2023

@takkii Thank you for your interest!
Yes, we evaluated on all subjects with HELM too.

clefourrier pinned discussion

I know this is a bit off topic, but I'm not sure where else to ask it, and it is at least related to the leaderboard and definitions thereof.

I was hoping to compare the new MPT-30B model to ones on the leaderboard using the metrics in their announcement post (https://www.mosaicml.com/blog/mpt-30b), and I realized that your documentation for the ARC tests doesn't specify whether you're using the easy questions, the challenge questions, or a mix.

The evaluation harness task list lists the easy and challenge separately: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md, and the code also has that separation: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/arc.py - is the leaderboard reporting one, the other, or both of those task results, somehow combined?

Please disregard my question above - I figured it out once I read the "how to reproduce" instructions and saw the arc-challenge in there.

new question: for leaderboard tests like ARC and HellaSwag, is the reported performance the acc or the acc_norm (https://blog.eleuther.ai/multiple-choice-normalization/)?
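For context, my understanding of the difference between the two metrics, following the linked EleutherAI post: acc picks the choice with the highest raw summed log-likelihood, while acc_norm divides that log-likelihood by the byte length of the choice text first. A rough illustrative sketch (not the harness's actual code):

```python
def pick_choice(choice_logliks, choice_texts, normalize=False):
    """choice_logliks: summed log-likelihood of each answer continuation.
    With normalize=True (acc_norm-style), long answers are not penalized
    simply for containing more tokens/bytes."""
    if normalize:
        scores = [ll / len(t.encode("utf-8"))
                  for ll, t in zip(choice_logliks, choice_texts)]
    else:
        scores = list(choice_logliks)
    return max(range(len(scores)), key=lambda i: scores[i])
```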

Did you guys re-run only MMLU? Because some other scores are off.
Example, arc_challenge:
huggyllama/llama-7b (llama7b)
{
  "results": {
    "arc_challenge": {
      "acc": 0.47952218430034127,
      "acc_stderr": 0.014599131353035012,
      "acc_norm": 0.5110921501706485,
      "acc_norm_stderr": 0.01460779491401305
    }
  },
  "versions": {
    "arc_challenge": 0
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=huggyllama/llama-7b,trust_remote_code=True,dtype=float16",
    "num_fewshot": 25,
    "batch_size": null,
    "batch_sizes": [],
    "device": null,
    "no_cache": true,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
hf-causal (pretrained=huggyllama/llama-7b,trust_remote_code=True,dtype=float16), limit: None, provide_description: False, num_fewshot: 25, batch_size: None

Task           Version  Metric    Value    Stderr
arc_challenge        0  acc       0.4795  ± 0.0146
                        acc_norm  0.5111  ± 0.0146

==> your table reports 46.6

Falcon7b
{
  "results": {
    "arc_challenge": {
      "acc": 0.4334470989761092,
      "acc_stderr": 0.014481376224558895,
      "acc_norm": 0.4786689419795222,
      "acc_norm_stderr": 0.014598087973127102
    }
  },
  "versions": {
    "arc_challenge": 0
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=/media/vincent/Extreme SSD/dataAI/Falcon7B/HF,trust_remote_code=True,dtype=bfloat16",
    "num_fewshot": 25,
    "batch_size": null,
    "batch_sizes": [],
    "device": null,
    "no_cache": true,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
hf-causal (pretrained=/media/vincent/Extreme SSD/dataAI/Falcon7B/HF,trust_remote_code=True,dtype=bfloat16), limit: None, provide_description: False, num_fewshot: 25, batch_size: None

Task           Version  Metric    Value    Stderr
arc_challenge        0  acc       0.4334  ± 0.0145
                        acc_norm  0.4787  ± 0.0146

==> your table reports 48.1

The MMLU score for llama-7b is also wrong.

@emilyva All the scores selected are in the about section :)

@vince62s Thank you for your interest! No, we re-ran the full suite to make sure we had correct results for the version of the harness we use.
Where do your scores come from? If you ran them yourself, did you make sure to 1) use the same version of the harness as us (it's in the blog post) and 2) use batch size 1?

Which llama-7b did you rescore? There is no way MMLU on the original llama-7b can be 38.3;
even the paper shows 35.1, and all other rescorings showed ~35.

Yes, I rescored it myself with the current master.

Hi! Our full results are available here - could you make your comparison using the same parameters (model hash, batch size at 1, and this commit [edit: this commit] of the harness)?
We notably observed that changing the batch size can affect the results in a non-negligible way.

This is very confusing. That commit is quite old (Apr 25), and the whole story is that fixes have been issued since; @thomwolf 's blog was all about using a recent version with the fixes, notably PR #497 of the harness.

Am I wrong about something ?

Hugging Face H4 org

@vince62s Sorry, my bad, I corrected the commit link - the first one above was the first commit where we observed discrepancies; the second one (from June) is the one we use.

OK, let's take a single example: MMLU hendrycksTest-world_religions // Falcon-7B

This is in your results:
"harness|hendrycksTest-world_religions|5": {
"acc": 0.6081871345029239,
"acc_stderr": 0.03743979825926398,
"acc_norm": 0.631578947368421,
"acc_norm_stderr": 0.036996580176568775
}

I check out the harness at the commit above, then run:
python main.py --model hf-causal --model_args pretrained="/dataAI/Falcon7B/HF",trust_remote_code=True,dtype=bfloat16,batch_size=1 --tasks "hendrycksTest-world_religions" --no_cache --write_out --num_fewshot 5

Results:
{
  "results": {
    "hendrycksTest-world_religions": {
      "acc": 0.3567251461988304,
      "acc_stderr": 0.03674013002860954,
      "acc_norm": 0.3567251461988304,
      "acc_norm_stderr": 0.03674013002860954
    }
  },
  "versions": {
    "hendrycksTest-world_religions": 1
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=/dataAI/Falcon7B/HF,trust_remote_code=True,dtype=bfloat16,batch_size=1",
    "num_fewshot": 5,
    "batch_size": null,
    "batch_sizes": [],
    "device": null,
    "no_cache": true,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}

The only thing that differs is "version", which is "1" in my case and "0" in your results.

Hugging Face H4 org

That's quite interesting - I'll check this in more depth on Monday, thanks for raising it!

@clefourrier I did finally find the metrics in the About section further down. I was surprised that TruthfulQA uses mc2, since that score was surprisingly high for a very small model I tested locally (flan-t5-small), whereas the mc1 score was much more in line with what I would have expected from the other benchmarks... I guess I'm trying to say... are you sure it's mc2? :) (And also, why mc2 vs. mc1?)

Added: well, I replicated the EleutherAI/pythia-70m TruthfulQA score on my computer so now I believe that the mc2 in the About text is correct. It's interesting that the smallest pythia models do best on TruthfulQA, and their scores go down as the model size increases... maybe that's just because the smaller models can't overfit on human misconceptions, compared to the larger models...
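For reference, a rough sketch of how I understand the two TruthfulQA multiple-choice metrics from the original paper (illustrative only, not the harness code):

```python
def truthfulqa_mc1(choice_probs, true_indices):
    """mc1: 1.0 if the single highest-probability reference answer is a true one."""
    best = max(range(len(choice_probs)), key=lambda i: choice_probs[i])
    return 1.0 if best in true_indices else 0.0

def truthfulqa_mc2(choice_probs, true_indices):
    """mc2: probability mass on the true answers, normalized over all
    (true + false) reference answers, so it is a soft score in [0, 1]."""
    total = sum(choice_probs)
    return sum(choice_probs[i] for i in true_indices) / total
```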

That's quite interesting - I'll check this in more depth on Monday, thanks for raising it!

https://github.com/EleutherAI/lm-evaluation-harness/commit/3a424af4b1ed8191c179cb037ca09071ac0e92a1

I guess this is part of the answer, for MMLU at least.

@vince62s The commit we use comes after this one, so the MMLU version we launch should be 1 (and it is actually 1 in all the recently uploaded files...) - I wonder if there might have been a mix-up between old and new saved files, since we restructured the way they are saved very recently.
In any case, I'll check it asap next week as it would be quite a big problem if the numbers we report are the wrong MMLU version for some models. Thank you very much for being so thorough!
( @SaylorTwift FYI)

Hi! Our full results are available here - could you make your comparison using the same parameters (model hash, batch size at 1, and this commit [edit: this commit] of the harness)?
We notably observed that changing the batch size can affect the results in a non-negligible way.

I use the same commit you specified:

python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=models/llama-13b-hf \
    --tasks arc_challenge \
    --num_fewshot 25 

I get the results below, which are higher than the value 50.8 (llama-13b/ARC) on https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard:

Task           Version  Metric    Value    Stderr
arc_challenge        0  acc       0.5307  ± 0.0146
                        acc_norm  0.5631  ± 0.0145

Can you give me any advice?

Hi @Linbo , which hub model do you use?
Regarding the tokenization improvement, yes, if it's included in the harness or the llama tokenizers we'll likely include it too

Hugging Face H4 org

@vince62s Thank you very much for your investigation and interest, I'm very grateful you identified this problem.
From what I discovered, for 80 models the old experiment results (with the wrong MMLU scores) were stored instead of the new experiment results. I removed them from the view (but not from the dataset, since the version is clearly displayed there, as you noticed).
We'll update the leaderboard as soon as possible with the correct files for all these models.

I would check at least arc_challenge and TruthfulQA too, since I saw some discrepancies there as well.

Hi @Linbo , which hub model do you use?
Regarding the tokenization improvement, yes, if it's included in the harness or the llama tokenizers we'll likely include it too

llama-13b. I found the problem is the tokenization improvement: when I remove this modification, the scores match: https://github.com/EleutherAI/lm-evaluation-harness/commit/23f30926f3ce738e3eee4e6be5c29fb3467e3a6e
So I guess your code doesn't include this commit.

@Linbo which llama-13b? :)
Thank you for the link to this commit, I will take a deeper look

Hugging Face H4 org

@vince62s For arc challenge it's likely an unrelated problem linked to tokenization - for TruthfulQA I'd be super interested if you had an example

@Linbo which llama-13b? :)

The Meta llama-13b that I converted to HF format. I tested https://huggingface.co/lmsys/vicuna-13b-v1.3 as well: Open LLM Leaderboard ARC=49.2 vs. my result ARC=53.

Hugging Face H4 org

Thank you! Investigating


Btw, I checked the papers, and I think the current scores on the Open LLM Leaderboard (without the tokenization improvement) are more aligned with the original ones in the papers.

Hugging Face H4 org

@Linbo and @vince62s thank you very much for your explorations!
We redefined the launcher class for HF models (where we added logic to make it faster), and we accidentally overrode the function mentioned in @Linbo 's message above.

We're going to update llama models as fast as possible! 🚀 Thank you! 🤗


With this tokenization improvement, the llama scores will be higher than in the original paper (and the original benchmark implementation as well). Any idea about this?

In a sense, it does not matter for us (the leaderboard), as what we are most concerned with is reproducibility.

For people choosing models, I guess it's interesting to know that if you want performant models on these tasks, llama-based models are a good choice, but said performance is highly dependent on small nits in tokenization, which may make them more fickle if not managed well.

@Linbo I'm not following you. Which scores will be higher than the paper?


For meta/llama-13b

  1. Open LLM Leaderboard without the tokenization improvement: ARC = 50.8
  2. with the tokenization improvement: ARC = 56.3
  3. from the llama paper: ARC = 52.7

I think it's similar for other models, based on this tweet: https://twitter.com/itanih0/status/1679546665777111048

Oh OK, I thought you were talking about MMLU.

Are you sure it's comparable? The paper says "zero-shot";
arc_challenge on the leaderboard is 25-shot, I think.


You are right, my bad. Haha, the more truth is debated, the clearer it becomes.

The same applies to the "timdettmers/guanaco-65b-merged" model in the leaderboard. There is just no way this model has a 32 Avg. score; that's almost half the value of a 7B model. I think some tests might have issues. The score doesn't make any logical sense.

Hugging Face H4 org

@Wubbbi
Yep it's likely the problem affects all llama-based models (llama, guanaco, alpaca, ...) - we're doing our best to fix this asap :)


@clefourrier Have you added the above tokenization improvement for the following result?

[screenshot of the Llama 2 model scores on the leaderboard]

@Linbo Yes, the Llama 2 model scores are correct and should be reproducible using the Harness.
They were launched after the debugging at the end of last week :)
(Plus, people at Meta told us we were "in range" ^^)


What happens to the models that have wrong scores? Will they be re-evaluated? Does that happen automatically? Do they have to be submitted again?

Hugging Face H4 org

We are re-running all the llama-based models as we speak; however, if you fear that your model is not being re-run, please open an issue and tag @SaylorTwift and me, and we'll take care of it asap.

@clefourrier For the MMLU score reported on the leaderboard, the reproducibility section says it's the acc of all, but it doesn't indicate whether that accuracy is arrived at by averaging the individual task accuracies, by a weighted average based on the total number of items per task, or something else... which is Hugging Face doing, please? (If there's a code repo for the leaderboard somewhere that I could be looking at instead of asking these questions, please point me there!)

It looks like the original MMLU code (https://github.com/hendrycks/test/blob/master/evaluate.py) does a weighted average of the items within each subject area (across the tasks grouped within that subject area), but given that the number of items isn't part of the lm-evaluation-harness output for the hendrycksTest tasks, it seems less obvious how to weight the results for the various tasks using that output.

Thanks for any insight into this that you can share!

Hugging Face H4 org

@emilyva The score is just a simple average of the individual task accuracies :)
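For illustration, the difference between that simple average and a question-count-weighted average (the subject accuracies and question counts below are made up):

```python
subjects = {
    # subject: (accuracy, number of questions) - made-up numbers
    "world_religions": (0.63, 171),
    "abstract_algebra": (0.30, 100),
    "professional_law": (0.41, 1534),
}

# Simple average over subjects (what the leaderboard reports, per the answer above)
simple = sum(acc for acc, _ in subjects.values()) / len(subjects)

# Weighted average by question count (closer to the original MMLU evaluate.py)
weighted = (sum(acc * n for acc, n in subjects.values())
            / sum(n for _, n in subjects.values()))

print(f"simple={simple:.3f}, weighted={weighted:.3f}")  # 0.447 vs 0.425
```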

@clefourrier That was my guess (especially after comparing that result to the leaderboard published result for one of the models), but thanks for confirming!

Hugging Face H4 org

Closing due to inactivity (but it's linked in the resources tab for archival purposes)

clefourrier changed discussion status to closed
