open-llm-leaderboard/open_llm_leaderboard · MMLU blog post discussion

Open LLM Leaderboard org Jun 23, 2023

This is a discussion page for the blog post diving into the various ways MMLU can be evaluated (in particular for the Falcon and LLaMA models), available at https://huggingface.co/blog/evaluating-mmlu-leaderboard

vgoklani

Jun 23, 2023

Is there a script/code to regenerate all the metrics from the blog post? thanks!

espadrine

Jun 23, 2023

Ideally, a good test should be realistic, unambiguous, luckless, and easy to understand. Showing fairness is easier to do by the negative:

If a model passes a question but would never give the right answer when asked in a chat, then the test is not realistic. So HELM’s rejecting an answer if it is not the highest-probability one is reasonable.
If a model sometimes had a high pass rate and sometimes a low one, its result would be ambiguous. So realism should not go all the way to using normal sampling like nucleus sampling. Yet…
If a model passes a question but its answer in a chat would be basically random, then the test is lucky. So the test should account for how close the probabilities of the answers are: if they are all near-equal, but the right one is imperceptibly higher, then that should be taken into account.
Besides, if a test result makes it unclear just how bad it is, then it is harder to understand. NeoX’s 25% could be mistaken for an OK score, but on four-choice questions it is no better than random guessing.

What if we averaged the probability of the right answer across tasks?

The result would be on a clear 0-to-100 scale (0% is bad, 100% is good).
Uncertainty between answers (nearby probabilities) would negatively impact the score.
It is also clearer, making it less likely that people would implement it differently (apart from the few-shot variations).
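
To make the proposal concrete, here is a minimal sketch of the metric next to plain accuracy. It rests on assumptions not stated above: each question comes with one log-probability per answer option, those are renormalized over the options with a softmax, and the average is taken over questions rather than over tasks. The function names and the toy numbers are hypothetical.

```python
import math

def softmax(log_probs):
    """Renormalize per-option log-probabilities so they sum to 1."""
    m = max(log_probs)
    exps = [math.exp(lp - m) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]

def score(questions):
    """questions: list of (option_log_probs, index_of_right_answer).

    Returns (accuracy, mean probability assigned to the right answer).
    """
    hits = 0
    prob_sum = 0.0
    for log_probs, gold in questions:
        probs = softmax(log_probs)
        best = max(range(len(probs)), key=probs.__getitem__)
        hits += int(best == gold)
        prob_sum += probs[gold]
    n = len(questions)
    return hits / n, prob_sum / n

# Toy data (made up): one barely-right answer, one confident right answer.
toy = [
    ([math.log(p) for p in (0.26, 0.25, 0.25, 0.24)], 0),
    ([math.log(p) for p in (0.97, 0.01, 0.01, 0.01)], 0),
]
accuracy, mean_prob = score(toy)
print(f"accuracy={accuracy:.3f} mean_prob={mean_prob:.3f}")
```

On this toy data both questions count as passes for accuracy, while the near-tie on the first question pulls the averaged probability down, which is the "luck" effect described above.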

Luna4444

Jun 23, 2023

edited Jun 23, 2023

I see that models from EleutherAI/gpt-neox-20b are good if evaluated with HELM (Harness), and almost all of the next models follow the same trend. This means the models are better at predicting the probabilities of the whole answer than of the option alone (from what I understand from the article). Is there any reason for that? I find it quite interesting.

GrahamxReed

Jun 24, 2023

There's a spelling error for the word 'implementation'. Didn't catch anything else. Good article! :)

"MMLU comes in all shapes and sizes: Looking at the prompts
Let’s compare an example of prompt each benchmark sends to the models by each implmentation for the same MMLU dataset example:"

soujanyaporia

Jun 24, 2023

Great article! We experienced something similar while developing InstructEval (https://declare-lab.net/instruct-eval/). The code is here: https://github.com/declare-lab/instruct-eval

vince62s

Jun 25, 2023

In your detailed number ranking, with the original MMLU implementation, llama30B is better than falcon40B, so in the map it should be #2, not #3.

russellsparadox

Jun 27, 2023

I now see HELM as a broken evaluation. Indeed, most LLMs tend to respond in a conversational tone, so it's bizarre to expect that the first generated token will be a choice number.

Another way to select the answer from the output of LLMs would be via k-nearest neighbors (kNN): we just generate a text from the LLM and then see which answer is closest to it.
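
As a rough illustration of that idea, here is a sketch of a 1-nearest-neighbor match between a generated reply and the answer options. It assumes bag-of-words cosine similarity as a stand-in for whatever text embedding one would really use; the helper names and the example strings are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a real text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_choice(generated, choices):
    """Index of the answer option most similar to the generated text."""
    gen_vec = bag_of_words(generated)
    sims = [cosine(gen_vec, bag_of_words(c)) for c in choices]
    return max(range(len(choices)), key=sims.__getitem__)

choices = ["Paris", "London", "Berlin", "Madrid"]
reply = "Sure! The capital of France is Paris, of course."
print(nearest_choice(reply, choices))
```

A caveat with this kind of matching: a reply that mentions several options ("not London, but Paris") can land on the wrong one, and if no option overlaps with the reply, the first option wins the tie, so a real version would need to handle such cases explicitly.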

Baruch

Jun 28, 2023

I see that you say that you are using EleutherAI's harness for the MMLU benchmark on the Open LLM leaderboard. Is that the (relatively new) original implementation in the harness or is it using EleutherAI's "updated" version? Thanks

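| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.4795 | ± | 0.0146 |
| | | acc_norm | 0.5111 | ± | 0.0146 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.4334 | ± | 0.0145 |
| | | acc_norm | 0.4787 | ± | 0.0146 |

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 0.5307 | ± | 0.0146 |
| | | acc_norm | 0.5631 | ± | 0.0145 |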