Why is MMLU so much lower than the results reported in some papers, like LLaMA 65B?

#26
by lumosity - opened

Why is MMLU so much lower than the results reported in some papers, like LLaMA 65B?

MMLU is extremely sensitive to the exact prompting one uses. The LLaMA paper mentions that they use custom prompts, but doesn't disclose them.

Well, thanks for the reply. I ran MMLU with HELM, and the result was not much different from the paper. Maybe you can try that prompt.

Yes, I can reproduce the score from the paper with HELM. The MMLU score on this leaderboard doesn't seem right.

I've described most of the differences between the original MMLU eval and the lm-evaluation-harness implementation in this issue.
This PR fixes that and brings the numbers closer to those in the LLaMA paper: 35.1 -> 32.2 for the 7B model, 46.9 -> 46.9 for the 13B model.

I implemented the MMLU benchmark for the LLaMA paper.

Basically if your sample was:

{
    "question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    "choices": {
        "A": "0",
        "B": "4",
        "C": "2",
        "D": "6"
    },
    "answer": "B"
}

it would get turned into a prompt like this:

Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer:

Between shots, you would have "\n\n"
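
For concreteness, here is a minimal sketch of that formatting (not the paper's actual code; in particular, putting the answer letter after "Answer:" with a single space in the few-shot examples is my assumption):

def format_sample(sample, include_answer=False):
    # `sample` follows the JSON structure shown above.
    lines = [sample["question"]]
    for letter, text in sample["choices"].items():
        lines.append(f"{letter}. {text}")
    answer = f" {sample['answer']}" if include_answer else ""
    lines.append("Answer:" + answer)
    return "\n".join(lines)

def build_prompt(fewshot, sample):
    # The 5 preliminary shots include their answers; the final sample ends
    # with "Answer:" so the model fills in the letter. Shots are separated
    # by a blank line ("\n\n").
    shots = [format_sample(s, include_answer=True) for s in fewshot]
    return "\n\n".join(shots + [format_sample(sample, include_answer=False)])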

Note that, weirdly enough, this is not 100% deterministic: there may be some threading nondeterminism on the GPUs.
Also, results are sensitive to the context length. Some samples, once the 5 preliminary shots are added, are pretty long and get truncated on the left. Depending on the total context length, you could get different results.

Maybe it has something to do with prompt selection. But I don't think the absolute number really matters. Since every model uses the same prompt, the relative "goodness" is still preserved between models. The MMLU result from a not-so-carefully-selected prompt may actually be much closer to what common users experience in daily usage.

The exact issue is that the MMLU paper uses the set-up @javier-m describes:

{
    "question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    "choices": {
        "A": "0",
        "B": "4",
        "C": "2",
        "D": "6"
    },
    "answer": "B"
}

While we were using a set-up that's more in line with typical practice in NLP:

{
    "question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    "choices": {
        "A": "0",
        "B": "4",
        "C": "2",
        "D": "6"
    },
    "answer": "4"
}

As @olmer says, there's a PR that will "fix" this shortly, but I agree with @yyyoooxyz that it shouldn't matter. Or rather, if it does matter that's a condemnation of our eval benchmarks.
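
To make the difference concrete, here is a rough, self-contained sketch of the two scoring conventions: scoring the answer letter (the original MMLU set-up) versus scoring the full answer text (the set-up we were using). The model name and the loglikelihood helper below are illustrative, not the harness code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model, not LLaMA
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def loglikelihood(prompt, continuation):
    # Sum of log P(continuation tokens | prompt) under the model.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1):
        total += logprobs[pos, input_ids[0, pos + 1]].item()
    return total

choices = {"A": "0", "B": "4", "C": "2", "D": "6"}

def predict_letter(prompt):
    # Original MMLU set-up: compare the likelihood of " A", " B", " C", " D".
    return max(choices, key=lambda c: loglikelihood(prompt, " " + c))

def predict_text(prompt):
    # The other set-up: compare the likelihood of the answer texts instead.
    return max(choices, key=lambda c: loglikelihood(prompt, " " + choices[c]))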

There is another important thing: how you do prompt truncation. Prompts should be left-truncated, so they end with the final question, not right-truncated; otherwise the model does not see what it is supposed to answer.
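
A minimal sketch of left-truncation at the token level (the function name is mine):

def left_truncate(token_ids, max_len):
    # Keep the LAST max_len tokens so the final question and "Answer:" survive;
    # right-truncation would cut off exactly the part the model needs to answer.
    return token_ids[-max_len:]

With Hugging Face tokenizers, setting tokenizer.truncation_side = "left" achieves a similar effect.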

I guess smaller models depend very much on parameters like top_p, typical_p, temperature, etc. It might be that the people who wrote the paper didn't want to disclose them for some reason, while running the model with default parameters gives much worse results. I have noticed that some 30B models are extremely sensitive to even the slightest change in those parameters. So while the reported values might not be false, the authors may be holding back some "know-how" about their models and how to get the best results out of them. On the other hand, they could just be wrong, or have a bug, or whatever. There have been scientific articles that weren't true, and it took time to discover they were exaggerated or simply false.

Hugging Face H4 org

Hi @lumosity !
We published a blog post about these discrepancies, does it answer your questions?

Yes, this is a good article. When evaluating MMLU, I have only used the evaluation tools in HELM. In that tool, the randomness of top_p and temperature is eliminated, but there may still be some closed-source models where randomness cannot be controlled.
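
For open models, that kind of sampling randomness can be switched off explicitly, e.g. by decoding greedily with transformers (a sketch with a placeholder model, not HELM's internals):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Question: 2 + 2 = ?\nAnswer:", return_tensors="pt")
# do_sample=False means greedy decoding: top_p and temperature play no role,
# so repeated runs give the same output (up to low-level GPU nondeterminism).
out = model.generate(**inputs, do_sample=False, max_new_tokens=5)
print(tok.decode(out[0], skip_special_tokens=True))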

Actually, a more important question is how to reasonably evaluate the results of LLMs and use that to guide our training. In May, I ran experiments on most of the datasets in HELM using LLaMA and found that LLaMA 65B does not necessarily outperform LLaMA 30B on many tasks. Even after SFT with data like Alpaca, the model is not necessarily better than it was after pretraining. This has been a source of confusion for our team, but I am glad to see that many people are working on this aspect.

clefourrier changed discussion status to closed
