Issue with LLaMA MMLU results

#63
by yahma - opened

According to https://twitter.com/Francis_YAO_/status/1667245675447468034

The LLaMA MMLU results are much lower than those reported in the LLaMA paper.
https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU

Hello @clefourrier and @yahma,

The lm-evaluation-harness commit used to evaluate models on this Open LLM Leaderboard is 441e6ac. Evaluating at a later commit gives all LLaMA-based models a boost (around 5 points on average). The previous evaluation is not fair to LLaMA since there was a tokenization issue in the lm-evaluation-harness. @clefourrier can you please look into it? The leaderboard would change significantly. Should I open a new issue for this? To give you a sense, LLaMA-65B would score 63.7 on the leaderboard with the tokenization fix, which is higher than Falcon-40B-Instruct. The MMLU results obtained after the fix are consistent with the LLaMA paper. Guanaco-65B is the best model I have evaluated so far, reaching 66.8.

Hi @itanh0b and @yahma!
We've been doing some work to investigate the discrepancies for the MMLU scores, and we'll talk about it very soon, stay posted 🤗

@itanh0b just to be sure, could you elaborate on the tokenization problem you are referring to?

Ideally the leaderboard would be updated to use results from the fixed lm-evaluation-harness. Falcon-40B performing better than the LLaMA models appears to reflect the MMLU scoring issue in older commits of lm-eval-harness rather than Falcon actually outperforming LLaMA. Having accurate, repeatable evaluations of the open models will benefit the open-source community.

@clefourrier I'm not exactly sure how to explain it, as I'm not an expert, but it seems to be related to how the context and the continuation are encoded. Here is the pull request that fixes it. It's still unclear to me why it only affects LLaMA models.
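
To make the boundary effect concrete, here is a minimal sketch (the checkpoint name is just an example, and this is not the harness's actual code): with LLaMA's SentencePiece tokenizer, encoding the context and the continuation separately can yield different tokens than encoding the concatenated string, which changes which continuation tokens get scored.

```python
# Illustrative sketch of the context/continuation boundary issue
# (not the lm-evaluation-harness implementation).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # example checkpoint

context = "Question: What is the capital of France?\nAnswer:"
continuation = " Paris"

separate = (
    tok.encode(context, add_special_tokens=False)
    + tok.encode(continuation, add_special_tokens=False)
)
joint = tok.encode(context + continuation, add_special_tokens=False)

# With LLaMA's tokenizer these two token sequences can differ at the boundary,
# which shifts the tokens (and hence the log-likelihood) scored for the continuation.
print(separate == joint)
```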

@clefourrier the author of the commit gave me a detailed response about the tokenization problem in the lm-evaluation-harness commit used for the Open LLM Leaderboard, which puts LLaMA models at a disadvantage. Please check it out here.

Hugging Face H4 org

Thank you!

After these two PRs were merged to the master branch, the MMLU scores (especially for LLaMA-based models) look normal now.
https://github.com/EleutherAI/lm-evaluation-harness/pull/531
https://github.com/EleutherAI/lm-evaluation-harness/pull/497
Would you consider re-evaluating all models on MMLU?

The latest code from Yao gives LLaMA 65B an MMLU score of about 63, while Falcon 40B only gets about 49, far lower than what the leaderboard shows.
Yao's code is available here: https://t.co/peXlNYcMux

Hugging Face H4 org

Hi @DanielTTY - Yao's code is sadly not equivalent to the original evaluation, as it runs MMLU generatively instead of comparing the log-likelihoods of the answer choices.
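
For readers following along, here is a rough sketch of what scoring by choice log-likelihoods looks like (illustrative only; the model name and prompt are placeholders, and this is neither the harness's nor Yao's actual code): each answer choice is scored by the log-probability the model assigns to it after the prompt, and the highest-scoring choice is taken as the prediction, instead of generating free-form text and parsing an answer letter out of it.

```python
# Minimal sketch of log-likelihood multiple-choice scoring (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def choice_loglikelihood(prompt: str, choice: str) -> float:
    """Sum of log-probs the model assigns to the tokens of `choice`, given `prompt`."""
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    # Note: measuring the prompt length by tokenizing the prompt on its own is
    # exactly where the boundary issue discussed above can bite.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, 1:]
    cont = slice(prompt_len - 1, None)  # keep only the continuation tokens
    return log_probs[cont].gather(1, targets[cont].unsqueeze(1)).sum().item()


prompt = "Question: What is 2 + 2?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
choices = [" A", " B", " C", " D"]
prediction = max(choices, key=lambda c: choice_loglikelihood(prompt, c))
print(prediction)
```

A generative evaluation instead asks the model to write out an answer and parses it from the text, which also measures instruction following and output formatting, so the two setups are not directly comparable.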

Hi @wjfwzzc, we'll make an announcement later this week, stay tuned!

Hugging Face H4 org

Hi @wjfwzzc, @yahma and @itanh0b!
Did you see our blog post on the MMLU discrepancies, and did it answer your queries?

Hugging Face H4 org

The leaderboard has now been updated with the correct results, closing this issue :)
Feel free to reopen if you have more questions

clefourrier changed discussion status to closed
