Issue with LLaMA MMLU results

#63
by yahma - opened

According to https://twitter.com/Francis_YAO_/status/1667245675447468034

The LLaMA MMLU results are much lower than those reported in the LLaMA paper.
https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU

Hello @clefourrier and @yahma,

The lm-evaluation-harness commit used to evaluate models on this Open LLM Leaderboard is 441e6ac. Evaluating at a later commit gives all LLaMA-based models a boost (around 5 points on average). The previous evaluation is not fair to LLaMA since there was a tokenization issue in the lm-evaluation-harness. @clefourrier can you please look into it? The leaderboard would change significantly. Should I open a new issue for this? To give you a sense, LLaMA-65B would score 63.7 on the leaderboard with the tokenization fix, which is higher than Falcon-40B-Instruct. The MMLU results obtained after the fix are consistent with the LLaMA paper. Guanaco-65B is the best model I have evaluated so far, reaching 66.8.

Hi @itanh0b and @yahma!
We've been doing some work to investigate the discrepancies for the MMLU scores, and we'll talk about it very soon, stay posted 🤗

@itanh0b just to be sure, could you elaborate on the tokenization problem you are referring to?

Ideally the leaderboard would be updated to use results from the fixed lm-evaluation-harness. Falcon-40B performing better than the LLaMA models appears to reflect the MMLU scoring issue in older commits of lm-eval-harness rather than Falcon actually outperforming LLaMA. Having accurate, repeatable evaluations of the open models will benefit the open-source community.

@clefourrier I'm not exactly sure how to explain it, as I'm not an expert, but it seems to be related to how the context and the continuation are encoded. Here is the pull request that fixes it. It's still unclear to me why it only affects LLaMA models.
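
To make the boundary effect concrete, here is a minimal sketch (the checkpoint name is just an example, and this is not the harness's actual code): with LLaMA's SentencePiece tokenizer, encoding the context and the continuation separately can yield different tokens than encoding the concatenated string, which changes which continuation tokens get scored.

```python
# Illustrative sketch of the context/continuation boundary issue
# (not the lm-evaluation-harness implementation).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # example checkpoint

context = "Question: What is the capital of France?\nAnswer:"
continuation = " Paris"

separate = (
    tok.encode(context, add_special_tokens=False)
    + tok.encode(continuation, add_special_tokens=False)
)
joint = tok.encode(context + continuation, add_special_tokens=False)

# With LLaMA's tokenizer these two token sequences can differ at the boundary,
# which shifts the tokens (and hence the log-likelihood) scored for the continuation.
print(separate == joint)
```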

@clefourrier the author of the commit gave me a detailed response about the tokenization problem in the lm-evaluation-harness commit used for the Open LLM Leaderboard, which puts LLaMA models at a disadvantage. Please check it out here.

Hugging Face H4 org

Thank you!

After these two PRs were merged to the master branch, the MMLU scores (especially for LLaMA-based models) look normal now.
https://github.com/EleutherAI/lm-evaluation-harness/pull/531
https://github.com/EleutherAI/lm-evaluation-harness/pull/497
Would you consider re-evaluating all models on MMLU?

The latest code from Yao gives LLaMA 65B an MMLU score of about 63, while Falcon 40B only gets about 49, far lower than what the leaderboard shows.
Yao's code is available here: https://t.co/peXlNYcMux

Hugging Face H4 org

Hi @DanielTTY - Yao's code is sadly not equivalent to the original evaluation, as it runs MMLU generatively instead of comparing the log-likelihoods of the answer choices.
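
For readers following along, here is a rough sketch of what scoring by choice log-likelihoods looks like (illustrative only; the model name and prompt are placeholders, and this is neither the harness's nor Yao's actual code): each answer choice is scored by the log-probability the model assigns to it after the prompt, and the highest-scoring choice is taken as the prediction, instead of generating free-form text and parsing an answer letter out of it.

```python
# Minimal sketch of log-likelihood multiple-choice scoring (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()


def choice_loglikelihood(prompt: str, choice: str) -> float:
    """Sum of log-probs the model assigns to the tokens of `choice`, given `prompt`."""
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    # Note: measuring the prompt length by tokenizing the prompt on its own is
    # exactly where the boundary issue discussed above can bite.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, 1:]
    cont = slice(prompt_len - 1, None)  # keep only the continuation tokens
    return log_probs[cont].gather(1, targets[cont].unsqueeze(1)).sum().item()


prompt = "Question: What is 2 + 2?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
choices = [" A", " B", " C", " D"]
prediction = max(choices, key=lambda c: choice_loglikelihood(prompt, c))
print(prediction)
```

A generative evaluation instead asks the model to write out an answer and parses it from the text, which also measures instruction following and output formatting, so the two setups are not directly comparable.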

Hi @wjfwzzc, we'll make an announcement later this week, stay tuned!

Hugging Face H4 org

Hi @wjfwzzc, @yahma and @itanh0b!
Did you see our blog post on the MMLU discrepancies, and did it answer your queries?

Hugging Face H4 org

The leaderboard has now been updated with the correct results, closing this issue :)
Feel free to reopen if you have more questions

clefourrier changed discussion status to closed
