Question about the benchmarks
Hi,
I'm interested in understanding the benchmarking methodology used to compare your AI models with those from other companies and teams, specifically with regard to the lm-evaluation-harness framework.
For example, I've noticed that the reported MMLU and MMLU-PRO scores for Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct appear lower than expected (and also lower than what Meta and Qwen themselves report).
Could you provide more details on the settings or configuration used for these benchmarks? I'd like to make sure the comparisons are accurate. Thank you.
Hi,
Some details are already present in the blog post: https://huggingface.co/blog/falcon3#:~:text=In%20our%20internal%20evaluation%20pipeline%3A
Hi, thank you. Yes, I did read that prior to posting, but unfortunately it only provides this one detail:
We report raw scores obtained by applying chat template without fewshot_as_multiturn (unlike Llama3.1).
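If I'm reading that correctly, it maps to something like the following lm-evaluation-harness call. This is my own reconstruction; the model id, task name, dtype, and shot count are guesses on my part, not the team's published configuration:

```python
# Sketch of the setting described above, via the lm-evaluation-harness
# Python API (lm_eval >= 0.4.3). Model id, task name, dtype, and shot
# count are my own assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/Falcon3-3B-Instruct,dtype=bfloat16",
    tasks=["mmlu_pro"],
    num_fewshot=5,
    apply_chat_template=True,    # chat template IS applied
    fewshot_as_multiturn=False,  # few-shot examples NOT folded into a multi-turn dialogue
)
print(results["results"])
```

On the CLI, I believe the equivalent would be passing --apply_chat_template without --fewshot_as_multiturn.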
Part of why I'm asking is that the official open_llm_leaderboard, which is powered by the same lm-evaluation-harness, reports these results on MMLU-PRO:
As you can see, Falcon3-3B-Instruct is slightly outscored here by both models on MMLU-PRO. However, according to your readme, the results are very different:
So, I'm just trying to understand what caused this big discrepancy between scores?
The difference is in "We report raw scores obtained by applying chat template without fewshot_as_multiturn (unlike Llama3.1)":
- we report raw scores, whereas the HF leaderboard reports normalized scores (see the sketch below);
- --fewshot_as_multiturn is not enabled in our evals, whereas it is in the HF leaderboard evals.
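To make the first point concrete, here is a rough sketch of the baseline normalization the leaderboard describes for MMLU-PRO (10 answer choices, so random guessing scores around 10%). Treat the exact formula as an illustration of the idea, not the leaderboard's actual code:

```python
def normalize_leaderboard_score(raw_acc: float, num_choices: int = 10) -> float:
    """Rescale a raw accuracy between the random-guess baseline and 1.0.

    Illustrative only: this mirrors the baseline normalization the Open LLM
    Leaderboard describes; MMLU-PRO has 10 answer options, so the baseline is ~10%.
    """
    baseline = 1.0 / num_choices
    if raw_acc <= baseline:
        return 0.0  # at or below random guessing maps to 0
    return 100.0 * (raw_acc - baseline) / (1.0 - baseline)

# Example: a raw 5-shot accuracy of 0.35 becomes ~27.8 after normalization,
# so leaderboard numbers are not directly comparable to raw readme scores.
print(round(normalize_leaderboard_score(0.35), 1))  # -> 27.8
```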
Got it. So does that mean the few-shot setting made all the difference, since even the raw scores show Falcon 3B scoring slightly lower?
I'm just curious whether this is an accurate reflection when, for example, the Falcon readme shows a 59.9% decrease in MMLU-PRO score between Llama-3.2-3B-Instruct and Falcon3-3B-Instruct.
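To be clear about how I'm reading that figure, here is the relative-gap arithmetic with round placeholder numbers (not the actual readme values):

```python
# Placeholder scores only -- not the actual values from the Falcon3 readme.
falcon3_mmlu_pro = 30.0  # hypothetical Falcon3-3B-Instruct MMLU-PRO score
llama32_mmlu_pro = 12.0  # hypothetical Llama-3.2-3B-Instruct MMLU-PRO score

# Relative decrease from the Falcon score down to the Llama score.
relative_decrease = (falcon3_mmlu_pro - llama32_mmlu_pro) / falcon3_mmlu_pro * 100
print(f"{relative_decrease:.1f}% lower")  # 60.0% with these placeholders
```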
Here's the raw score reported by the leaderboard: