open-llm-leaderboard/open_llm_leaderboard · The BBH results are inconsistent with the official results for Qwen2

Jul 9

BBH official
Qwen2-72B: 82.4
BBH open_llm_leaderboard
Qwen2-72B: 57.48
Qwen2-72B raw: 0.7

Open LLM Leaderboard org Jul 9

Several factors may explain the gap between the official BBH result for Qwen2-72B and the Leaderboard one. First, it looks like you checked the results for the instruct model rather than the base model: the Leaderboard BBH results for Qwen2-72B are 51.86, or 0.66 raw. Second, even though both evaluations use a 3-shot setting, they can differ in which BBH subsets are included (BBH is made up of several subsets), in the metric (we use acc_norm), and in the prompts. Each of these can significantly affect the results.
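
For readers wondering how a raw value like 0.66 relates to a headline score like 51.86: one common approach is to rescale raw accuracy so that random guessing maps to 0 and a perfect score maps to 100, then average over subtasks. The snippet below is only a rough sketch of that idea. The function names, the per-subtask averaging, and the example baselines are assumptions for illustration, not the Leaderboard's actual code, and it is not meant to reproduce the exact numbers above.

```python
from typing import List, Tuple


def normalize_score(raw: float, baseline: float, ceiling: float = 1.0) -> float:
    """Rescale a raw accuracy so `baseline` maps to 0 and `ceiling` to 100.

    Scores below the random-guess baseline are clipped to 0.
    (Hypothetical helper, not taken from the Leaderboard code.)
    """
    if raw < baseline:
        return 0.0
    return (raw - baseline) / (ceiling - baseline) * 100.0


def normalize_benchmark(subtasks: List[Tuple[float, int]]) -> float:
    """Average normalized scores over (raw_accuracy, num_choices) pairs.

    Assumes the random baseline of a subtask is 1 / num_choices.
    """
    scores = [normalize_score(raw, 1.0 / n) for raw, n in subtasks]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative values only: a 4-choice subtask and a 2-choice subtask.
    print(normalize_benchmark([(0.625, 4), (0.75, 2)]))
```

Under this kind of rescaling, the normalized score is always lower than the raw accuracy expressed as a percentage (except at 100), which is one more reason a headline number can sit well below a self-reported accuracy.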

alozowski changed discussion status to closed Jul 9