The BBH results are inconsistent with the official Qwen2 results
#827, opened by peels7877
- BBH official
  Qwen2-72B: 82.4
- BBH open_llm_leaderboard
  Qwen2-72B: 57.48
  Qwen2-72B raw: 0.7
Hi @peels7877,
The inconsistency between the official Qwen2-72B BBH results and the Leaderboard ones may be due to several factors. First, it appears you checked the results for the instruct model rather than the base model: the Leaderboard BBH results for Qwen2-72B are 51.86, and 0.66 for Raw. Additionally, even though both evaluations use a 3-shot setting, they may differ in which BBH subsets are evaluated (BBH contains several), in the metric used (we use acc_norm), and in the prompts. Each of these elements can significantly influence the results.
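Note also that the Raw value and the displayed score are on different scales. As a rough illustration of how a raw accuracy can map to a lower displayed number, here is a minimal sketch that rescales each subtask against a random-guess baseline and then averages. It assumes a baseline of 1 / (number of answer choices) per subtask. The function names, subtask names, and accuracies below are hypothetical placeholders, not the Leaderboard's actual code or Qwen2 results.

```python
from statistics import mean


def normalize_within_range(raw: float, lower: float, higher: float = 1.0) -> float:
    """Rescale `raw` so that `lower` maps to 0 and `higher` maps to 100.

    Scores below the baseline are clipped to 0.
    """
    if higher <= lower:
        raise ValueError("higher must be greater than lower")
    return max(raw - lower, 0.0) / (higher - lower) * 100


def normalized_benchmark_score(subtasks: dict[str, tuple[float, int]]) -> float:
    """Average the normalized scores of all subtasks.

    `subtasks` maps a subtask name to (raw accuracy, number of answer choices).
    The random-guess baseline is assumed to be 1 / number of choices.
    """
    return mean(
        normalize_within_range(acc, lower=1.0 / n_choices)
        for acc, n_choices in subtasks.values()
    )


# Hypothetical subtasks, for illustration only.
example = {
    "two_choice_task": (0.75, 2),    # baseline 0.50
    "four_choice_task": (0.625, 4),  # baseline 0.25
}
score = normalized_benchmark_score(example)
```

In this toy case the mean raw accuracy is about 0.69, while the normalized score comes out noticeably lower, so the two numbers should not be compared directly with each other or with a figure reported on a different scale.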
alozowski changed discussion status to closed