Why do your reported benchmark scores differ from those on the leaderboard?
#5, opened by diegottt
Scores reported at https://qwenlm.github.io/blog/qwen2.5/ vs. those shown at https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard:
MATH: 80 vs. 0
GPQA: 45.5 vs. 9.62
etc.