GSM8K: which value will be selected?

by DavidGF - opened

Hello everyone,
I wanted to ask how there can be such a big difference between the GSM8K score shown on the leaderboard for our SauerkrautLM-Qwen-32b model and the score we measured ourselves.
We tested the model with the current lm-evaluation-harness suite.
Which GSM8K value does the leaderboard report: strict-match or flexible-extract?

Many thanks in advance

Hugging Face H4 org

Hi @DavidGF,
To reproduce our results, you can follow the steps in the Reproducibility section of the About tab.
In particular, did you use the same commit as us?

Hey @clefourrier
Thanks for the advice,
I have now tested with the exact leaderboard version, and the result is indeed significantly lower than with the newest version of lm-evaluation-harness, which is strange.

I can't explain the big difference here. I'll investigate it.

DavidGF changed discussion status to closed
Hugging Face H4 org

The version we've been using dates from one year ago, and its normalization and end-of-generation tokens are not the same as in the current version.
We've been keeping this one to ensure we evaluate all models in a strictly comparable setup, though it might not be the "best possible" setup.
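
To make this concrete, here is a rough sketch of how the answer-extraction rule and the stop sequences can change which number gets scored for the same generation. This is only an illustration, not the harness's code: the regular expressions, the function names, and the stop sequence are all assumptions made up for the example.

```python
import re

# Hypothetical patterns for illustration only; they are not copied from
# any version of lm-evaluation-harness.
STRICT_PATTERN = re.compile(r"#### (-?[0-9.,]+)")       # requires the "#### <answer>" format
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")    # matches any number in the text


def normalize(number_text):
    """Drop thousands separators and a trailing period before comparing."""
    return number_text.replace(",", "").rstrip(".")


def strict_extract(generation):
    """Return the answer only if the model used the exact '#### <answer>' format."""
    match = STRICT_PATTERN.search(generation)
    return normalize(match.group(1)) if match else None


def flexible_extract(generation):
    """Return the last number found anywhere in the generation."""
    numbers = NUMBER_PATTERN.findall(generation)
    return normalize(numbers[-1]) if numbers else None


def truncate_at_stop(generation, stop_sequences):
    """Cut the generation at the earliest stop sequence, if any occurs."""
    cut = len(generation)
    for stop in stop_sequences:
        index = generation.find(stop)
        if index != -1:
            cut = min(cut, index)
    return generation[:cut]


# A correct answer (18) followed by text the model kept generating.
generation = "She earns 9 * 2 = 18 dollars. The answer is 18.\n\nQuestion: Tom has 5 apples"

print(strict_extract(generation))                              # no "####" marker present
print(flexible_extract(generation))                            # picks up a number from the extra text
print(flexible_extract(truncate_at_stop(generation, ["\n\n"])))  # stops before the extra text
```

In this sketch the same generation is scored as wrong, wrong, or right depending only on the extraction rule and on where the generation is cut off, which is the kind of effect that can separate two harness versions.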
