GSM8K (5-shot) performance is quite different compared to running lm_eval locally

#755
by mobicham - opened

There's a big difference between the GSM8K score reported on the leaderboard for Llama3-8B-Instruct (68.69) and the one reproduced locally with batch_size=1 (75.97). With the same settings, the rest of the scores are more or less similar.
Why is this score so different? It causes a large difference in the final score (HF: 66.87 vs. reproduced: 68.59) and makes many benchmark comparisons somewhat inaccurate.
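As a rough sanity check (assuming the final score is the plain mean of the six benchmarks, which is how I understand the leaderboard average), the GSM8K gap alone accounts for most of that difference:

```python
# Back-of-the-envelope check of how the GSM8K gap propagates into the final
# average, assuming the leaderboard score is the simple mean of six benchmarks
# (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K).
gsm8k_leaderboard = 68.69
gsm8k_local = 75.97

gap = gsm8k_local - gsm8k_leaderboard  # ~7.28 points on GSM8K alone
avg_shift = gap / 6                    # ~1.21 points on the 6-task mean
print(f"GSM8K gap: {gap:.2f} -> shift of the average: {avg_shift:.2f}")
```

The remaining ~0.5 points of the 1.72-point gap would then come from the small differences on the other tasks.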

Version: lm_eval: 0.4.1 / transformers: 4.41.0.dev0
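For reference, here is a minimal sketch of the local run (settings as described above: 5-shot GSM8K, batch_size=1; the exact model name and dtype are assumptions on my side):

```python
# Minimal reproduction sketch using lm_eval 0.4.x's Python API.
# Model name and dtype are assumptions; adjust to match the local setup.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=1,
)
print(results["results"]["gsm8k"])  # per-metric GSM8K scores
```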

Open LLM Leaderboard org

Hi!
It seems you did not follow the steps in the Reproducibility section of the About tab. I suggest rerunning with the correct commit :)
Feel free to reopen if you still don't get the same results then!

clefourrier changed discussion status to closed

Thank you @clefourrier for your answer.
"for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs" is this done with the correct commit of lm_eval automatically ?

Open LLM Leaderboard org
edited May 28

Hi!
No, you're right on this! We needed a lower baseline and took the one from the paper (which I doubt used the harness), but we should probably use the gpt2 score we ran with lm_eval as the lower baseline instead.
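(For context, a lower baseline is typically used to rescale raw scores so that the baseline maps to 0 and a perfect score to 100; the sketch below only illustrates that idea, with placeholder baseline values rather than the exact numbers or formula we use.)

```python
# Illustration only: min-max rescaling of a raw score against a task-specific
# lower baseline. The baseline values below are placeholders, not leaderboard numbers.
def normalize(raw_score: float, lower_baseline: float, max_score: float = 100.0) -> float:
    return 100.0 * (raw_score - lower_baseline) / (max_score - lower_baseline)

print(normalize(68.69, lower_baseline=17.0))  # stronger baseline -> lower normalized score
print(normalize(68.69, lower_baseline=2.0))   # near-random baseline -> higher normalized score
```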

Why not just use the GSM8K score from lm_eval directly?

Open LLM Leaderboard org

For the lower baseline?
