GSM8K (5-shot) performance is quite different compared to running lm_eval locally

#755
by mobicham - opened

There's a big difference between the GSM8K score reported on the leaderboard for Llama3-8B-Instruct (68.69) and the one reproduced locally with batch_size=1 (75.97). With the same settings, the rest of the scores are more or less similar.
Why is this score so different? It causes a large difference in the final score (HF: 66.87 vs. reproduced: 68.59) and makes many benchmark comparisons somewhat inaccurate.
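As a rough sanity check (assuming the final score is the plain mean of the six benchmarks, which is how I understand the leaderboard average), the GSM8K gap alone accounts for most of that difference:

```python
# Back-of-the-envelope check of how the GSM8K gap propagates into the final
# average, assuming the leaderboard score is the simple mean of six benchmarks
# (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K).
gsm8k_leaderboard = 68.69
gsm8k_local = 75.97

gap = gsm8k_local - gsm8k_leaderboard  # ~7.28 points on GSM8K alone
avg_shift = gap / 6                    # ~1.21 points on the 6-task mean
print(f"GSM8K gap: {gap:.2f} -> shift of the average: {avg_shift:.2f}")
```

The remaining ~0.5 points of the 1.72-point gap would then come from the small differences on the other tasks.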

Version: lm_eval: 0.4.1 / transformers: 4.41.0.dev0
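For reference, here is a minimal sketch of the local run (settings as described above: 5-shot GSM8K, batch_size=1; the exact model name and dtype are assumptions on my side):

```python
# Minimal reproduction sketch using lm_eval 0.4.x's Python API.
# Model name and dtype are assumptions; adjust to match the local setup.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=1,
)
print(results["results"]["gsm8k"])  # per-metric GSM8K scores
```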

Open LLM Leaderboard org

Hi!
It seems you did not follow the steps in the Reproducibility section of the About tab. I suggest rerunning with the correct commit :)
Feel free to reopen if you still don't get the same results then!

clefourrier changed discussion status to closed

Thank you @clefourrier for your answer.
"for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs" is this done with the correct commit of lm_eval automatically ?

Open LLM Leaderboard org
edited May 28

Hi!
No, you're right on this! We needed a lower baseline and took the one from the paper (which I doubt used the harness), but we should probably use the gpt2 score we ran with lm_eval as the lower baseline instead.
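(For context, a lower baseline is typically used to rescale raw scores so that the baseline maps to 0 and a perfect score to 100; the sketch below only illustrates that idea, with placeholder baseline values rather than the exact numbers or formula we use.)

```python
# Illustration only: min-max rescaling of a raw score against a task-specific
# lower baseline. The baseline values below are placeholders, not leaderboard numbers.
def normalize(raw_score: float, lower_baseline: float, max_score: float = 100.0) -> float:
    return 100.0 * (raw_score - lower_baseline) / (max_score - lower_baseline)

print(normalize(68.69, lower_baseline=17.0))  # stronger baseline -> lower normalized score
print(normalize(68.69, lower_baseline=2.0))   # near-random baseline -> higher normalized score
```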

Why not just use the GSM8K score from lm_eval directly?

Open LLM Leaderboard org

For the lower baseline?
