Score gap between leaderboard and local runs

#605
by fzzhang - opened

Hi, we are currently testing our model fangzhaoz/pearl7B_tuneonGSM8K. The leaderboard's public score is "acc": 0.3995451099317665 for the GSM8K task, while we got an "exact_match" value of 0.7688 with a stderr of 0.0116 by running the lm-evaluation-harness locally.

We are wondering why there is such a large gap. Is it because we're looking at the wrong metric locally?

FYI, here's the command we use for local evaluation:
lm_eval --model hf --model_args pretrained=fangzhaoz/pearl7B_tuneonGSM8K --tasks gsm8k --device cuda:0 --batch_size 8

and here are the results it returns:

| Tasks | Version | Filter     | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------|--------|-------------|--------|----------|
| gsm8k | 2       | get-answer | 5      | exact_match | 0.7688 | ± 0.0116 |
clefourrier (Open LLM Leaderboard org)

Hi!
Did you follow the same steps as we did, notably using the same commit of the harness? (I think you used a more recent version of the harness.)
(Everything is detailed on the About page.)
You can also take a look at the difference between your outputs and the details we save (accessible by clicking the page up icon next to the model name in the leaderboard).
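For readers comparing numbers, here is a minimal sketch of the kind of pinned setup the reply describes: checking out the harness at the commit the leaderboard uses rather than running the latest release. The commit hash below is a placeholder and the invocation only approximates the reproducibility command; both are given on the About page.

```
# Sketch only: pin the harness to the commit listed on the leaderboard's About page
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout <harness-commit-from-about-page>   # placeholder, see the About page
pip install -e .

# Older harness commits are driven via main.py rather than the lm_eval entrypoint;
# these flags approximate the command shown on the About page
python main.py --model=hf-causal-experimental \
    --model_args="pretrained=fangzhaoz/pearl7B_tuneonGSM8K" \
    --tasks=gsm8k --num_fewshot=5 --batch_size=1
```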
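To diff individual predictions against what the leaderboard saved, one option is to download the details dataset from the Hub. The repo name below follows the leaderboard's usual details naming convention but is an assumption here; the exact dataset is linked from the model's row on the leaderboard.

```
# Assumed repo name (details_<org>__<model>); use the link shown on the leaderboard entry
huggingface-cli download open-llm-leaderboard/details_fangzhaoz__pearl7B_tuneonGSM8K \
    --repo-type dataset --local-dir pearl7B_gsm8k_details
```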

clefourrier changed discussion status to closed
