Spaces: HuggingFaceH4/open_llm_leaderboard

GSM8K: which value will be selected?

#682

by DavidGF - opened 21 days ago

Discussion

DavidGF

21 days ago

Hello everyone,
I wanted to ask how there can be such a big difference between our own GSM8K result and the leaderboard's for our SauerkrautLM-Qwen-32b model.
We tested the model with the current version of lm-evaluation-harness.
Which GSM8K value is reported on the leaderboard: strict-match or flexible-extract?

Many thanks in advance

clefourrier

Hugging Face H4 org 21 days ago

Hi @DavidGF,
To reproduce our results, you can follow the steps in the Reproducibility section of the About tab.
In particular, did you use the same commit as us?

DavidGF

21 days ago

Hey @clefourrier
Thanks for the advice,
I have now tested with the exact leaderboard version, and the result is indeed significantly lower than with the newest version of lm-evaluation-harness, which is strange.

I can't explain the big difference here. I'll investigate.

DavidGF changed discussion status to closed 21 days ago

clefourrier

Hugging Face H4 org 20 days ago

The version we've been using dates from one year ago, and its answer normalization and end-of-generation tokens differ from those of the current version.
We've kept this version so that all models are evaluated in a strictly comparable setup, even though it might not be the "best possible" setup.
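
To illustrate why these two settings can move a GSM8K score, here is a minimal sketch. It is not the harness's actual code: the regex patterns, the `normalize` and `truncate` helpers, and the stop sequences are all assumptions chosen for illustration, and they may differ from what any given lm-evaluation-harness commit uses. It shows that the same generation can yield an answer or no answer depending on where generation is cut and how the number is extracted.

```python
import re

# Hypothetical patterns for illustration only; they are not guaranteed to
# match the filters of any particular lm-evaluation-harness commit.
STRICT = re.compile(r"#### (-?[0-9.,]+)")   # requires the "#### <number>" format
FLEXIBLE = re.compile(r"-?[0-9][0-9.,]*")   # accepts any number in the text


def normalize(ans):
    """Drop thousands separators and a trailing period."""
    return ans.replace(",", "").rstrip(".")


def truncate(text, stop_sequences):
    """Cut the generation at the earliest stop sequence, if any occurs."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def strict_match(text):
    """Return the number after '#### ', or None if that format is absent."""
    m = STRICT.search(text)
    return normalize(m.group(1)) if m else None


def flexible_extract(text):
    """Return the last number found anywhere in the text, or None."""
    matches = FLEXIBLE.findall(text)
    return normalize(matches[-1]) if matches else None


generation = "The total is: 9 * 2 = 18 dollars.\n#### 18\n\nQuestion: What next?"

# An aggressive (assumed) stop sequence cuts the answer off entirely.
early = truncate(generation, [":"])
# A more permissive (assumed) stop sequence keeps the full answer.
late = truncate(generation, ["\n\nQuestion:"])

print(repr(early), strict_match(early), flexible_extract(early))
print(repr(late), strict_match(late), flexible_extract(late))

# Without the "####" format, only the flexible extraction finds an answer.
free_form = "So the answer is 1,234."
print(strict_match(free_form), flexible_extract(free_form))
```

Under these assumptions, a model that writes a correct answer in free form, or after a colon, can be scored as wrong by one setup and right by another, which is the kind of gap described above.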
