DROP score unreproducible on local server

#409
by leejunhyeok - opened

Hi.
I have been trying to reproduce the leaderboard scores on my local A100 server, but I cannot match the leaderboard scores on a few tasks: GSM8K and DROP (with open weights, e.g. lvkaokao).
Here is an example of the pattern I cannot reproduce.
Settings used: identical to the Open LLM Leaderboard About page.
For the same doc id, my model generates a blank sequence, but the leaderboard results are not blank.

Also, the score changes dramatically when I change the batch size (which I did to shorten the long evaluation time). Example case: DROP F1 score:

  • batch size 2: 0.206
  • batch size 8: 0.1464

Do you have any thoughts?

Open LLM Leaderboard org

Hi!
We systematically use a batch size of 1 (I need to update the doc about this, very sorry).
I'm surprised you can't reproduce the DROP results. Are you using the same commit?

We know that for generative evals, changing the batch size changes the score; it's an issue you can open on the harness.
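For reference, a reproduction run pinned to batch size 1 might look roughly like the sketch below. This is an illustrative invocation following the lm-evaluation-harness CLI; the commit hash and model name are placeholders you would substitute from the leaderboard's About page, and the flag values (e.g. the few-shot count) should be taken from there as well rather than from this sketch:

```shell
# Hedged sketch: evaluate DROP with batch size 1 on the same harness commit
# the leaderboard used. <leaderboard-commit> and <your-model> are placeholders.
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
git checkout <leaderboard-commit>   # exact commit from the About page
pip install -e .

python main.py \
    --model hf-causal \
    --model_args pretrained=<your-model> \
    --tasks drop \
    --batch_size 1   # batch size 1, matching the leaderboard setup
```

Pinning the commit matters because prompt templates and metric code in the harness change between versions, which can shift scores independently of the batch-size effect discussed above.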

Open LLM Leaderboard org

Independently, we have found that DROP scores were unreliable, and have decided to remove the task from the leaderboard; we wrote about our findings here.

clefourrier changed discussion status to closed

That's responsible, haha.
