Platypus-30B score discrepancies

#111
by arielnlee - opened

Firstly, thanks so much for putting this space together, it's awesome!

I have a question about a recent model run (Platypus-30B). When we evaluated the model with the Eleuther AI eval harness (using the same versions as the ones used on the Leaderboard), we got 64.6 on ARC, but the leaderboard lists ARC-Challenge as 57.59. The other three metrics are within 1-3 points of what we got when we ran tests, but a 7-point gap seems extreme. Do you have any thoughts about why this could be?
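
For reference, here is a minimal sketch of the kind of harness call we mean, using the EleutherAI lm-evaluation-harness Python API (the model id below is a placeholder, and the exact harness commit and settings pinned by the Leaderboard are assumptions; the Leaderboard's documentation lists the official configuration):

```python
# Minimal sketch of an ARC-Challenge run with the EleutherAI lm-evaluation-harness
# Python API. The model id is a placeholder, and the harness commit/settings pinned
# by the Leaderboard are assumptions - verify them before comparing scores.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",            # Hugging Face causal-LM adapter in the harness
    model_args="pretrained=<your-org>/Platypus-30B",  # placeholder hub id
    tasks=["arc_challenge"],
    num_fewshot=25,               # the Leaderboard evaluates ARC 25-shot
    batch_size=1,
)

# The Leaderboard reports length-normalized accuracy (acc_norm) for ARC.
print(results["results"]["arc_challenge"]["acc_norm"])
```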

Thank you!

Open LLM Leaderboard org

Hi! A number of users have reported discrepancies on ARC - I'm currently investigating and will keep you posted.

Thank you @clefourrier !

Open LLM Leaderboard org
edited Jul 17, 2023

It seems like it could be linked to the problem we identified with ARC scores and LLaMA models (see here) - we'll make sure to re-run Platypus too!

awesome, thanks so much!

arielnlee changed discussion status to closed

Apparently closed -

arielnlee changed discussion status to open
arielnlee changed discussion status to closed
