Platypus-30B score discrepancies
Firstly, thanks so much for putting this space together, it's awesome!
I have a question about a recent model run (Platypus-30B). When we evaluated the model with the eval harness (using the same versions as those used on the Leaderboard), we got 64.6 on ARC, but the Leaderboard lists ARC-Challenge as 57.59. The other three metrics are within 1-3 points of our results, but a 7-point difference seems extreme. Do you have any thoughts on why this could be?
Thank you!
Hi! A number of users have reported discrepancies on ARC - I'm currently investigating, I'll keep you posted.
Thank you @clefourrier !
It seems like it could be linked to the problem we identified with ARC scores and llama models (see here) - we'll make sure to re-run Platypus too!
awesome, thanks so much!