Platypus-30B score discrepancies
Firstly, thanks so much for putting this space together, it's awesome!
I have a question about a recent model run (Platypus-30B). When we evaluated the model with the eval harness (using the same versions as those used on the Leaderboard), we got 64.6 on ARC, but the Leaderboard lists ARC-Challenge as 57.59. The other three metrics are within 1-3 points of our results, but a 7-point difference seems extreme. Do you have any thoughts on why this could be?
Thank you!
Hi! A number of users have reported discrepancies on ARC - I'm currently investigating, I'll keep you posted.
Thank you @clefourrier !
It seems like it could be linked to the problem we identified with ARC scores and llama models (see here) - we'll make sure to re-run Platypus too!
awesome, thanks so much!