Brainstorming: evals changing through time
Following discussions on Twitter, I've been thinking about a hard-to-game leaderboard that we could implement.
Let's imagine we get a community-sourced dataset of 500 non-trivial multiple-choice QA questions (multiple choice because loglikelihood evals are less costly), that is user-built but not public.
We could split it into 40 questions per month (20 questions every two weeks), and either every month or every two weeks, evaluate our models on this "vibes" dataset. The questions and model scores would only become public at the end of an evaluation period (so once every two weeks, or once every month).
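To make the idea concrete, here is a minimal sketch of what the rolling split and per-period evaluation could look like. All names here (the question pool, `score_model`, the slice size) are hypothetical placeholders, not an actual implementation:

```python
import random

def make_slices(questions, per_slice=20, seed=0):
    """Shuffle the private question pool once, then cut it into
    fixed slices of `per_slice` questions (one slice per period)."""
    pool = list(questions)
    random.Random(seed).shuffle(pool)
    return [pool[i:i + per_slice] for i in range(0, len(pool), per_slice)]

def evaluate_period(models, period_questions, score_model):
    """Score every model on the current period's slice.
    Results would stay private until the period closes."""
    return {name: score_model(model, period_questions)
            for name, model in models.items()}
```

With 500 questions and 20 per slice that gives roughly 25 biweekly periods, i.e. about a year of evaluations.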
It would allow us to have a rolling score on all models, and to see when models are trying to game the benchmark: if a model submitted in March retrospectively gets super good results on the Jan-to-March questions, but bad ones from April onwards, it's probably cheating. We could display the average best + min best at the end of the year.
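A rough sketch of that "retrospective vs. prospective" check, assuming we store one accuracy per period per model; the 0.15 gap threshold and the data layout are purely illustrative:

```python
from datetime import date

def looks_like_gaming(scores_by_period, submitted, gap=0.15):
    """Flag a model whose average score on periods released before its
    submission date is much higher than on periods released after it.
    `scores_by_period` is a dict mapping period date -> accuracy."""
    before = [s for d, s in scores_by_period.items() if d < submitted]
    after = [s for d, s in scores_by_period.items() if d >= submitted]
    if not before or not after:
        return False  # not enough periods on one side to compare
    return (sum(before) / len(before)) - (sum(after) / len(after)) > gap
```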
The compute should be less costly if we have very few questions, but we risk having some months where the signal is bad (because a question or two are broken, for example) - that would still be an issue to figure out.
Tbh, this would have to be a leaderboard parallel to the Open LLM one, and I'm not sure how we could manage the compute atm; plus, I would only have bandwidth for this from February at the earliest. But it's a draft of a direction we could go in. Wdyt folks? Any suggestions?
Moving to #481!