Open-LLM performances are plateauing, let’s make the leaderboard steep again


Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago, when they wanted to reproduce and compare results from several published models. It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, were sometimes doubtful, and in most cases simply relied on prompts or evaluation setups optimized to give the models their best chance. They therefore decided to create a place where reference models would be evaluated in the exact same setup (same questions, asked in the same order, etc.) to gather completely reproducible and comparable results; and that’s how the Open LLM Leaderboard was born!

Following a series of highly visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months.

We estimate that around 300,000 community members use and collaborate on it monthly through submissions and discussions.

However, with the leaderboard’s success and the steadily increasing performance of models came new challenges. After one intense year and a lot of community feedback, we thought it was time for an upgrade! Therefore, we’re introducing the Open LLM Leaderboard v2!

Here is why we think a new leaderboard was needed 👇

Harder, better, faster, stronger: Introducing the Leaderboard v2

The need for a more challenging leaderboard

Over the past year, the benchmarks we were using became overused and saturated:

  1. They became too easy for models. For instance, on HellaSwag, MMLU, and ARC, models are now reaching baseline human performance, a phenomenon called saturation.
  2. Some newer models also showed signs of contamination, by which we mean that they were possibly trained on benchmark data or on data very similar to it. As a result, some scores stopped reflecting the general capabilities of a model and instead reflected over-fitting to specific evaluation datasets, rather than the more general task being tested. This was in particular the case for GSM8K and TruthfulQA, which were included in some instruction fine-tuning sets (a toy overlap check illustrating how such contamination can be flagged is sketched after this list).
  3. Some benchmarks contained errors: MMLU was recently investigated in depth by several groups, who surfaced mistakes in its responses and proposed new versions. Another example was the fact that GSM8K used a specific end-of-generation token (`:`), which unfairly pushed down the performance of many verbose models (the second sketch after this list shows how such a stop token can cut off a correct answer).
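
As a rough illustration of the contamination issue in point 2, here is a minimal, hypothetical sketch of an n-gram overlap check, in the spirit of the long-n-gram filters used in several LLM papers. The function names (`ngrams`, `looks_contaminated`) and the 13-gram window are illustrative choices, not the leaderboard’s actual methodology.

```python
# Toy contamination heuristic (illustrative only): flag a training document
# if it shares any long n-gram with a benchmark item.
def ngrams(text: str, n: int = 13) -> set:
    """All whitespace-token n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(training_doc: str, benchmark_items: list[str], n: int = 13) -> bool:
    """True if the training document shares at least one n-gram with any benchmark item."""
    doc_grams = ngrams(training_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)

# Hypothetical usage: scan a fine-tuning corpus against GSM8K questions.
# flagged = [doc for doc in corpus if looks_contaminated(doc, gsm8k_questions)]
```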
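
To make point 3 concrete, here is a small, self-contained sketch (not the leaderboard’s actual harness code) of how an aggressive `:` stop sequence can truncate a verbose model’s GSM8K answer before the final number appears, so that a strict answer-extraction step scores a correct solution as wrong. The helper names and the sample answer are made up for illustration.

```python
import re

def truncate_at_stop(generation: str, stop_sequences: list[str]) -> str:
    """Cut the generation at the first occurrence of any stop sequence."""
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

def extract_final_number(text: str):
    """Naive GSM8K-style extraction: take the last number appearing in the text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

verbose_answer = (
    "Let's reason step by step: Natalia sold 48 clips in April and half as many "
    "in May, so she sold 48 + 24 = 72 clips in total. The answer is 72."
)

kept = truncate_at_stop(verbose_answer, stop_sequences=[":"])
print(repr(kept))                  # "Let's reason step by step" -- everything after ':' is lost
print(extract_final_number(kept))  # None -> the correct answer never appears, so the model is scored wrong
```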