Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago, when they wanted to reproduce and compare results from several published models. It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, were sometimes doubtful, and in most cases were obtained with prompts or evaluation setups optimized to give the models their best chance. They therefore decided to create a place where reference models would be evaluated in the exact same setup (same questions, asked in the same order, …), to gather completely reproducible and comparable results; and that’s how the Open LLM Leaderboard was born!
Following a series of highly visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months.
We estimate that around 300 000 community members use and collaborate on it monthly through submissions and discussions; usually to:
However, success brought challenges, both for the leaderboard itself and because of the increasing performance of the models. After one intense year and a lot of community feedback, we thought it was time for an upgrade! Therefore, we’re introducing the Open LLM Leaderboard v2!
Here is why we think a new leaderboard was needed 👇
Over the past year, the benchmarks we were using got overused/saturated:
) which unfairly pushed down performance of many verbose models.
We thus chose to completely change the evaluations we are running for the Open LLM Leaderboard v2!
We started looking for new benchmarks with uncontaminated, high-quality datasets that use reliable metrics and measure model capabilities of interest.
We decided to cover the following general tasks: knowledge testing (📚), reasoning on short and long contexts (💭), complex mathematical abilities (🧮), and tasks well correlated with human preference (🤝), like instruction following.
We cover these tasks with 6 benchmarks. Let us present them briefly:
📚 MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper). MMLU-Pro is a refined version of the MMLU dataset. MMLU has been the reference multichoice knowledge dataset. However, recent research showed that it is both noisy (some questions are unanswerable) and now too easy (through the evolution of model capabilities as well as the increase of contamination). MMLU-Pro presents the models with 10 choices instead of 4, requires reasoning on more questions, and has been expertly reviewed to reduce the amount of noise. It is higher quality than the original, and (for the moment) harder.
📚 GPQA (Google-Proof Q&A Benchmark, paper). GPQA is an extremely hard knowledge dataset, where questions were designed by domain experts in their field (PhD-level in biology, physics, chemistry, …) to be hard to answer by laypersons, but (relatively) easy for experts. Questions have gone through several rounds of validation to ensure both difficulty and factuality. The dataset is also only accessible through gating mechanisms, which should reduce the risks of contamination. (This is also why we don’t provide a plain text example from this dataset, as requested by the authors in the paper).
💭 MuSR (Multistep Soft Reasoning, paper). MuSR is a very fun new dataset, made of algorithmically generated complex problems of around 1K words in length. Problems are either murder mysteries, object placement questions, or team allocation optimizations. To solve these, the models must combine reasoning and very long range context parsing. Few models score better than random performance.
🧮 MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper). MATH is a compilation of high-school level competition problems gathered from several sources, formatted consistently using LaTeX for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only the hardest questions.
🤝 IFEval (Instruction Following Evaluation, paper). IFEval is a fairly interesting dataset, which tests the capability of models to clearly follow explicit instructions, such as “include keyword x” or “use format y”. The models are tested on their ability to strictly follow formatting instructions, rather than the actual contents generated, which allows the use of strict and rigorous metrics.
🧮 🤝 BBH (Big Bench Hard, paper). BBH is a subset of 23 challenging tasks from the BigBench dataset, which 1) use objective metrics, 2) are hard, in the sense that language models did not originally outperform human baselines on them, and 3) contain enough samples to be statistically significant. They contain multistep arithmetic and algorithmic reasoning (understanding boolean expressions, SVG for geometric shapes, etc.), language understanding (sarcasm detection, name disambiguation, etc.), and some world knowledge. Performance on BBH has been on average very well correlated with human preference. We expect this dataset to provide interesting insights into specific capabilities which could interest people.
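To make the idea behind IFEval's strictly checkable instructions more concrete, here is a minimal Python sketch. It is not the IFEval implementation: the checker functions, the scoring helper and the sample response are all hypothetical, chosen only to mirror the “include keyword x” and “use format y” examples (with “format y” assumed here to mean “valid JSON”).

```python
import json


def includes_keyword(response: str, keyword: str) -> bool:
    """Return True if the keyword appears in the response (case-insensitive)."""
    return keyword.lower() in response.lower()


def is_valid_json(response: str) -> bool:
    """Return True if the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def strict_accuracy(response: str, checks) -> float:
    """Fraction of instructions followed; each check is pass/fail, no partial credit."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)


# Hypothetical model output and instructions, for illustration only.
response = '{"animal": "otter", "habitat": "river"}'
checks = [
    lambda r: includes_keyword(r, "otter"),  # "include keyword x"
    is_valid_json,                           # "use format y"
]
score = strict_accuracy(response, checks)
print(score)
```

Because each check only inspects the form of the output, it can be verified programmatically, without a judge model and without assessing whether the content itself is any good.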
In summary, our criteria were:
Should we have included more evaluations?
We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc.), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criteria and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.
But selecting new benchmarks is not the whole story: we also pushed several other interesting improvements to the leaderboard, which we’ll now briefly cover.
We decided to change the final grade for the model. Instead of summing each benchmark output score, we normalized these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute final rankings. For example, in a benchmark containing two choices for each question, a random baseline will get 50 points (out of 100 points). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that raw scores are effectively always between 50 (the lowest score you can reasonably get if the benchmark is not adversarial) and 100. We therefore change the range so that a 50 on the raw score is a 0 on the normalized score. Note that for generative evaluations (like IFEval or MATH), whose random baseline is 0, normalization doesn’t change anything.
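As a rough illustration of this rescaling, here is a minimal Python sketch. It assumes a plain linear mapping from the random baseline to the maximal score, and it also assumes that raw scores below the random baseline are floored at 0; the function name and parameters are hypothetical and are not taken from the leaderboard code.

```python
def normalize_score(raw: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Linearly rescale a raw score so that random_baseline -> 0 and max_score -> 100.

    Assumption: raw scores below the random baseline are floored at 0.
    """
    if raw <= random_baseline:
        return 0.0
    return 100.0 * (raw - random_baseline) / (max_score - random_baseline)


# Two-choice benchmark: random guessing yields 50 raw points.
two_choice = [normalize_score(raw, random_baseline=50.0) for raw in (50.0, 75.0, 100.0)]
print(two_choice)  # [0.0, 50.0, 100.0]

# Generative benchmark: the random baseline is 0, so the score is unchanged.
generative = normalize_score(30.0, random_baseline=0.0)
print(generative)  # 30.0
```

With this mapping, a raw 75 on a two-choice benchmark is worth only 50 normalized points, since half of the remaining range above random has been covered.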
This change is more significant than it may seem as it can be seen as changing the weight assigned to each benchmark in the final average score.
On the above figure, we plot the mean scores for our evaluations, with normalized scores on the right and raw scores on the left. If you look only at the raw scores, you would conclude that the hardest benchmarks are MATH Level 5 and MMLU-Pro (lowest raw averages). However, our 2 hardest evaluations are actually MATH Level 5 and GPQA. GPQA is considerably harder than its raw average suggests (PhD level questions!): most models of today get close to random performance on it, so there is a huge difference between its unnormalized score and its normalized score, where the random baseline is assigned zero points!
This change thus also affects model ranking in general. Say we have two very hard evaluations, one generative and one multichoice with two options per sample. Model A gets 0 on the generative evaluation and 52 on the multichoice, and model B gets 10 on the generative and 40 on the multichoice. If you look at the raw averages, you could conclude that model A is better, with an average score of 26, while model B’s average is 25. However, for the multichoice benchmark, they are in fact both similarly bad (!): 52 is almost a random score on the multichoice evaluation, and 40 is an unlucky random score. This becomes obvious when taking the normalized scores, where A gets around 4 and B, being below the random baseline, gets 0. However, on the generative evaluation, model B is 10 points better! If we take the normalized averages, we would get 5 for model B and around 2 for model A, hence a very different ranking.
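The ranking flip in this example can be checked with a few lines of Python. This is only a sketch: it assumes a linear rescale between the random baseline and 100, with below-baseline scores floored at 0, and the helper names are hypothetical. The point is the change in ordering rather than the exact decimals.

```python
def normalize_score(raw, random_baseline, max_score=100.0):
    """Linear rescale; scores below the random baseline are floored at 0 (an assumption)."""
    return max(0.0, 100.0 * (raw - random_baseline) / (max_score - random_baseline))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Random baselines: 0 for the generative task, 50 for the two-option multichoice task.
baselines = {"generative": 0.0, "multichoice": 50.0}

# Raw scores taken from the example in the text.
raw_scores = {
    "model A": {"generative": 0.0, "multichoice": 52.0},
    "model B": {"generative": 10.0, "multichoice": 40.0},
}

raw_avg = {model: mean(scores.values()) for model, scores in raw_scores.items()}
norm_avg = {
    model: mean(normalize_score(value, baselines[task]) for task, value in scores.items())
    for model, scores in raw_scores.items()
}

print(raw_avg)   # model A ahead on raw averages
print(norm_avg)  # model B ahead once scores are normalized
```

Normalizing effectively reweights the benchmarks: points scored near the random baseline of a multichoice task stop counting for much, so the generative result dominates the comparison.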
A year ago, we made the choice to use the Harness (lm-eval) from EleutherAI to power our evaluations. It provides a standard and stable implementation for a number of tasks. To ensure fairness and reproducibility, we pinned the version we were using, which allowed us to compare all models in an apples to apples setup, as all evaluations were run in exactly the same way, on the same hardware, using the same evaluation suite commit and parameters.
However, lm-eval evolved, and the implementation of some tasks or metrics changed, which led to discrepancies between 1) evaluation results people would get on more recent versions of the harness and 2) our results using our pinned version.
For the new version of the Open LLM Leaderboard, we have therefore worked together with the amazing EleutherAI team (notably Hailey Schoelkopf, so many huge kudos!) to update the harness.
On the feature side, we added support in the harness for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.
On the task side, we took a couple of weeks to manually check all implementations and generations thoroughly, and to fix the problems we observed, such as inconsistent few-shot samples and overly restrictive end-of-sentence tokens. We created specific configuration files for the leaderboard task implementations, and are now working on adding a test suite to make sure that evaluation results for the leaderboard tasks remain unchanged over time.
This should allow us to keep our version up to date with new features added in the future!
Enough said on the leaderboard backend and metrics, now let’s turn to the models and model selection/submission.
Throughout the year, we’ve evaluated more than 7.5K models, and observed that the community did not use all of them equally.
The most used ones are usually new base pretrained models, often built using a lot of compute, which can later be fine-tuned by the community for their own use cases (such as Meta’s Llama3 or Alibaba’s Qwen2). Some high-quality chat or instruction models also find a large user community, for instance Cohere’s Command R+, and also become strong starting points for community experiments. ♥️
However, the story can be different for other models, even when they rank at the top of the leaderboard. A number of models are experimental, fascinating and impressive concatenations of more than 20 steps of fine-tuning or merging.
However, these models present some challenges, as:
To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.
In this list, you’ll find LLMs from model creators who spent time and care on creating and delivering new cool models. We include big companies like Meta or Google, startups like Cohere or Mistral, collectives like EleutherAI or NousResearch, and individual users, among many others.
This list will evolve based on community suggestions, and will aim to include SOTA LLMs as they come out. We will also try to evaluate these models first, as they are more valuable to the community.
We hope it will also make it easier for non-ML users to choose among the many, many models we evaluate.
For the Open LLM Leaderboard v1, evaluations were run in a “first come, first served” manner. However, some users were submitting many new LLMs at once, blocking the queue for the rest of the community with experimental or low quality models.
As the Open LLM Leaderboard is running on the spare cycles of the Hugging Face science cluster, our automatic evaluations can only take place when nodes are free. Every other job has priority over our evaluations. When a new model is training or a dataset is brewing, users sometimes need to wait a couple of days or longer for evaluations to be run. (But then, they get a cool model or dataset from our research team, like Idefics or FineWeb-Edu!)
For the Open LLM Leaderboard v2, we have introduced a voting system for submitted models. It will prioritize running models with the most votes first, and if a model has an extremely high number of votes when the cluster is full, we’ll consider running it manually.
For accountability, we require users who vote to be logged in to their Hugging Face account, and we store all the votes. This will therefore prioritize models that the community is enthusiastic about, no matter their origin.
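The prioritization rule described above can be pictured with a short Python sketch. It is purely illustrative and not the leaderboard's backend: the `Submission` class, its fields, the sample model names and the tie-breaking rule (earlier submission first) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    model_id: str      # hypothetical model identifier
    votes: int         # number of community votes
    submitted_at: int  # submission order (smaller means earlier)


def evaluation_order(pending):
    """Most-voted models first; ties broken by submission order (an assumption)."""
    return sorted(pending, key=lambda s: (-s.votes, s.submitted_at))


# Hypothetical queue, for illustration only.
pending = [
    Submission("org-a/model-1", votes=3, submitted_at=1),
    Submission("org-b/model-2", votes=12, submitted_at=2),
    Submission("org-c/model-3", votes=3, submitted_at=0),
]
order = [s.model_id for s in evaluation_order(pending)]
print(order)  # ['org-b/model-2', 'org-c/model-3', 'org-a/model-1']
```

Compared with a strict “first come, first served” queue, a single user submitting many models at once no longer delays a model that many people want to see evaluated.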
Our regular users might have noticed that in the last month, our front end became much faster.
This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard gradio component! It notably loads data client side, which makes any column selection or search virtually instantaneous!
We’ve also decided to remove the FAQ and About tabs from the Leaderboard, as we noticed that a number of users were not finding the tabs, and they were crowding the interface. They are now in their own dedicated documentation page, that you can find here!
# Results!
For version 2, we chose to initialize the leaderboard with the maintainer’s choice models only. But as always, submissions are open!
When looking at the top 10 of the Open LLM Leaderboard and comparing v2 with v1, 5 models appear to have a relatively stable ranking: Meta’s Llama3-70B, in both its instruct and base versions, 01-ai’s Yi-1.5-34B, chat version, Cohere’s Command R+ model, and lastly Smaug-72B, from AbacusAI.
Rank | Leaderboard v1 | Leaderboard v2 |
---|---|---|
⭐ | abacusai/Smaug-72B-v0.1 | meta-llama/Meta-Llama-3-70B-Instruct |
2 | meta-llama/Meta-Llama-3-70B-Instruct | microsoft/Phi-3-medium-4k-instruct |
3 | abacusai/Smaug-34B-v0.1 | 01-ai/Yi-1.5-34B-Chat |
4 | mlabonne/AlphaMonarch-7B | abacusai/Smaug-72B-v0.1 |
5 | mlabonne/Beyonder-4x7B-v3 | CohereForAI/c4ai-command-r-plus |
6 | 01-ai/Yi-1.5-34B-Chat | Qwen/Qwen1.5-110B-Chat |
7 | CohereForAI/c4ai-command-r-plus | NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO |
8 | upstage/SOLAR-10.7B-Instruct-v1.0 | meta-llama/Meta-Llama-3-70B |
9 | meta-llama/Meta-Llama-3-70B | 01-ai/Yi-1.5-9B-Chat |
10 | 01-ai/Yi-1.5-34B | 01-ai/Yi-1.5-34B-32K |
We’ve been particularly impressed by Llama-3-70B-Instruct, which is the best model across many evaluations (though it scores 15 points lower than its base counterpart on GPQA - does instruct tuning remove knowledge?).
Interestingly, a new challenger climbed the ranks to arrive in 2nd place despite its smaller size: Phi-3-medium-4K-instruct, with only 14B parameters but performance equivalent to models 2 to 4 times its size.
We also provide the most important top and bottom ranking changes.
Depending on your use case, you should look at different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you could be interested in specific capabilities instead.
For example, the results of our different evaluations are not all correlated with one another, which is expected.
MMLU-Pro, BBH and ARC-challenge are well correlated with one another. It is known that these 3 are well correlated with human preference (as they tend to align with human judgment on LMSys’s Chatbot Arena).
IFEval is also linked to chat-related capabilities, since it investigates whether models can follow precise instructions or not. However, contrary to the others, its format favors chat or instruction-tuned models, with pretrained models having a harder time performing as well.
If you are more interested in knowledge than alignment with human preference, the most relevant evaluations for you would be MMLU-Pro and GPQA.
Both MMLU-Pro scores (in orange) and GPQA scores (in yellow) are reasonably correlated with reference MMLU scores from the Open LLM Leaderboard v1. However, since GPQA is much harder, the scores are overall much lower.
MATH-Lvl5 is, obviously, interesting for people concerned with math capabilities. Its results are correlated with GSM8K, except for some outliers. In the green box are models which scored 0 on GSM8K in the first leaderboard, but now have good scores on MATH-Lvl5 (mostly models from 01-ai) - it’s likely they were penalized by the previous format and stop tokens. In the red box are models which scored high on GSM8K but are now at 0 on MATH-Lvl5. From our current observations, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!). This seems to imply that some chat tuning can impair math capabilities (from our observations, by making models exceedingly verbose).
MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with a context size of 10K or more, and it seems discriminative enough to target long context reasoning specifically.
Much like v1 drove model development during the last year, especially for the community, we hope that v2 will be a cornerstone of model evaluations.
You’ll still be able to find all the v1 results in the Open LLM Leaderboard Archive, and we are preparing an in-depth blog post about what we learned while taking care of the leaderboard!
When looking at the evolution of all submitted models on the Open LLM Leaderboard v1 through time, we observe a trend where we go from bigger (red dots) to smaller (yellow dots), but better performing models.
We hope that we will observe similar patterns of progress with the leaderboard v2, where our starting point is much lower (black dots).