Tweaks / improvements: Add column with parameter count and mouseover labels for datasets

#795
by kerrmetric - opened

The new OpenLLM leaderboard is fantastic. A few improvements I'd love to see:

  • Add a column with parameter counts (or, even better, estimated VRAM needed)
  • Add mouseover labels for the eval datasets to help contextualize what these benchmarks cover

Ultra nice to have: Would love to see a few top closed models also evaluated against this methodology - it doesn't need to refresh often, but it would be helpful to see how much room there is to grow.

Open LLM Leaderboard org

Hi!
There's already a parameter-count column you can toggle on the left, and you can also filter by parameter size range on the right.
The eval contents are here, and you'll also find a visualiser for some of the samples in our blog to help people contextualize them. We have experimented with tooltips and the like, but they tend to bloat the interface and don't render well on mobile, so we won't go that route.

Lastly, we will not evaluate closed-source models. As explained in our FAQ, the leaderboard focuses on open-source models to ensure transparency, reproducibility, and fairness. Closed-source models can change their APIs unpredictably, making it difficult to guarantee consistent and accurate scoring. Additionally, we rerun all evaluations on our cluster to maintain a uniform testing environment, which isn't possible with closed-source models.
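If you want to see what a uniform rerun looks like in practice, here is a minimal sketch of a local run with EleutherAI's lm-evaluation-harness, the harness the leaderboard builds on. The model and the "leaderboard_"-prefixed task names below are assumptions for illustration; list the tasks available in your install to confirm the exact names.

```python
# Minimal sketch: a local, leaderboard-style run via lm-evaluation-harness
# (pip install lm-eval). The "leaderboard_*" task names are an assumption;
# run `lm_eval --tasks list` to see what your installed version ships with.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # a Hugging Face transformers model, loaded locally
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16",
    tasks=["leaderboard_ifeval", "leaderboard_bbh"],  # assumed task names
    batch_size="auto",
)

# Per-task metrics are keyed by task name under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```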

Thanks for the quick and thoughtful response!

Any way to quickly look up what each benchmark is (especially on mobile) would be helpful. Even making the link to the documentation more prominent would be great. For what it's worth, the descriptions of the benchmarks in the blog post are a little clearer and more understandable than the ones behind the current documentation link.

Very much agree with not evaluating closed-source models at any regular cadence. But it would really help to know how big the gap is to the frontier models under your methodology. If folks benchmark Claude, Gemini, or ChatGPT with the latest version of the harness, would you consider linking to it in your FAQ? I did a quick survey of what's available, though I'm sure methodology differences make this quite inaccurate:

  • IFEval: GPT-4 (not 4o) is ~75-85, so SOTA open models are in the ballpark
  • BIG-Bench Hard: I couldn't find a comparison to a reasonably modern model
  • MATH: I couldn't find an eval on just MATH Level 5
  • MMLU-Pro: GPT-4o is ~70% and Llama-3-70B-Instruct is ~55%, so against your better methodology GPT-4o would likely be ~70-75 vs Llama-3-70B at 48
  • GPQA: The methodology differences seem too great ... Llama-3-70B scores only 4.9?! FWIW, Claude 3.5 Sonnet is 59% on the diamond dataset with 0-shot CoT
  • MuSR: Not trivial to compare methodologies. Looks like GPT-4 would be ~30% (since random is 50/50)? See the sketch below.
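For anyone checking my arithmetic on that last point: my understanding is that the leaderboard rescales each raw score so that the random-guess baseline maps to 0 and a perfect score maps to 100. A rough sketch of that rescaling (the 65% raw accuracy and the per-task baselines are illustrative assumptions, not official figures):

```python
def normalize(raw_acc: float, random_baseline: float) -> float:
    """Rescale raw accuracy so the random baseline maps to 0 and 1.0 maps to 100."""
    return max(0.0, (raw_acc - random_baseline) / (1.0 - random_baseline)) * 100

# Two-choice task (random baseline 50%): a raw 65% lands around 30,
# which is where my "GPT-4 at ~30%" guess comes from.
print(round(normalize(0.65, 0.50), 1))  # 30.0

# Ten-option task (random baseline 10%): a raw 70% lands around 66.7.
print(round(normalize(0.70, 0.10), 1))  # 66.7
```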
Open LLM Leaderboard org

We could link to a study evaluating the closed-source models at a given point in time, provided it reported normalised numbers like we do; that's a good idea.

clefourrier changed discussion status to closed

Here is one leaderboard, by the folks who made the MMLU-Pro benchmark, that compares open and closed models using a consistent methodology: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro. It shows Claude 3.5 Sonnet at ~75% vs ~50% for Llama 3.0 70B, which gives a helpful sense of relative scale.
