<html>
<head>
<script src="distill.bundle.js" type="module" fetchpriority="high" blocking></script>
<script src="main.bundle.js" type="module" fetchpriority="low" defer></script>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta charset="utf8">
<base target="_blank">
<title>Open-LLM performances are plateauing, let’s make it steep again</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<d-front-matter>
<script id='distill-front-matter' type="text/json">{
"title": "Open-LLM performances are plateauing, let’s make it steep again",
"description": "In this blog, we introduce version 2 of the Open LLM Leaderboard, from the reasons for the change to the new evaluations and rankings.",
"published": "Jun 26, 2024",
"affiliation": {"name": "HuggingFace"},
"authors": [
{
"author":"Clementine Fourrier",
"authorURL":"https://huggingface.co/clefourrier"
},
{
"author":"Nathan Habib",
"authorURL":"https://huggingface.co/saylortwift"
},
{
"author":"Alina Lozovskaya",
"authorURL":"https://huggingface.co/alozowski"
},
{
"author":"Konrad Szafer",
"authorURL":"https://huggingface.co/KonradSzafer"
},
{
"author":"Thomas Wolf",
"authorURL":"https://huggingface.co/thomwolf"
}
],
"katex": {
"delimiters": [
{"left": "$$", "right": "$$", "display": false}
]
}
}
</script>
</d-front-matter>
<d-title>
<h1 class="l-page" style="text-align: center;">Open-LLM performances are plateauing, let’s make it steep again</h1>
<div id="title-plot" class="main-plot-container l-screen">
<figure>
<img src="assets/images/banner.png" alt="Banner">
</figure>
</div>
</d-title>
<d-byline></d-byline>
<d-article>
<d-contents>
</d-contents>
<p>Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago, when they wanted to reproduce and compare results from several published models.
It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, were sometimes doubtful, and most of the time
relied on prompts or evaluation setups optimized to give the models their best possible chances. They therefore decided to create a place where reference models would be
evaluated in the exact same setup (same questions, asked in the same order, …) to gather completely reproducible and comparable results; and that’s how the
Open LLM Leaderboard was born!</p>
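<p>To make “exact same setup” concrete, here is a minimal, illustrative sketch of such a fixed evaluation loop. This is not the leaderboard’s actual evaluation code: the <code>generate</code> callable, the dataset format, and the scoring rule are hypothetical placeholders. The point is simply that every model sees the same questions, in the same order, with the same prompt format and the same scoring.</p>
<d-code block language="python">
# Illustrative sketch only: a fixed, reproducible evaluation setup.
# `generate`, `dataset`, and `few_shot_examples` are hypothetical placeholders.

def build_prompt(question, few_shot_examples):
    # Identical few-shot examples, in an identical order, for every model
    shots = "\n\n".join(f"Q: {ex['question']}\nA: {ex['answer']}" for ex in few_shot_examples)
    return f"{shots}\n\nQ: {question}\nA:"

def evaluate(generate, dataset, few_shot_examples):
    correct = 0
    for item in dataset:  # fixed order: no shuffling between runs or between models
        prompt = build_prompt(item["question"], few_shot_examples)
        prediction = generate(prompt)  # greedy decoding assumed, for reproducibility
        correct += int(prediction.strip() == item["answer"].strip())
    return correct / len(dataset)
</d-code>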
<p> Following a series of highly visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months.</p>
<p> We estimate that around 300,000 community members use and collaborate on it monthly through submissions and discussions, usually to: </p>
<ul>
<li> Find state-of-the-art open-source releases, as the leaderboard provides reproducible scores separating marketing fluff from actual progress in the field.</li>
<li> Evaluate their own work, be it pretraining or finetuning, comparing their methods in the open against the best existing models, and earning public recognition for their work.</li>
</ul>
<p> However, with the success of the leaderboard and the ever-increasing performance of models came new challenges. After one intense year and a lot of community feedback, we thought it was time for an upgrade, so we’re introducing the Open LLM Leaderboard v2!</p>
<p>Here is why we think a new leaderboard was needed 👇</p>
<h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>
<h3>The need for a more challenging leaderboard</h3>
<p>
Over the past year, the benchmarks we were using became overused and saturated:
</p>
<ol>
<li>They became too easy for models. On HellaSwag, MMLU, and ARC, for instance, models now reach baseline human performance, a phenomenon called saturation.</li>
<li>Some newer models also showed signs of contamination. By this we mean that models were possibly trained on benchmark data or on data very similar to it. As a result, some scores stopped reflecting a model’s general ability on the task being tested and instead reflected over-fitting to a specific evaluation dataset. This was in particular the case for GSM8K and TruthfulQA, which were included in some instruction fine-tuning sets.</li>
<li>Some benchmarks contained errors: MMLU was recently investigated in depth by several groups, who surfaced mistakes in its reference answers and proposed new versions. Another example was GSM8K, which used a specific end-of-generation token (<code>:</code>) that unfairly pushed down the performance of many verbose models (see the sketch after this list).</li>
</ol>
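<p>To illustrate the last point, here is a simplified, hypothetical sketch of how a stop token interacts with answer extraction. The scoring logic below is not the actual evaluation code; it only shows how truncating a generation at <code>:</code> can cut a verbose but correct answer short before its final number appears.</p>
<d-code block language="python">
# Simplified, hypothetical scoring sketch (not the actual evaluation code):
# the last number found in the generation is taken as the model's answer.
import re

def score_gsm8k(generation, gold_answer, stop_token=None):
    if stop_token is not None:
        generation = generation.split(stop_token)[0]  # truncate at the stop token
    numbers = re.findall(r"-?\d+\.?\d*", generation)
    predicted = numbers[-1] if numbers else None  # last number = extracted answer
    return predicted == gold_answer

verbose = "Step 1: she buys 3 boxes at $4 each, so she spends 12 dollars. The answer is 12"
print(score_gsm8k(verbose, "12"))                  # True: the full answer is kept
print(score_gsm8k(verbose, "12", stop_token=":"))  # False: truncated to "Step 1", the final 12 is lost
</d-code>
<p>A model that narrates its reasoning step by step, using colons along the way, is therefore penalized even when its final answer is right, which matches the issue described above.</p>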
</d-article>
<d-appendix>
<d-bibliography src="bibliography.bib"></d-bibliography>
</d-appendix>
<script>
// Build the table of contents from the article's headings
const article = document.querySelector('d-article');
const toc = document.querySelector('d-contents');
if (toc) {
    const headings = article.querySelectorAll('h2, h3, h4');
    let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
    let prevLevel = 0;
    for (const el of headings) {
        // should element be included in TOC?
        const isInTitle = el.parentElement.tagName == 'D-TITLE';
        const isException = el.getAttribute('no-toc');
        if (isInTitle || isException) continue;
        // give each heading an id so the TOC can link to it
        el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
        const link = '<a target="_self" href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
        // open or close nested lists to match the heading level (h2 = 0, h3 = 1, h4 = 2)
        const level = el.tagName === 'H2' ? 0 : (el.tagName === 'H3' ? 1 : 2);
        while (prevLevel < level) {
            ToC += '<ul>'
            prevLevel++;
        }
        while (prevLevel > level) {
            ToC += '</ul>'
            prevLevel--;
        }
        if (level === 0)
            ToC += '<div>' + link + '</div>';
        else
            ToC += '<li>' + link + '</li>';
    }
    while (prevLevel > 0) {
        ToC += '</ul>'
        prevLevel--;
    }
    ToC += '</nav>';
    toc.innerHTML = ToC;
    toc.setAttribute('prerendered', 'true');
    const toc_links = document.querySelectorAll('d-contents > nav a');
    // Highlight the TOC entry of the heading currently nearest above the top of the viewport
    window.addEventListener('scroll', (_event) => {
        if (typeof (headings) != 'undefined' && headings != null && typeof (toc_links) != 'undefined' && toc_links != null) {
            // Iterate backwards from the last heading; highlight the first one scrolled past and break
            find_active: {
                for (let i = headings.length - 1; i >= 0; i--) {
                    if (headings[i].getBoundingClientRect().top - 50 <= 0) {
                        if (!toc_links[i].classList.contains("active")) {
                            toc_links.forEach((link, _index) => {
                                link.classList.remove("active");
                            });
                            toc_links[i].classList.add('active');
                        }
                        break find_active;
                    }
                }
                // No heading has been scrolled past yet: clear any highlight
                toc_links.forEach((link, _index) => {
                    link.classList.remove("active");
                });
            }
        }
    });
}
</script>
</body>
</html> |