Commit f3ab0dd • Parent(s): b76606d
Clémentine committed: added thom's edits

Files changed:
- dist/index.html +39 -32
- src/index.html +39 -32
dist/index.html
CHANGED
@@ -191,28 +191,31 @@
   </li>
   </ul>
   <p>To highlight high quality models in the leaderboard, as well as prioritize the most useful models for evaluation, we’ve therefore decided to introduce a category called “maintainer’s choice” ⭐.</p>
-  <p>In this list, you’ll find LLMs from model creators
-  <p>
-  <p>We hope it will also make it easier for non ML users to
+  <p>In this list, you’ll find LLMs from model creators with access to a lot of compute power, such as Meta, Google, Cohere or Mistral, as well as well-known collectives like EleutherAI or NousResearch, and power users of the Hugging Face Hub, among others.</p>
+  <p>We plan to keep this list evolving based on community suggestions and our own observations, and will aim to include as many SOTA LLMs as possible as they come out, evaluating these models in priority.</p>
+  <p>We hope it will also make it easier for non-ML users to orient themselves among the many, many models we’ll rank on the leaderboard.</p>

   <h3>Voting on model relevance</h3>
-  <p>For the Open LLM Leaderboard
-  <p>
-  <p>
-
-
+  <p>For the previous version of the Open LLM Leaderboard, evaluations were usually run in a “first submitted, first evaluated” manner. With users sometimes submitting many LLM variants at once, and the Open LLM Leaderboard running on the limited compute of spare cycles on the Hugging Face science cluster, we’ve decided to introduce a voting system for submitted models. The community will be able to vote for models, and we will prioritize running the models with the most votes first, hopefully surfacing the most awaited models at the top of the priority stack. If a model gets an extremely high number of votes when the cluster is full, we could even consider running it manually in place of other internal jobs at Hugging Face.</p>
+  <p>To avoid spamming the vote system, users will need to be connected to their Hugging Face account to vote, and we will save the votes. We hope this system will help us prioritize the models that the community is enthusiastic about.</p>
+  <p>Finally, we’ve been hard at work on improving and simplifying the leaderboard interface itself.</p>
+
   <h3>Better and simpler interface</h3>
-  <p>
-  <p>This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard <code>gradio</code> component! It notably loads data client side, which makes any column selection or search virtually instantaneous!</p>
-  <p>We’ve also decided to
-
-
+  <p>If you’re among our regular users, you may have noticed in the last month that our front end became much faster.</p>
+  <p>This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard <code>gradio</code> component! It notably loads data client side, which makes any column selection or search virtually instantaneous! It’s also a component that you can re-use yourself in your own leaderboard!</p>
+  <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
+
   <h2>New leaderboard, new results!</h2>
+  <p>We’ve started by adding and evaluating the models in the “maintainer’s highlights” section (cf. above) and are looking forward to the community submitting their new models to this new version of the leaderboard!</p>

-  <h3>What
+  <h3>What do the rankings look like?</h3>

-  <p>
-
+  <p>Taking a look at the top 10 models on the previous version of the Open LLM Leaderboard and comparing with this updated version, 5 models appear to have a relatively stable ranking: Meta’s Llama3-70B (both instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
+  <p>We’ve been particularly impressed by Llama-70B-instruct, ranking top across many evaluations (even though this instruct version loses 15 points to its pretrained counterpart on GPQA, which raises the question of whether the particularly extensive instruction fine-tuning done by the Meta team affected some expert/graduate-level knowledge).</p>
+  <p>Also very interesting is the fact that a new challenger climbed the ranks to reach 2nd place despite its smaller size. With only 13B parameters, Microsoft’s Phi-3-medium-4K-instruct model shows performance equivalent to models 2 to 4 times its size. It would be very interesting to have more information on the training procedure for Phi, or an independent reproduction from an external team with open training recipes/datasets.</p>
+  <p>Here is a detailed look at the changes in rankings:</p>
+
+  <table>
   <tr>
   <th>Rank</th>
   <th>Leaderboard v1</th>
@@ -282,11 +285,13 @@
   </div>
   </div>

-
-  <p>Depending on your use case, you should look at different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you could be interested in specific capabilities instead.</p>
+  <p>Let’s finish with some food for thought and advice from the maintainers’ team.</p>

-
+  <h3>Which evaluations should you pay most attention to?</h3>
+  <p>Depending on your practical use case, you should probably focus on different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you might be more interested in specific capabilities.</p>

+  <p>In particular, we observed that our different evaluation results are not always correlated with one another, as illustrated in this correlation matrix:</p>
+
   <div class="main-plot-container">
   <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
   <div id="heatmap">
@@ -294,9 +299,8 @@
   </div>
   </div>

-  <p>MMLU-Pro, BBH and ARC-challenge are well correlated
-
-  <p>IFEval is also linked to chat-related capabilities, since it investigates whether models can follow precise instructions or not. However, contrary to the others, its format discriminates against chat or instruction tuned models, with pretrained models having a harder time performing as well.</p>
+  <p>As you can see, MMLU-Pro, BBH and ARC-challenge are rather well correlated. As other teams have also noted, these 3 benchmarks are also quite correlated with human preference (for instance, they tend to align with human judgment on LMSys’s Chatbot Arena).</p>
+  <p>Another of our benchmarks, IFEval, targets chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction-tuned models, with pretrained models having a harder time reaching high performance.</p>

   <div class="l-body">
   <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
@@ -304,7 +308,8 @@
   </div>


-  <p>If you are
+  <p>If you are especially interested in model knowledge rather than alignment or chat capabilities, the most relevant evaluations for you will likely be MMLU-Pro and GPQA.</p>
+  <p>Let’s see how performance on these updated benchmarks compares to our evaluations on the previous version of the leaderboard.</p>

   <div class="l-body">
   <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
@@ -312,30 +317,32 @@
   </div>


-  <p>
+  <p>As we can see, both MMLU-Pro scores (in orange) and GPQA scores (in yellow) are reasonably correlated with MMLU scores from the Open LLM Leaderboard v1. However, we note that the scores are overall much lower, since GPQA is much harder. There is thus quite some room for models to improve, which is great news :)</p>
+  <p>MATH-Lvl5 is, obviously, interesting for people focusing on math capabilities. The results on this benchmark are generally correlated with performance on GSM8K, except for some outliers, as we can see in the following figure.</p>

   <div class="l-body">
   <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
   <div id="math"></div>
   </div>

-  <p>
-
+  <p>In the green box, we highlight models which previously scored 0 on GSM8K due to the evaluation limitations mentioned above, but now have very decent scores on the new MATH-Lvl5 benchmark. These models (mostly from 01-ai) were quite strongly penalized by the previous format. In the red box, we show models which scored high on GSM8K but are now almost at 0 on MATH-Lvl5. From our current dive into the outputs and behaviors of these models, these would appear to be mostly chat versions of base models (where the base models score higher on MATH!).</p>
+  <p>This observation seems to imply that some chat finetuning procedures can impair math capabilities (from our observations, by making models exceedingly verbose).</p>
   <p>MuSR, our last evaluation, is particularly interesting for long context models. We’ve observed that the best performers are models with 10K and plus of context size, and it seems discriminative enough to target long context reasoning specifically.</p>
+  <p>Let’s conclude with a look at the future of the Open LLM Leaderboard!</p>

   <h2>What’s next?</h2>
-  <p>Much like the
-  <p>
+  <p>Much like the first version of the Open LLM Leaderboard pushed a community approach to model development during the past year, we hope that the new version 2 will be a milestone of open and reproducible model evaluations.</p>
+  <p>Because backward compatibility and open knowledge are important, you’ll still be able to find all the previous results archived in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>!</p>
+  <p>Taking a step back to look at the evolution of all 7400 models evaluated on the Open LLM Leaderboard through time, we can note some much wider trends in the field! For instance, we see a strong trend from larger models (red dots) to smaller models (yellow dots), while performance improves at the same time.</p>

   <div class="l-body">
   <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
   <div id="timewise"></div>
   </div>

-
-  <p>
-
-
+  <p>This is great news for the field, as smaller models are much easier to embed and much more energy/memory/compute efficient, and we hope to observe a similar pattern of progress in the new version of the leaderboard. Given our harder benchmarks, our starting point is for now much lower (black dots), so let’s see where the field takes us in a few months :)</p>
+  <p>If you’ve read to this point, thanks a lot, we hope you’ll enjoy this new version of the Open LLM Leaderboard. May the open-source winds push our LLM boats to sail far away on the sea of deep learning.</p>
+

 </d-article>

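The Leaderboard gradio component mentioned in the diff above can be reused in your own Space. The sketch below is a rough illustration only: it assumes the component is installed from the gradio_leaderboard package and exposes a Leaderboard class that accepts a pandas DataFrame via value; the results table and column names are made-up placeholders, not the leaderboard's actual schema.

# Minimal sketch (Python) of a Space reusing the Leaderboard component.
# Assumes: pip install gradio gradio_leaderboard pandas
import gradio as gr
import pandas as pd
from gradio_leaderboard import Leaderboard  # assumed package/class name

# Placeholder results table; a real leaderboard would load its published scores here.
results = pd.DataFrame(
    {
        "Model": ["org/model-a", "org/model-b", "org/model-c"],
        "Average": [41.2, 38.7, 35.1],
        "MMLU-Pro": [45.0, 40.2, 33.8],
        "BBH": [48.3, 44.1, 39.9],
    }
)

with gr.Blocks() as demo:
    gr.Markdown("# My leaderboard")
    # The component ships the table to the browser, so column selection and
    # search happen client side, which is what makes the new front end feel fast.
    Leaderboard(value=results)

if __name__ == "__main__":
    demo.launch()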
src/index.html
CHANGED
(The diff to src/index.html is identical to the diff to dist/index.html shown above.)
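A note on the correlation analysis discussed in the diff above: the heatmap (assets/images/v2_correlation_heatmap.png) shows pairwise correlations between the v2 benchmarks. As a hedged sketch of how such a matrix can be computed, assuming per-model scores are available in a pandas DataFrame (the numbers below are invented placeholders, not real leaderboard results):

# Sketch (Python): pairwise correlation between benchmark scores across models.
import pandas as pd

# Placeholder scores: one row per model, one column per benchmark.
scores = pd.DataFrame(
    {
        "MMLU-Pro": [45.0, 40.2, 33.8, 28.5],
        "BBH": [48.3, 44.1, 39.9, 30.2],
        "GPQA": [12.1, 10.4, 8.9, 6.3],
        "IFEval": [80.5, 55.0, 42.3, 20.1],
    },
    index=["model-a", "model-b", "model-c", "model-d"],
)

# Pearson correlation between benchmarks; this is the kind of matrix a
# correlation heatmap visualizes.
corr = scores.corr(method="pearson")
print(corr.round(2))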