Clémentine committed on
Commit
fe3bc07
1 Parent(s): 77043b2

fixed table of contents and figures

assets/images/saturation.png ADDED
dist/assets/images/saturation.png ADDED
dist/distill.bundle.js CHANGED
The diff for this file is too large to render. See raw diff
 
dist/distill.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
dist/index.html CHANGED
@@ -49,7 +49,7 @@
49
  </d-front-matter>
50
  <d-title>
51
  <h1 class="l-page" style="text-align: center;">Open-LLM performances are plateauing, let’s make it steep again </h1>
52
- <div id="title-plot" class="main-plot-container l-screen">
53
  <figure>
54
  <img src="assets/images/banner.png" alt="Banner">
55
  </figure>
@@ -80,13 +80,13 @@
80
 
81
  <h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>
82
 
83
- <h3>The need for a more challenging leaderboard</h3>
84
 
85
  <p>
86
  Over the past year, the benchmarks we were using got overused/saturated:
87
  </p>
88
 
89
- <div class="main-plot-container">
90
  <figure><img src="assets/images/saturation.png"/></figure>
91
  <div id="saturation"></div>
92
  </div>
@@ -146,7 +146,10 @@
146
  <h3>Reporting a fairer average for ranking: using normalized scores</h3>
147
 <p>We decided to change the final grade for the model. Instead of summing each benchmark’s output score, we normalize these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute the final rankings. For example, in a benchmark with two choices per question, a random baseline will get 50 points (out of 100). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are effectively always between 50 (the lowest score you can reasonably get if the benchmark is not adversarial) and 100. We therefore rescale the range so that a raw score of 50 becomes a normalized score of 0. Note that for generative evaluations (like IFEval or MATH), this changes nothing.</p>
148
 
149
- <div class="main-plot-container">
 
 
 
150
  <figure><img src="assets/images/normalized_vs_raw_scores.png"/></figure>
151
  <div id="normalisation"></div>
152
  </div>
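To make the rescaling concrete, here is a minimal sketch of the normalization described above (the helper name and exact code are illustrative, not the leaderboard’s actual implementation):

    // Rescale a raw score (in points out of 100) against the benchmark's random baseline.
    // A two-choice benchmark has a 50-point baseline, so a raw 50 maps to 0 and a raw 100 maps to 100.
    // Generative evaluations like IFEval or MATH have a 0-point baseline, so their scores are unchanged.
    function normalizeScore(rawScore, randomBaseline) {
      return (rawScore - randomBaseline) / (100 - randomBaseline) * 100;
    }

    // normalizeScore(50, 50) === 0, normalizeScore(75, 50) === 50, normalizeScore(42, 0) === 42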
@@ -196,120 +199,123 @@
196
 <p>We’ve also decided to remove the FAQ and About tabs from the Leaderboard, as we noticed that a number of users were not finding them and they were crowding the interface. They now live on their own dedicated documentation page, which you can find here!</p>
197
 <p>For version 2, we chose to start by initializing the leaderboard with only the maintainers’ choice models. But as always, submissions are open!</p>
198
 
199
- <h2>What about the rankings?</h2>
 
 
 
200
 <p>When comparing the top 10 of the Open LLM Leaderboard v2 and v1, 5 models keep a relatively stable ranking: Meta’s Llama3-70B (both instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
201
- <table>
202
- <tr>
203
- <td>Rank</td>
204
- <td>Leaderboard v1</td>
205
- <td>Leaderboard v2</td>
206
- </tr>
207
- <tr>
208
- <td>1</td>
209
- <td><b>abacusai/Smaug-72B-v0.1</b></td>
210
- <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
211
- </tr>
212
- <tr>
213
- <td>2</td>
214
- <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
215
- <td><em>microsoft/Phi-3-medium-4k-instruct</em></td>
216
- </tr>
217
- <tr>
218
- <td>3</td>
219
- <td><b>abacusai/Smaug-34B-v0.1</b></td>
220
- <td>01-ai/Yi-1.5-34B-Chat</td>
221
- </tr>
222
- <tr>
223
- <td>4</td>
224
- <td>mlabonne/AlphaMonarch-7B</td>
225
- <td><b>abacusai/Smaug-72B-v0.1</b></td>
226
- </tr>
227
- <tr>
228
- <td>5</td>
229
- <td>mlabonne/Beyonder-4x7B-v3</td>
230
- <td><b>CohereForAI/c4ai-command-r-plus</b></td>
231
- </tr>
232
- <tr>
233
- <td>6</td>
234
- <td><b>01-ai/Yi-1.5-34B-Chat</b></td>
235
- <td>Qwen/Qwen1.5-110B-Chat</td>
236
- </tr>
237
- <tr>
238
- <td>7</td>
239
- <td><b>CohereForAI/c4ai-command-r-plus</b></td>
240
- <td>NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO</td>
241
- </tr>
242
- <tr>
243
- <td>8</td>
244
- <td>upstage/SOLAR-10.7B-Instruct-v1.0</td>
245
- <td><b>meta-llama/Meta-Llama-3-70B</b></td>
246
- </tr>
247
- <tr>
248
- <td>9</td>
249
- <td><b>meta-llama/Meta-Llama-3-70B</b></td>
250
- <td>01-ai/Yi-1.5-9B-Chat</td>
251
- </tr>
252
- <tr>
253
- <td>10</td>
254
- <td>01-ai/Yi-1.5-34B</td>
255
- <td>01-ai/Yi-1.5-34B-32K</td>
256
- </tr>
257
- </table>
258
- <p>We’ve been particularly impressed by Llama-70B-instruct, which is the best model across many evaluations (though it scores 15 points lower than its base counterpart on GPQA - does instruct tuning remove knowledge?).</p>
259
-
260
- <p>Interestingly, a new challenger climbed the ranks to arrive in 2nd place despite its smaller size: Phi-3-medium-4k-instruct, with only 14B parameters but performance equivalent to models 2 to 4 times its size.</p>
261
-
262
- <p>We also provide the most important top and bottom ranking changes.</p>
263
-
264
- <div class="main-plot-container">
265
- <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
266
- <div id="ranking"></div>
267
- </div>
268
-
269
- <h2>Which evaluations should you pay most attention to?</h2>
270
- <p>Depending on your use case, you should look at different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you could be interested in specific capabilities instead.</p>
271
 
272
- <p>For example, our different evaluation results are not all correlated with one another, which is expected.</p>
273
 
274
- <div class="main-plot-container">
275
- <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
276
- <div id="heatmap"></div>
277
- </div>
278
 
279
- <p>MMLU-Pro, BBH and ARC-Challenge are well correlated with one another, and all three are known to correlate well with human preference (as they tend to align with human judgment on LMSys’s Chatbot Arena).</p>
280
 
281
- <p>IFEval is also linked to chat-related capabilities, since it investigates whether models can follow precise instructions or not. However, contrary to the others, its format favors chat or instruction-tuned models, with pretrained models having a harder time performing as well.</p>
282
 
283
- <div class="main-plot-container">
284
- <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
285
- <div id="ifeval"></div>
286
- </div>
287
 
288
 
289
- <p>If you are more interested in knowledge than alignment with human preference, the most relevant evaluations for you would be MMLU-Pro and GPQA.</p>
290
 
291
- <div class="main-plot-container">
292
- <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
293
- <div id="mmlu"></div>
294
- </div>
295
 
296
 
297
- <p>Both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with reference MMLU scores from the Open LLM Leaderboard v1. However, since GPQA is much harder, the scores are overall much lower.</p>
298
 
299
- <div class="main-plot-container">
300
- <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
301
- <div id="math"></div>
302
- </div>
303
 
304
- <p>MATH-Lvl5 is, obviously, interesting for people concerned with math capabilities. Its results are correlated with GSM8K, except for some outliers. In the green box are models which scored 0 on GSM8K in the first leaderboard but now have good scores on MATH-Lvl5 (mostly models from 01-ai) - they were likely penalized by the previous format and stop tokens. In the red box are models which scored high on GSM8K but are now at 0 on MATH-Lvl5. From our current observations, these appear to be mostly chat versions of base models (where the base models score higher on MATH!). This seems to imply that some chat tuning can impair math capabilities, from our observations by making models exceedingly verbose.</p>
305
 
306
- <p>MuSR, our last evaluation, is particularly interesting for long-context models. We’ve observed that the best performers are models with context sizes of 10K tokens and above, and it seems discriminative enough to specifically target long-context reasoning.</p>
307
 
308
  <h2>What’s next?</h2>
309
  <p>Much like the v1 drove model development during the last year, especially for the community, we hope that the v2 will be a cornerstone of model evaluations.</p>
310
  <p>You’ll still be able to find all the v1 results in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>, and we are preparing an in depth blog about what we learned while taking care of the leaderboard!</p>
311
 
312
- <div class="main-plot-container">
313
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
314
  <div id="timewise"></div>
315
  </div>
@@ -329,7 +335,7 @@
329
  const article = document.querySelector('d-article');
330
  const toc = document.querySelector('d-contents');
331
  if (toc) {
332
- const headings = article.querySelectorAll('h3, h3, h4');
333
  let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
334
  let prevLevel = 0;
335
 
@@ -341,7 +347,7 @@
341
  el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
342
  const link = '<a target="_self" href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
343
 
344
- const level = el.tagName === 'h3' ? 0 : (el.tagName === 'H3' ? 1 : 2);
345
  while (prevLevel < level) {
346
  ToC += '<ul>'
347
  prevLevel++;
 
49
  </d-front-matter>
50
  <d-title>
51
  <h1 class="l-page" style="text-align: center;">Open-LLM performances are plateauing, let’s make it steep again </h1>
52
+ <div id="title-plot" class="l-body l-screen">
53
  <figure>
54
  <img src="assets/images/banner.png" alt="Banner">
55
  </figure>
 
80
 
81
  <h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>
82
 
83
+ <h3>The need for a more challenging leaderboard</h3>
84
 
85
  <p>
86
  Over the past year, the benchmarks we were using got overused/saturated:
87
  </p>
88
 
89
+ <div class="l-body">
90
  <figure><img src="assets/images/saturation.png"/></figure>
91
  <div id="saturation"></div>
92
  </div>
 
146
  <h3>Reporting a fairer average for ranking: using normalized scores</h3>
147
 <p>We decided to change the final grade for the model. Instead of summing each benchmark’s output score, we normalize these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute the final rankings. For example, in a benchmark with two choices per question, a random baseline will get 50 points (out of 100). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are effectively always between 50 (the lowest score you can reasonably get if the benchmark is not adversarial) and 100. We therefore rescale the range so that a raw score of 50 becomes a normalized score of 0. Note that for generative evaluations (like IFEval or MATH), this changes nothing.</p>
148
 
149
+ <div class="l-body">
150
+ <!--todo: if you use an interactive visualisation instead of a plot,
151
+ replace the class `l-body` by `main-plot-container` and import your interactive plot in the
152
+ below div id, while leaving the image as such. -->
153
  <figure><img src="assets/images/normalized_vs_raw_scores.png"/></figure>
154
  <div id="normalisation"></div>
155
  </div>
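The todo comment above also shows how interactive figures are wired in: the static image stays as a fallback while an interactive plot is mounted into the sibling div by its id. A minimal sketch, assuming a Plotly-style API and placeholder data (the post’s actual plotting code is not part of this diff):

    // Mount an interactive chart into the empty placeholder div (id="normalisation")
    // that sits next to the static <img> in the container above.
    const placeholderData = [{ x: [1, 2, 3], y: [10, 20, 30], type: 'scatter' }]; // placeholder values only
    Plotly.newPlot('normalisation', placeholderData, { title: 'Normalized vs raw scores' });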
 
199
 <p>We’ve also decided to remove the FAQ and About tabs from the Leaderboard, as we noticed that a number of users were not finding them and they were crowding the interface. They now live on their own dedicated documentation page, which you can find here!</p>
200
 <p>For version 2, we chose to start by initializing the leaderboard with only the maintainers’ choice models. But as always, submissions are open!</p>
201
 
202
+ <h2>New leaderboard, new results!</h2>
203
+
204
+ <h3>What about the rankings?</h3>
205
+
206
 <p>When comparing the top 10 of the Open LLM Leaderboard v2 and v1, 5 models keep a relatively stable ranking: Meta’s Llama3-70B (both instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
207
+ <table>
208
+ <tr>
209
+ <th>Rank</th>
210
+ <th>Leaderboard v1</th>
211
+ <th>Leaderboard v2</th>
212
+ </tr>
213
+ <tr>
214
+ <td>⭐</td>
215
+ <td><b>abacusai/Smaug-72B-v0.1</b></td>
216
+ <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
217
+ </tr>
218
+ <tr>
219
+ <td>2</td>
220
+ <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
221
+ <td><em>microsoft/Phi-3-medium-4k-instruct</em></td>
222
+ </tr>
223
+ <tr>
224
+ <td>3</td>
225
+ <td><b>abacusai/Smaug-34B-v0.1</b></td>
226
+ <td>01-ai/Yi-1.5-34B-Chat</td>
227
+ </tr>
228
+ <tr>
229
+ <td>4</td>
230
+ <td>mlabonne/AlphaMonarch-7B</td>
231
+ <td><b>abacusai/Smaug-72B-v0.1</b></td>
232
+ </tr>
233
+ <tr>
234
+ <td>5</td>
235
+ <td>mlabonne/Beyonder-4x7B-v3</td>
236
+ <td><b>CohereForAI/c4ai-command-r-plus</b></td>
237
+ </tr>
238
+ <tr>
239
+ <td>6</td>
240
+ <td><b>01-ai/Yi-1.5-34B-Chat</b></td>
241
+ <td>Qwen/Qwen1.5-110B-Chat</td>
242
+ </tr>
243
+ <tr>
244
+ <td>7</td>
245
+ <td><b>CohereForAI/c4ai-command-r-plus</b></td>
246
+ <td>NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO</td>
247
+ </tr>
248
+ <tr>
249
+ <td>8</td>
250
+ <td>upstage/SOLAR-10.7B-Instruct-v1.0</td>
251
+ <td><b>meta-llama/Meta-Llama-3-70B</b></td>
252
+ </tr>
253
+ <tr>
254
+ <td>9</td>
255
+ <td><b>meta-llama/Meta-Llama-3-70B</b></td>
256
+ <td>01-ai/Yi-1.5-9B-Chat</td>
257
+ </tr>
258
+ <tr>
259
+ <td>10</td>
260
+ <td>01-ai/Yi-1.5-34B</td>
261
+ <td>01-ai/Yi-1.5-34B-32K</td>
262
+ </tr>
263
+ </table>
264
+ <p>We’ve been particularly impressed by Llama-70B-instruct, which is the best model across many evaluations (though it scores 15 points lower than its base counterpart on GPQA - does instruct tuning remove knowledge?).</p>
265
+
266
+ <p>Interestingly, a new challenger climbed the ranks to arrive in 2nd place despite its smaller size: Phi-3-medium-4k-instruct, with only 14B parameters but performance equivalent to models 2 to 4 times its size.</p>
267
+
268
+ <p>We also provide the most important top and bottom ranking changes.</p>
269
+
270
+ <div class="l-body">
271
+ <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
272
+ <div id="ranking"></div>
273
+ </div>
274
+
275
+ <h3>Which evaluations should you pay most attention to?</h3>
276
+ <p>Depending on your use case, you should look at different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you could be interested in specific capabilities instead.</p>
277
 
278
+ <p>For example, our different evaluation results are not all correlated with one another, which is expected.</p>
279
 
280
+ <div class="l-body">
281
+ <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
282
+ <div id="heatmap"></div>
283
+ </div>
284
 
285
+ <p>MMLU-Pro, BBH and ARC-Challenge are well correlated with one another, and all three are known to correlate well with human preference (as they tend to align with human judgment on LMSys’s Chatbot Arena).</p>
286
 
287
+ <p>IFEval is also linked to chat-related capabilities, since it investigates whether models can follow precise instructions or not. However, contrary to the others, its format favors chat or instruction-tuned models, with pretrained models having a harder time performing as well.</p>
288
 
289
+ <div class="l-body">
290
+ <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
291
+ <div id="ifeval"></div>
292
+ </div>
293
 
294
 
295
+ <p>If you are more interested in knowledge than alignment with human preference, the most relevant evaluations for you would be MMLU-Pro and GPQA.</p>
296
 
297
+ <div class="l-body">
298
+ <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
299
+ <div id="mmlu"></div>
300
+ </div>
301
 
302
 
303
+ <p>Both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with reference MMLU scores from the Open LLM Leaderboard v1. However, since GPQA is much harder, the scores are overall much lower.</p>
304
 
305
+ <div class="l-body">
306
+ <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
307
+ <div id="math"></div>
308
+ </div>
309
 
310
+ <p>MATH-Lvl5 is, obviously, interesting for people concerned with math capabilities. Its results are correlated with GSM8K, except for some outliers. In the green box are models which scored 0 on GSM8K in the first leaderboard but now have good scores on MATH-Lvl5 (mostly models from 01-ai) - they were likely penalized by the previous format and stop tokens. In the red box are models which scored high on GSM8K but are now at 0 on MATH-Lvl5. From our current observations, these appear to be mostly chat versions of base models (where the base models score higher on MATH!). This seems to imply that some chat tuning can impair math capabilities, from our observations by making models exceedingly verbose.</p>
311
 
312
+ <p>MuSR, our last evaluation, is particularly interesting for long-context models. We’ve observed that the best performers are models with context sizes of 10K tokens and above, and it seems discriminative enough to specifically target long-context reasoning.</p>
313
 
314
  <h2>What’s next?</h2>
315
  <p>Much like the v1 drove model development during the last year, especially for the community, we hope that the v2 will be a cornerstone of model evaluations.</p>
316
  <p>You’ll still be able to find all the v1 results in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>, and we are preparing an in depth blog about what we learned while taking care of the leaderboard!</p>
317
 
318
+ <div class="l-body">
319
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
320
  <div id="timewise"></div>
321
  </div>
 
335
  const article = document.querySelector('d-article');
336
  const toc = document.querySelector('d-contents');
337
  if (toc) {
338
+ const headings = article.querySelectorAll('h2, h3, h4');
339
  let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
340
  let prevLevel = 0;
341
 
 
347
  el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
348
  const link = '<a target="_self" href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
349
 
350
+ const level = el.tagName === 'H2' ? 0 : (el.tagName === 'H3' ? 1 : 2);
351
  while (prevLevel < level) {
352
  ToC += '<ul>'
353
  prevLevel++;
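For reference, the table-of-contents fix above works because element.tagName is always uppercase in HTML documents: the old selector 'h3, h3, h4' skipped h2 headings entirely, and the lowercase comparison against 'h3' never matched, so no heading could ever land at the top level. A minimal sketch of the corrected logic, with the list-closing and append steps (not visible in these hunks) filled in for illustration:

    const article = document.querySelector('d-article');
    const toc = document.querySelector('d-contents');
    if (toc) {
      const headings = article.querySelectorAll('h2, h3, h4');
      let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
      let prevLevel = 0;

      headings.forEach(el => {
        // Derive an anchor id from the heading text, then build a link to it.
        el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"));
        const link = '<a target="_self" href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';

        // tagName is uppercase, so compare against 'H2'/'H3': h2 -> level 0, h3 -> level 1, h4 -> level 2.
        const level = el.tagName === 'H2' ? 0 : (el.tagName === 'H3' ? 1 : 2);
        while (prevLevel < level) { ToC += '<ul>'; prevLevel++; }   // open nested lists
        while (prevLevel > level) { ToC += '</ul>'; prevLevel--; }  // close lists when moving back up (assumed)
        ToC += '<li>' + link + '</li>';                             // append the entry (assumed)
      });
      while (prevLevel > 0) { ToC += '</ul>'; prevLevel--; }        // close any remaining lists (assumed)
      toc.innerHTML = ToC + '</nav>';                               // render the nav (assumed)
    }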
src/distill.js CHANGED
@@ -2102,7 +2102,7 @@ d-appendix > distill-appendix {
2102
  </div>
2103
  <div >
2104
  <h3>Published</h3>
2105
- <div>May 31, 2024</div>
2106
  </div>
2107
  </div>
2108
  `;
 
2102
  </div>
2103
  <div >
2104
  <h3>Published</h3>
2105
+ <div>Jun 26, 2024</div>
2106
  </div>
2107
  </div>
2108
  `;
src/index.html CHANGED
@@ -49,7 +49,7 @@
49
  </d-front-matter>
50
  <d-title>
51
  <h1 class="l-page" style="text-align: center;">Open-LLM performances are plateauing, let’s make it steep again </h1>
52
- <div id="title-plot" class="main-plot-container l-screen">
53
  <figure>
54
  <img src="assets/images/banner.png" alt="Banner">
55
  </figure>
@@ -80,13 +80,13 @@
80
 
81
  <h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>
82
 
83
- <h3>The need for a more challenging leaderboard</h3>
84
 
85
  <p>
86
  Over the past year, the benchmarks we were using got overused/saturated:
87
  </p>
88
 
89
- <div class="main-plot-container">
90
  <figure><img src="assets/images/saturation.png"/></figure>
91
  <div id="saturation"></div>
92
  </div>
@@ -146,7 +146,10 @@
146
  <h3>Reporting a fairer average for ranking: using normalized scores</h3>
147
 <p>We decided to change the final grade for the model. Instead of summing each benchmark’s output score, we normalize these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute the final rankings. For example, in a benchmark with two choices per question, a random baseline will get 50 points (out of 100). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are effectively always between 50 (the lowest score you can reasonably get if the benchmark is not adversarial) and 100. We therefore rescale the range so that a raw score of 50 becomes a normalized score of 0. Note that for generative evaluations (like IFEval or MATH), this changes nothing.</p>
148
 
149
- <div class="main-plot-container">
 
 
 
150
  <figure><img src="assets/images/normalized_vs_raw_scores.png"/></figure>
151
  <div id="normalisation"></div>
152
  </div>
@@ -196,120 +199,123 @@
196
 <p>We’ve also decided to remove the FAQ and About tabs from the Leaderboard, as we noticed that a number of users were not finding them and they were crowding the interface. They now live on their own dedicated documentation page, which you can find here!</p>
197
 <p>For version 2, we chose to start by initializing the leaderboard with only the maintainers’ choice models. But as always, submissions are open!</p>
198
 
199
- <h2>What about the rankings?</h2>
 
 
 
200
 <p>When comparing the top 10 of the Open LLM Leaderboard v2 and v1, 5 models keep a relatively stable ranking: Meta’s Llama3-70B (both instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
201
- <table>
202
- <tr>
203
- <td>Rank</td>
204
- <td>Leaderboard v1</td>
205
- <td>Leaderboard v2</td>
206
- </tr>
207
- <tr>
208
- <td>1</td>
209
- <td><b>abacusai/Smaug-72B-v0.1</b></td>
210
- <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
211
- </tr>
212
- <tr>
213
- <td>2</td>
214
- <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
215
- <td><em>microsoft/Phi-3-medium-4k-instruct</em></td>
216
- </tr>
217
- <tr>
218
- <td>3</td>
219
- <td><b>abacusai/Smaug-34B-v0.1</b></td>
220
- <td>01-ai/Yi-1.5-34B-Chat</td>
221
- </tr>
222
- <tr>
223
- <td>4</td>
224
- <td>mlabonne/AlphaMonarch-7B</td>
225
- <td><b>abacusai/Smaug-72B-v0.1</b></td>
226
- </tr>
227
- <tr>
228
- <td>5</td>
229
- <td>mlabonne/Beyonder-4x7B-v3</td>
230
- <td><b>CohereForAI/c4ai-command-r-plus</b></td>
231
- </tr>
232
- <tr>
233
- <td>6</td>
234
- <td><b>01-ai/Yi-1.5-34B-Chat</b></td>
235
- <td>Qwen/Qwen1.5-110B-Chat</td>
236
- </tr>
237
- <tr>
238
- <td>7</td>
239
- <td><b>CohereForAI/c4ai-command-r-plus</b></td>
240
- <td>NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO</td>
241
- </tr>
242
- <tr>
243
- <td>8</td>
244
- <td>upstage/SOLAR-10.7B-Instruct-v1.0</td>
245
- <td><b>meta-llama/Meta-Llama-3-70B</b></td>
246
- </tr>
247
- <tr>
248
- <td>9</td>
249
- <td><b>meta-llama/Meta-Llama-3-70B</b></td>
250
- <td>01-ai/Yi-1.5-9B-Chat</td>
251
- </tr>
252
- <tr>
253
- <td>10</td>
254
- <td>01-ai/Yi-1.5-34B</td>
255
- <td>01-ai/Yi-1.5-34B-32K</td>
256
- </tr>
257
- </table>
258
- <p>We’ve been particularly impressed by Llama-70B-instruct, which is the best model across many evaluations (though it scores 15 points lower than its base counterpart on GPQA - does instruct tuning remove knowledge?).</p>
259
-
260
- <p>Interestingly, a new challenger climbed the ranks to arrive in 2nd place despite its smaller size: Phi-3-medium-4k-instruct, with only 14B parameters but performance equivalent to models 2 to 4 times its size.</p>
261
-
262
- <p>We also provide the most important top and bottom ranking changes.</p>
263
-
264
- <div class="main-plot-container">
265
- <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
266
- <div id="ranking"></div>
267
- </div>
268
-
269
- <h2>Which evaluations should you pay most attention to?</h2>
270
- <p>Depending on your use case, you should look at different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you could be interested in specific capabilities instead.</p>
271
 
272
- <p>For example, our different evaluation results are not all correlated with one another, which is expected.</p>
273
 
274
- <div class="main-plot-container">
275
- <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
276
- <div id="heatmap"></div>
277
- </div>
278
 
279
- <p>MMLU-Pro, BBH and ARC-Challenge are well correlated with one another, and all three are known to correlate well with human preference (as they tend to align with human judgment on LMSys’s Chatbot Arena).</p>
280
 
281
- <p>IFEval is also linked to chat-related capabilities, since it investigates whether models can follow precise instructions or not. However, contrary to the others, its format favors chat or instruction-tuned models, with pretrained models having a harder time performing as well.</p>
282
 
283
- <div class="main-plot-container">
284
- <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
285
- <div id="ifeval"></div>
286
- </div>
287
 
288
 
289
- <p>If you are more interested in knowledge than alignment with human preference, the most relevant evaluations for you would be MMLU-Pro and GPQA.</p>
290
 
291
- <div class="main-plot-container">
292
- <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
293
- <div id="mmlu"></div>
294
- </div>
295
 
296
 
297
- <p>Both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with reference MMLU scores from the Open LLM Leaderboard v1. However, since GPQA is much harder, the scores are overall much lower.</p>
298
 
299
- <div class="main-plot-container">
300
- <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
301
- <div id="math"></div>
302
- </div>
303
 
304
- <p>MATH-Lvl5 is, obviously, interesting for people concerned with math capabilities. Its results are correlated with GSM8K, except for some outliers. In the green box are models which scored 0 on GSM8K in the first leaderboard but now have good scores on MATH-Lvl5 (mostly models from 01-ai) - they were likely penalized by the previous format and stop tokens. In the red box are models which scored high on GSM8K but are now at 0 on MATH-Lvl5. From our current observations, these appear to be mostly chat versions of base models (where the base models score higher on MATH!). This seems to imply that some chat tuning can impair math capabilities, from our observations by making models exceedingly verbose.</p>
305
 
306
- <p>MuSR, our last evaluation, is particularly interesting for long-context models. We’ve observed that the best performers are models with context sizes of 10K tokens and above, and it seems discriminative enough to specifically target long-context reasoning.</p>
307
 
308
  <h2>What’s next?</h2>
309
  <p>Much like the v1 drove model development during the last year, especially for the community, we hope that the v2 will be a cornerstone of model evaluations.</p>
310
  <p>You’ll still be able to find all the v1 results in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>, and we are preparing an in depth blog about what we learned while taking care of the leaderboard!</p>
311
 
312
- <div class="main-plot-container">
313
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
314
  <div id="timewise"></div>
315
  </div>
@@ -329,7 +335,7 @@
329
  const article = document.querySelector('d-article');
330
  const toc = document.querySelector('d-contents');
331
  if (toc) {
332
- const headings = article.querySelectorAll('h3, h3, h4');
333
  let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
334
  let prevLevel = 0;
335
 
@@ -341,7 +347,7 @@
341
  el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
342
  const link = '<a target="_self" href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
343
 
344
- const level = el.tagName === 'h3' ? 0 : (el.tagName === 'H3' ? 1 : 2);
345
  while (prevLevel < level) {
346
  ToC += '<ul>'
347
  prevLevel++;
 
49
  </d-front-matter>
50
  <d-title>
51
  <h1 class="l-page" style="text-align: center;">Open-LLM performances are plateauing, let’s make it steep again </h1>
52
+ <div id="title-plot" class="l-body l-screen">
53
  <figure>
54
  <img src="assets/images/banner.png" alt="Banner">
55
  </figure>
 
80
 
81
  <h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>
82
 
83
+ <h3>The need for a more challenging leaderboard</h3>
84
 
85
  <p>
86
  Over the past year, the benchmarks we were using got overused/saturated:
87
  </p>
88
 
89
+ <div class="l-body">
90
  <figure><img src="assets/images/saturation.png"/></figure>
91
  <div id="saturation"></div>
92
  </div>
 
146
  <h3>Reporting a fairer average for ranking: using normalized scores</h3>
147
 <p>We decided to change the final grade for the model. Instead of summing each benchmark’s output score, we normalize these scores between the random baseline (0 points) and the maximal possible score (100 points). We then average all normalized scores to get the final average score and compute the final rankings. For example, in a benchmark with two choices per question, a random baseline will get 50 points (out of 100). If you use a random number generator, you will thus likely get around 50 on this evaluation. This means that scores are effectively always between 50 (the lowest score you can reasonably get if the benchmark is not adversarial) and 100. We therefore rescale the range so that a raw score of 50 becomes a normalized score of 0. Note that for generative evaluations (like IFEval or MATH), this changes nothing.</p>
148
 
149
+ <div class="l-body">
150
+ <!--todo: if you use an interactive visualisation instead of a plot,
151
+ replace the class `l-body` by `main-plot-container` and import your interactive plot in the
152
+ below div id, while leaving the image as such. -->
153
  <figure><img src="assets/images/normalized_vs_raw_scores.png"/></figure>
154
  <div id="normalisation"></div>
155
  </div>
 
199
 <p>We’ve also decided to remove the FAQ and About tabs from the Leaderboard, as we noticed that a number of users were not finding them and they were crowding the interface. They now live on their own dedicated documentation page, which you can find here!</p>
200
 <p>For version 2, we chose to start by initializing the leaderboard with only the maintainers’ choice models. But as always, submissions are open!</p>
201
 
202
+ <h2>New leaderboard, new results!</h2>
203
+
204
+ <h3>What about the rankings?</h3>
205
+
206
 <p>When comparing the top 10 of the Open LLM Leaderboard v2 and v1, 5 models keep a relatively stable ranking: Meta’s Llama3-70B (both instruct and base versions), 01-ai’s Yi-1.5-34B (chat version), Cohere’s Command R+ model, and lastly Smaug-72B from AbacusAI.</p>
207
+ <table>
208
+ <tr>
209
+ <th>Rank</th>
210
+ <th>Leaderboard v1</th>
211
+ <th>Leaderboard v2</th>
212
+ </tr>
213
+ <tr>
214
+ <td>⭐</td>
215
+ <td><b>abacusai/Smaug-72B-v0.1</b></td>
216
+ <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
217
+ </tr>
218
+ <tr>
219
+ <td>2</td>
220
+ <td><b>meta-llama/Meta-Llama-3-70B-Instruct</b></td>
221
+ <td><em>microsoft/Phi-3-medium-4k-instruct</em></td>
222
+ </tr>
223
+ <tr>
224
+ <td>3</td>
225
+ <td><b>abacusai/Smaug-34B-v0.1</b></td>
226
+ <td>01-ai/Yi-1.5-34B-Chat</td>
227
+ </tr>
228
+ <tr>
229
+ <td>4</td>
230
+ <td>mlabonne/AlphaMonarch-7B</td>
231
+ <td><b>abacusai/Smaug-72B-v0.1</b></td>
232
+ </tr>
233
+ <tr>
234
+ <td>5</td>
235
+ <td>mlabonne/Beyonder-4x7B-v3</td>
236
+ <td><b>CohereForAI/c4ai-command-r-plus</b></td>
237
+ </tr>
238
+ <tr>
239
+ <td>6</td>
240
+ <td><b>01-ai/Yi-1.5-34B-Chat</b></td>
241
+ <td>Qwen/Qwen1.5-110B-Chat</td>
242
+ </tr>
243
+ <tr>
244
+ <td>7</td>
245
+ <td><b>CohereForAI/c4ai-command-r-plus</b></td>
246
+ <td>NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO</td>
247
+ </tr>
248
+ <tr>
249
+ <td>8</td>
250
+ <td>upstage/SOLAR-10.7B-Instruct-v1.0</td>
251
+ <td><b>meta-llama/Meta-Llama-3-70B</b></td>
252
+ </tr>
253
+ <tr>
254
+ <td>9</td>
255
+ <td><b>meta-llama/Meta-Llama-3-70B</b></td>
256
+ <td>01-ai/Yi-1.5-9B-Chat</td>
257
+ </tr>
258
+ <tr>
259
+ <td>10</td>
260
+ <td>01-ai/Yi-1.5-34B</td>
261
+ <td>01-ai/Yi-1.5-34B-32K</td>
262
+ </tr>
263
+ </table>
264
+ <p>We’ve been particularly impressed by Llama-70B-instruct, which is the best model across many evaluations (though it scores 15 points lower than its base counterpart on GPQA - does instruct tuning remove knowledge?).</p>
265
+
266
+ <p>Interestingly, a new challenger climbed the ranks to arrive in 2nd place despite its smaller size: Phi-3-medium-4k-instruct, with only 14B parameters but performance equivalent to models 2 to 4 times its size.</p>
267
+
268
+ <p>We also provide the most important top and bottom ranking changes.</p>
269
+
270
+ <div class="l-body">
271
+ <figure><img src="assets/images/ranking_top10_bottom10.png"/></figure>
272
+ <div id="ranking"></div>
273
+ </div>
274
+
275
+ <h3>Which evaluations should you pay most attention to?</h3>
276
+ <p>Depending on your use case, you should look at different aspects of the leaderboard. The overall ranking will tell you which model is better on average, but you could be interested in specific capabilities instead.</p>
277
 
278
+ <p>For example, our different evaluation results are not all correlated with one another, which is expected.</p>
279
 
280
+ <div class="l-body">
281
+ <figure><img src="assets/images/v2_correlation_heatmap.png"/></figure>
282
+ <div id="heatmap"></div>
283
+ </div>
284
 
285
+ <p>MMLU-Pro, BBH and ARC-Challenge are well correlated with one another, and all three are known to correlate well with human preference (as they tend to align with human judgment on LMSys’s Chatbot Arena).</p>
286
 
287
+ <p>IFEval is also linked to chat-related capabilities, since it investigates whether models can follow precise instructions or not. However, contrary to the others, its format favors chat or instruction-tuned models, with pretrained models having a harder time performing as well.</p>
288
 
289
+ <div class="l-body">
290
+ <figure><img src="assets/images/ifeval_score_per_model_type.png"/></figure>
291
+ <div id="ifeval"></div>
292
+ </div>
293
 
294
 
295
+ <p>If you are more interested in knowledge than alignment with human preference, the most relevant evaluations for you would be MMLU-Pro and GPQA.</p>
296
 
297
+ <div class="l-body">
298
+ <figure><img src="assets/images/v2_fn_of_mmlu.png"/></figure>
299
+ <div id="mmlu"></div>
300
+ </div>
301
 
302
 
303
+ <p>Both MMLU-PRO scores (in orange) and GPQA scores (in yellow) are reasonably correlated with reference MMLU scores from the Open LLM Leaderboard v1. However, since GPQA is much harder, the scores are overall much lower.</p>
304
 
305
+ <div class="l-body">
306
+ <figure><img src="assets/images/math_fn_gsm8k.png"/></figure>
307
+ <div id="math"></div>
308
+ </div>
309
 
310
+ <p>MATH-Lvl5 is, obviously, interesting for people concerned with math capabilities. Its results are correlated with GSM8K, except for some outliers. In the green box are models which scored 0 on GSM8K in the first leaderboard but now have good scores on MATH-Lvl5 (mostly models from 01-ai) - they were likely penalized by the previous format and stop tokens. In the red box are models which scored high on GSM8K but are now at 0 on MATH-Lvl5. From our current observations, these appear to be mostly chat versions of base models (where the base models score higher on MATH!). This seems to imply that some chat tuning can impair math capabilities, from our observations by making models exceedingly verbose.</p>
311
 
312
+ <p>MuSR, our last evaluation, is particularly interesting for long-context models. We’ve observed that the best performers are models with context sizes of 10K tokens and above, and it seems discriminative enough to specifically target long-context reasoning.</p>
313
 
314
  <h2>What’s next?</h2>
315
  <p>Much like the v1 drove model development during the last year, especially for the community, we hope that the v2 will be a cornerstone of model evaluations.</p>
316
  <p>You’ll still be able to find all the v1 results in the <a href="https://huggingface.co/open-llm-leaderboard-old">Open LLM Leaderboard Archive</a>, and we are preparing an in depth blog about what we learned while taking care of the leaderboard!</p>
317
 
318
+ <div class="l-body">
319
  <figure><img src="assets/images/timewise_analysis_full.png"/></figure>
320
  <div id="timewise"></div>
321
  </div>
 
335
  const article = document.querySelector('d-article');
336
  const toc = document.querySelector('d-contents');
337
  if (toc) {
338
+ const headings = article.querySelectorAll('h2, h3, h4');
339
  let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
340
  let prevLevel = 0;
341
 
 
347
  el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
348
  const link = '<a target="_self" href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
349
 
350
+ const level = el.tagName === 'H2' ? 0 : (el.tagName === 'H3' ? 1 : 2);
351
  while (prevLevel < level) {
352
  ToC += '<ul>'
353
  prevLevel++;