Nathan Habib committed on
Commit
0c1bf43
2 Parent(s): 7375a0d dae2e6c

Merge branch 'main' of hf.co:spaces/open-llm-leaderboard/blog

Files changed (1)
  1. src/index.html +2 -1
src/index.html CHANGED
@@ -145,7 +145,7 @@
  <aside>
  <p><em>Should we have included more evaluations?</em></p>
 
- <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
+ <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. We wanted to include many other evaluations (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
  </aside>
 
  <p>Selecting new benchmarks is not the whole story, and we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>
@@ -207,6 +207,7 @@
 
  <h2>New leaderboard, new results!</h2>
  <p>We’ve started with adding and evaluating the models in the “maintainer’s highlights” section (cf. above) and are looking forward to the community submitting their new models to this new version of the leaderboard!!</p>
+ <aside>As the cluster has been extremely full, models of more than 140B parameters (such as Falcon-180B and BLOOM) will be run a bit later.</aside>
 
  <h3>What do the rankings look like?</h3>