Nathan Habib committed on
Commit
0c1bf43
2 Parent(s): 7375a0d dae2e6c

Merge branch 'main' of hf.co:spaces/open-llm-leaderboard/blog

Files changed (1)
  1. src/index.html +2 -1
src/index.html CHANGED
@@ -145,7 +145,7 @@
  <aside>
  <p><em>Should we have included more evaluations?</em></p>
 
- <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. There are many other evaluations which we wanted to include (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
+ <p>We chose to focus on a limited number of evaluations to keep the computation time realistic. We wanted to include many other evaluations (MTBench, AGIEval, DROP, etc), but we are, in the end, still compute constrained - so to keep the evaluation budgets under control we ranked evals according to our above criterion and kept the top ranking benchmarks. This is also why we didn’t select any benchmark requiring the use of another model as a judge.</p>
  </aside>
 
  <p>Selecting new benchmarks is not the whole story, and we also pushed several other interesting improvements to the leaderboard that we’ll now briefly cover.</p>
@@ -207,6 +207,7 @@
 
  <h2>New leaderboard, new results!</h2>
  <p>We’ve started with adding and evaluating the models in the “maintainer’s highlights” section (cf. above) and are looking forward to the community submitting their new models to this new version of the leaderboard!!</p>
+ <aside>As the cluster has been extremely full, models of more than 140B parameters (such as Falcon-180B and BLOOM) will be run a bit later.</aside>
 
  <h3>What do the rankings look like?</h3>