Commit bcf8b15 (parent: 3685542): Minor grammar changes

src/index.html CHANGED (+14 −17)
@@ -55,28 +55,25 @@
 <d-contents>
 </d-contents>
 
-<p>Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago
-It was a nearly
-just using optimized prompts or evaluation setup to give best chances to the models. They therefore decided to create a place where reference models would be
-evaluated in the exact same setup (same questions, asked in the same order,
+<p>Evaluating and comparing LLMs is hard. Our RLHF team realized this a year ago when they wanted to reproduce and compare results from several published models.
+It was a nearly impossible task: scores in papers or marketing releases were given without any reproducible code, sometimes doubtful, but in most cases,
+just using optimized prompts or evaluation setup to give the best chances to the models. They therefore decided to create a place where reference models would be
+evaluated in the exact same setup (same questions, asked in the same order, etc.) to gather completely reproducible and comparable results; and that’s how the
 Open LLM Leaderboard was born!</p>
 
-<p> Following a series of highly
+<p> Following a series of highly visible model releases, it became a widely used resource in the ML community and beyond, visited by more than 2 million unique people over the last 10 months.</p>
 
-<p>
+<p> Around 300,000 community members use and collaborate on it monthly through submissions and discussions, usually to: </p>
 <ul>
-<li> Find state-of-the-art open
-<li> Evaluate their
+<li> Find state-of-the-art open-source releases as the leaderboard provides reproducible scores separating marketing fluff from actual progress in the field.</li>
+<li> Evaluate their work, be it pretraining or finetuning, comparing methods in the open and to the best existing models, and earning public recognition.</li>
 </ul>
 
-<p> However, with success, both in the leaderboard and the increasing performances of the models came challenges
-
-<p>Here is why we think a new leaderboard was needed 👇</p>
-
-
-<h2>Harder, better, faster, stronger: Introducing the Leaderboard v2</h2>
+<p> However, with success, both in the leaderboard and the increasing performances of the models came challenges. After one intense year and a lot of community feedback, we thought it was time for an upgrade! Therefore, we’re introducing the Open LLM Leaderboard v2!</p>
 
+<p>Here is why we think a new leaderboard is needed 👇</p>
 
+<h2>Harder, better, faster, stronger: Introducing the LLM Leaderboard v2</h2>
 
 <h3>The need for a more challenging leaderboard</h3>
 
@@ -91,9 +88,9 @@
 </div>
 
 <ol>
-<li>They became too easy for models. For instance on HellaSwag, MMLU and ARC
-<li>Some newer models also showed signs of contamination. By this we mean that models were possibly trained on benchmark data or on data very similar to benchmark data. As such, some scores stopped reflecting general
-<li>Some benchmarks contained errors
+<li>They became too easy for models. For instance, models on HellaSwag, MMLU, and ARC are now reaching baseline human performance, a phenomenon called saturation.</li>
+<li>Some newer models also showed signs of contamination. By this, we mean that models were possibly trained on benchmark data or on data very similar to benchmark data. As such, some scores stopped reflecting the general performance of the model and started to overfit on some evaluation datasets instead of reflecting the more general performance of the task being tested. This was, in particular, the case for GSM8K and TruthfulQA, which were included in some instruction fine-tuning sets.</li>
+<li>Some benchmarks contained errors. MMLU was recently investigated in depth by several groups, which surfaced mistakes in its responses and proposed new versions. Another example was that GSM8K used a specific end-of-generation token (:), which unfairly pushed down the performance of many verbose models.</li>
 </ol>
 
 <p>We thus chose to completely change the evaluations we are running for the Open LLM Leaderboard v2!</p>