Commit d80af64 by Clémentine
Parent: 05d8ce4

removed arc

Files changed:
- dist/index.html +3 -3
- src/index.html +3 -3
dist/index.html CHANGED

@@ -123,7 +123,7 @@
 <li>Evaluation quality:</li>
 <ul>
 <li>Human review of dataset: MMLU-Pro and GPQA</li>
-<li>Widespread use in the academic and/or open source community:
+<li>Widespread use in the academic and/or open source community: BBH, IFeval, MATH</li>
 </ul>
 <li>Reliability and fairness of metrics:</li>
 <ul>
@@ -137,7 +137,7 @@
 </ul>
 <li>Measuring model skills that are interesting for the community: </li>
 <ul>
-<li>Correlation with human preferences: BBH, IFEval
+<li>Correlation with human preferences: BBH, IFEval</li>
 <li>Evaluation of a specific capability we are interested in: MATH, MuSR</li>
 </ul>
 </ol>
@@ -305,7 +305,7 @@
 </div>
 </div>

-<p>As you can see, MMLU-Pro
+<p>As you can see, MMLU-Pro and BBH are rather well correlated. As it’s been also noted by other teams, these benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
 <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>

 <div class="main-plot-container">
src/index.html CHANGED

@@ -123,7 +123,7 @@
 <li>Evaluation quality:</li>
 <ul>
 <li>Human review of dataset: MMLU-Pro and GPQA</li>
-<li>Widespread use in the academic and/or open source community:
+<li>Widespread use in the academic and/or open source community: BBH, IFeval, MATH</li>
 </ul>
 <li>Reliability and fairness of metrics:</li>
 <ul>
@@ -137,7 +137,7 @@
 </ul>
 <li>Measuring model skills that are interesting for the community: </li>
 <ul>
-<li>Correlation with human preferences: BBH, IFEval
+<li>Correlation with human preferences: BBH, IFEval</li>
 <li>Evaluation of a specific capability we are interested in: MATH, MuSR</li>
 </ul>
 </ol>
@@ -305,7 +305,7 @@
 </div>
 </div>

-<p>As you can see, MMLU-Pro
+<p>As you can see, MMLU-Pro and BBH are rather well correlated. As it’s been also noted by other teams, these benchmarks are also quite correlated with human preference (for instance they tend to align with human judgment on LMSys’s chatbot arena).</p>
 <p>Another of our benchmarks, IFEval, is targeting chat capabilities. It investigates whether models can follow precise instructions or not. However, the format used in this benchmark tends to favor chat and instruction tuned models, with pretrained models having a harder time reaching high performances.</p>

 <div class="main-plot-container">