Commit fb5b21a by osanseviero
Parent(s): 27dfd39

Minor changes (WIP)

Files changed (1):
  1. src/index.html +11 -12
src/index.html CHANGED
@@ -96,21 +96,20 @@
     <p>We thus chose to completely change the evaluations we are running for the Open LLM Leaderboard v2!</p>

     <h3>Rebooting our evaluation selection</h3>
-    <p>We started looking for new benchmarks with uncontaminated, high quality datasets, making use of reliable metrics, and measuring model capabilities of interest.</p>
-    <p>We decided to cover the following general tasks: knowledge testing (📚), reasoning on short and long contexts (💭), complex mathematical abilities, and tasks well correlated with human preference (🤝), like instruction following.</p>
-    <p>We cover these tasks with 6 benchmarks. Let us present them briefly:</p>
+    <p>We started looking for new benchmarks with uncontaminated, high-quality datasets, using reliable metrics and measuring model capabilities of interest.</p> <p>We decided to cover the following general tasks: knowledge testing (📚), reasoning on short and long contexts (💭), complex mathematical abilities, and tasks well correlated with human preference (🤝), like instruction following.</p>
+    <p>We cover these tasks with six benchmarks. Let us present them briefly:</p>

-    <p>📚 <strong>MMLU-Pro</strong> (Massive Multitask Language Understanding - Pro version, <a href="https://arxiv.org/abs/2406.01574">paper</a>). MMLU-Pro is a refined version of the MMLU dataset. MMLU has been the reference multichoice knowledge dataset. However, recent research showed that it is both noisy (some questions are unanswerable) and now too easy (through the evolution of model capabilities as well as the increase of contamination). MMLU-Pro presents the models with 10 choices instead of 4, requires reasoning on more questions, and has been expertly reviewed to reduce the amount of noise. It is higher quality than the original, and (for the moment) harder.</p>
-    <p>📚 <strong>GPQA</strong> (Google-Proof Q&amp;A Benchmark, <a href="https://arxiv.org/abs/2311.12022">paper</a>). GPQA is an extremely hard knowledge dataset, where questions were designed by domain experts in their field (PhD-level in biology, physics, chemistry, ) to be hard to answer by laypersons, but (relatively) easy for experts. Questions have gone through several rounds of validation to ensure both difficulty and factuality. The dataset is also only accessible through gating mechanisms, which should reduce the risks of contamination. (This is also why we don’t provide a plain text example from this dataset, as requested by the authors in the paper).</p>
-    <p>💭<strong>MuSR</strong> (Multistep Soft Reasoning, <a href="https://arxiv.org/abs/2310.16049">paper</a>). MuSR is a very fun new dataset, made of algorithmically generated complex problems of around 1K words in length. Problems are either murder mysteries, object placement questions, or team allocation optimizations. To solve these, the models must combine reasoning and very long range context parsing. Few models score better than random performance.</p>
-    <p>🧮 <strong>MATH</strong> (Mathematics Aptitude Test of Heuristics, Level 5 subset, <a href="https://arxiv.org/abs/2103.03874">paper</a>). MATH is a compilation of high-school level competition problems gathered from several sources, formatted consistently using Latex for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only the hardest questions.</p>
-    <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a fairly interesting dataset, which tests the capability of models to clearly follow explicit instructions, such as “include keyword x” or “use format y”. The models are tested on their ability to strictly follow formatting instructions, rather than the actual contents generated, which allows the use of strict and rigorous metrics.</p>
-    <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, which 1) use objective metrics, 2) are hard, measured as language models not originally outperforming human baselines, 3) contain enough samples to be statistically significant. They contain multistep arithmetic and algorithmic reasoning (understanding boolean expressions, svg for geometric shapes, etc), language understanding (sarcasm detection, name disambiguation, etc), and some world knowledge. Performance on BBH has been on average very well correlated with human preference. We expect this dataset to provide interesting insights on specific capabilities which could interest people.</p>
+    <p>📚 <strong>MMLU-Pro</strong> (Massive Multitask Language Understanding - Pro version, <a href="https://arxiv.org/abs/2406.01574">paper</a>). MMLU-Pro is a refined version of the MMLU dataset. MMLU has been the reference multichoice knowledge dataset. However, recent research showed that it is both noisy (some questions are unanswerable) and now too easy (through the evolution of model capabilities and increased contamination). MMLU-Pro presents the models with ten choices instead of four, requires reasoning on more questions, and has been expertly reviewed to reduce the amount of noise. It is of higher quality than the original and harder.</p>
+    <p>📚 <strong>GPQA</strong> (Google-Proof Q&amp;A Benchmark, <a href="https://arxiv.org/abs/2311.12022">paper</a>). GPQA is an extremely hard knowledge dataset, where questions were designed by domain experts in their field (PhD-level in biology, physics, chemistry, etc.) to be hard to answer by laypersons but (relatively) easy for experts. Questions have gone through several rounds of validation to ensure both difficulty and factuality. The dataset is also only accessible through gating mechanisms, which should reduce contamination risks. (This is also why we don’t provide a plain text example from this dataset, as requested by the authors in the paper).</p>
+    <p>💭 <strong>MuSR</strong> (Multistep Soft Reasoning, <a href="https://arxiv.org/abs/2310.16049">paper</a>). MuSR is a very fun new dataset made of algorithmically generated complex problems of around 1K words in length. The problems are either murder mysteries, object placement questions, or team allocation optimizations. To solve these, the models must combine reasoning and very long-range context parsing. Few models score better than random performance.</p>
+    <p>🧮 <strong>MATH</strong> (Mathematics Aptitude Test of Heuristics, Level 5 subset, <a href="https://arxiv.org/abs/2103.03874">paper</a>). MATH is a compilation of high-school-level competition problems gathered from several sources, formatted consistently using LaTeX for equations and Asymptote for figures. Generations must fit a very specific output format. We keep only the hardest questions.</p>
+    <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a fairly interesting dataset that tests the capability of models to clearly follow explicit instructions, such as “include keyword x” or “use format y”. The models are tested on their ability to strictly follow formatting instructions rather than the actual contents generated, allowing strict and rigorous metrics to be used.</p>
+    <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, which 1) use objective metrics, 2) are hard, measured as language models not originally outperforming human baselines, and 3) contain enough samples to be statistically significant. They contain multistep arithmetic and algorithmic reasoning (understanding boolean expressions, SVG for geometric shapes, etc.), language understanding (sarcasm detection, name disambiguation, etc.), and some world knowledge. Performance on BBH has been, on average, well correlated with human preference. We expect this dataset to provide exciting insights into specific capabilities that could interest people.</p>

     <gradio-app src="https://open-llm-leaderboard-sample_viewer.hf.space"></gradio-app>

     <h3>Why did we choose these subsets?</h3>
-    <p>In summary, our criterion were: </p>
+    <p>In summary, our criteria were: </p>
     <ol>
     <li>Evaluation quality:</li>
     <ul>
@@ -119,8 +118,8 @@
     </ul>
     <li>Reliability and fairness of metrics:</li>
     <ul>
-    <li>Multichoice evaluations are in general fair across models.</li>
-    <li>Generative evaluations should either constrain the format very much (like MATH), or use very unambiguous metrics (like IFEval) or post processing (like BBH) to extract the correct answers.</li>
+    <li>Multichoice evaluations are, in general, fair across models.</li>
+    <li>Generative evaluations should either constrain the format very much (like MATH) or use very unambiguous metrics (like IFEval) or post-processing (like BBH) to extract the correct answers.</li>
     </ul>
     <li>General absence of contamination in models as of today:</li>
     <ul>
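
A note on the IFEval-style metrics described above: because only explicit formatting constraints are verified, scoring reduces to unambiguous pass/fail rules. The Python sketch below is purely illustrative; the constraint checkers and function names are made up for the example and are not IFEval's or the leaderboard's actual code.

```python
import json

def check_includes_keyword(response: str, keyword: str) -> bool:
    # "include keyword x": pass iff the keyword literally appears.
    return keyword.lower() in response.lower()

def check_max_words(response: str, limit: int) -> bool:
    # "answer in at most N words": pass iff the word count is within the limit.
    return len(response.split()) <= limit

def check_is_json(response: str) -> bool:
    # "use format y" (here: respond with valid JSON): pass iff it parses.
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def strict_accuracy(responses, per_sample_checks) -> float:
    # A sample passes only if *every* formatting constraint attached to it
    # passes, which is what makes the metric unambiguous.
    passed = sum(all(check(resp) for check in checks)
                 for resp, checks in zip(responses, per_sample_checks))
    return passed / len(responses)

# Example: one response that must mention "leaderboard" and stay under 20 words.
response = "The Open LLM Leaderboard v2 now relies on six harder benchmarks."
checks = [[lambda r: check_includes_keyword(r, "leaderboard"),
           lambda r: check_max_words(r, 20)]]
print(strict_accuracy([response], checks))  # 1.0
```

The strict all-constraints-must-pass aggregation is what keeps the metric rigorous even though the outputs themselves are free-form generations.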
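
Similarly, on the "multichoice evaluations are fair across models" point: a common way to run multichoice benchmarks is to score each candidate answer by the likelihood the model assigns to it and take the argmax, so no free-form generation or answer parsing is involved. A minimal sketch, assuming a generic `score_choice(question, choice)` callable standing in for a real model call (a hypothetical placeholder, not the evaluation harness actually used):

```python
from typing import Callable, List, Tuple

def pick_choice(question: str,
                choices: List[str],
                score_choice: Callable[[str, str], float]) -> int:
    # Score every candidate answer, length-normalizing so longer choices are
    # not penalized simply for containing more tokens, then take the argmax.
    scores = [score_choice(question, c) / max(len(c.split()), 1) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

def accuracy(items: List[Tuple[str, List[str], int]],
             score_choice: Callable[[str, str], float]) -> float:
    # items: (question, candidate choices, index of the gold choice)
    correct = sum(pick_choice(q, choices, score_choice) == gold
                  for q, choices, gold in items)
    return correct / len(items)

# Toy usage with a fake scorer that happens to prefer the right answer;
# a real scorer would sum token log-probabilities from the model.
fake_scorer = lambda q, c: 1.0 if c == "4" else 0.0
items = [("2 + 2 = ?", ["3", "4", "22", "5"], 1)]
print(accuracy(items, fake_scorer))  # 1.0
```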