Commit afd9528 by Clémentine (parent: 00663b9)
added iframes to visualizers
- dist/index.html +7 -2
- src/index.html +7 -2
dist/index.html
CHANGED
@@ -115,7 +115,7 @@
115   <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a particularly interesting dataset: it tests the capability of models to follow explicit instructions, such as “include keyword x” or “use format y”. Models are scored on how strictly they follow the formatting instructions, rather than on the content they generate, which allows the use of strict and rigorous metrics.</p>
116   <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, chosen because they 1) use objective metrics, 2) are hard (language models did not originally outperform human baselines on them), and 3) contain enough samples to be statistically significant. The tasks cover multistep arithmetic and algorithmic reasoning (understanding boolean expressions, SVG for geometric shapes, etc.), language understanding (sarcasm detection, name disambiguation, etc.), and some world knowledge. Performance on BBH has, on average, correlated very well with human preference. We expect this dataset to provide interesting insights into specific model capabilities.</p>
117
118 -
119
120   <h3>Why did we choose these subsets?</h3>
121   <p>In summary, our criteria were:</p>
@@ -172,6 +172,11 @@
172   <p>For the new version of the Open LLM Leaderboard, we have therefore worked together with the amazing EleutherAI team (notably Hailey Schoelkopf; huge kudos!) to update the harness.</p>
173   <p>On the features side, we added support in the harness for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
174   <p>On the task side, we took a couple of weeks to manually and thoroughly check all implementations and generations, and to fix the problems we observed, such as inconsistent few-shot samples and overly restrictive end-of-sentence tokens. We created specific configuration files for the leaderboard task implementations, and we are now adding a test suite to ensure that evaluation results for the leaderboard tasks remain stable over time.</p>
175   <p>This should allow us to keep our version up to date with new features added in the future!</p>
176   <p>Enough said on the leaderboard backend and metrics; now let’s turn to the models and model selection/submission.</p>
177
@@ -202,7 +207,7 @@
202
203   <h3>Better and simpler interface</h3>
204   <p>If you’re among our regular users, you may have noticed over the last month that our front end became much faster.</p>
205 - <p>This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard <code>gradio</code> component! It loads data client side, which makes any column selection or search virtually instantaneous. It’s also a component that you can reuse in your own leaderboard!</p>
206   <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
207
208   <h2>New leaderboard, new results!</h2>
115   <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a particularly interesting dataset: it tests the capability of models to follow explicit instructions, such as “include keyword x” or “use format y”. Models are scored on how strictly they follow the formatting instructions, rather than on the content they generate, which allows the use of strict and rigorous metrics.</p>
116   <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, chosen because they 1) use objective metrics, 2) are hard (language models did not originally outperform human baselines on them), and 3) contain enough samples to be statistically significant. The tasks cover multistep arithmetic and algorithmic reasoning (understanding boolean expressions, SVG for geometric shapes, etc.), language understanding (sarcasm detection, name disambiguation, etc.), and some world knowledge. Performance on BBH has, on average, correlated very well with human preference. We expect this dataset to provide interesting insights into specific model capabilities.</p>
117
118 + <iframe src="https://open-llm-leaderboard/sample_viewer.hf.space"></iframe>
119
120   <h3>Why did we choose these subsets?</h3>
121   <p>In summary, our criteria were:</p>
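The strict, format-only scoring that IFEval enables can be sketched with a toy checker (an illustrative sketch only, not the actual IFEval implementation; the instruction checks, responses, and pass criteria below are made up for demonstration):

```python
# Toy IFEval-style checker: score *formatting* instructions only,
# ignoring the content of the answer (not the real IFEval code).

def check_include_keyword(response: str, keyword: str) -> bool:
    # "include keyword x": the keyword must appear verbatim.
    return keyword in response

def check_max_words(response: str, limit: int) -> bool:
    # "answer in at most N words": count whitespace-separated tokens.
    return len(response.split()) <= limit

def strict_accuracy(responses, checks) -> float:
    # A response passes only if *every* instruction attached to it is followed,
    # which is what makes the metric strict and unambiguous.
    passed = sum(all(check(r) for check in instr) for r, instr in zip(responses, checks))
    return passed / len(responses)

responses = ["The answer, in short, is 42.", "A very long rambling answer " * 10]
checks = [
    [lambda r: check_include_keyword(r, "42"), lambda r: check_max_words(r, 10)],
    [lambda r: check_max_words(r, 5)],
]
print(strict_accuracy(responses, checks))  # 0.5
```

Because only the form of the output is graded, no judge model or fuzzy matching is needed: each check is a deterministic pass/fail.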
172   <p>For the new version of the Open LLM Leaderboard, we have therefore worked together with the amazing EleutherAI team (notably Hailey Schoelkopf; huge kudos!) to update the harness.</p>
173   <p>On the features side, we added support in the harness for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
174   <p>On the task side, we took a couple of weeks to manually and thoroughly check all implementations and generations, and to fix the problems we observed, such as inconsistent few-shot samples and overly restrictive end-of-sentence tokens. We created specific configuration files for the leaderboard task implementations, and we are now adding a test suite to ensure that evaluation results for the leaderboard tasks remain stable over time.</p>
175 +
176 + <iframe src="https://open-llm-leaderboard/GenerationVisualizer.hf.space"></iframe>
177 +
178 + <p>You can explore the visualizer we used here!</p>
179 +
180   <p>This should allow us to keep our version up to date with new features added in the future!</p>
181   <p>Enough said on the leaderboard backend and metrics; now let’s turn to the models and model selection/submission.</p>
182
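The kind of stability test suite the post describes can be sketched as a pinned-reference regression check (a minimal sketch under assumptions: the task names, scores, and tolerance are hypothetical, and the real suite lives in the evaluation harness, not here):

```python
import math

# Hypothetical pinned reference scores for leaderboard tasks; a real suite
# would load these from a frozen results file rather than hard-coding them.
REFERENCE_SCORES = {"ifeval": 0.512, "bbh": 0.447}

def scores_are_stable(new_scores: dict, reference: dict, tol: float = 1e-3) -> bool:
    # Every reference task must be present and match within tolerance, so a
    # harness update that silently shifts results fails the check.
    return all(
        task in new_scores and math.isclose(new_scores[task], ref, abs_tol=tol)
        for task, ref in reference.items()
    )

print(scores_are_stable({"ifeval": 0.5121, "bbh": 0.447}, REFERENCE_SCORES))  # True
print(scores_are_stable({"ifeval": 0.530, "bbh": 0.447}, REFERENCE_SCORES))   # False
```

Run against every leaderboard task after each harness upgrade, a check like this catches regressions in few-shot sampling or stop-token handling before they reach published scores.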
207
208   <h3>Better and simpler interface</h3>
209   <p>If you’re among our regular users, you may have noticed over the last month that our front end became much faster.</p>
210 + <p>This is thanks to the work of the Gradio team, notably <a href="https://huggingface.co/freddyaboulton">Freddy Boulton</a>, who developed a Leaderboard <code>gradio</code> component! It loads data client side, which makes any column selection or search virtually instantaneous. It’s also a <a href="https://huggingface.co/spaces/freddyaboulton/gradio_leaderboard">component</a> that you can reuse in your own leaderboard!</p>
211   <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
212
213   <h2>New leaderboard, new results!</h2>
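Why client-side data makes search and column selection feel instantaneous can be illustrated with a toy in-memory table (an illustrative sketch, not the Gradio component's code; the model names and scores are made up):

```python
# Once the rows already live in memory on the client, search and column
# selection are plain filters with no server round-trip involved.

ROWS = [
    {"model": "org-a/model-7b", "ifeval": 0.51, "bbh": 0.44},
    {"model": "org-b/model-13b", "ifeval": 0.58, "bbh": 0.49},
    {"model": "org-a/chatty-7b", "ifeval": 0.55, "bbh": 0.41},
]

def search(rows, query: str):
    # Case-insensitive substring search over the model name,
    # like the leaderboard search box.
    return [r for r in rows if query.lower() in r["model"].lower()]

def select_columns(rows, columns):
    # Keep only the chosen columns, like the column selector.
    return [{c: r[c] for c in columns} for r in rows]

hits = search(ROWS, "org-a")
print(select_columns(hits, ["model", "ifeval"]))
# [{'model': 'org-a/model-7b', 'ifeval': 0.51}, {'model': 'org-a/chatty-7b', 'ifeval': 0.55}]
```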
src/index.html
CHANGED
@@ -115,7 +115,7 @@
115   <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a particularly interesting dataset: it tests the capability of models to follow explicit instructions, such as “include keyword x” or “use format y”. Models are scored on how strictly they follow the formatting instructions, rather than on the content they generate, which allows the use of strict and rigorous metrics.</p>
116   <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, chosen because they 1) use objective metrics, 2) are hard (language models did not originally outperform human baselines on them), and 3) contain enough samples to be statistically significant. The tasks cover multistep arithmetic and algorithmic reasoning (understanding boolean expressions, SVG for geometric shapes, etc.), language understanding (sarcasm detection, name disambiguation, etc.), and some world knowledge. Performance on BBH has, on average, correlated very well with human preference. We expect this dataset to provide interesting insights into specific model capabilities.</p>
117
118 -
119
120   <h3>Why did we choose these subsets?</h3>
121   <p>In summary, our criteria were:</p>
@@ -172,6 +172,11 @@
172   <p>For the new version of the Open LLM Leaderboard, we have therefore worked together with the amazing EleutherAI team (notably Hailey Schoelkopf; huge kudos!) to update the harness.</p>
173   <p>On the features side, we added support in the harness for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
174   <p>On the task side, we took a couple of weeks to manually and thoroughly check all implementations and generations, and to fix the problems we observed, such as inconsistent few-shot samples and overly restrictive end-of-sentence tokens. We created specific configuration files for the leaderboard task implementations, and we are now adding a test suite to ensure that evaluation results for the leaderboard tasks remain stable over time.</p>
175   <p>This should allow us to keep our version up to date with new features added in the future!</p>
176   <p>Enough said on the leaderboard backend and metrics; now let’s turn to the models and model selection/submission.</p>
177
@@ -202,7 +207,7 @@
202
203   <h3>Better and simpler interface</h3>
204   <p>If you’re among our regular users, you may have noticed over the last month that our front end became much faster.</p>
205 - <p>This is thanks to the work of the Gradio team, notably Freddy Boulton, who developed a Leaderboard <code>gradio</code> component! It loads data client side, which makes any column selection or search virtually instantaneous. It’s also a component that you can reuse in your own leaderboard!</p>
206   <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
207
208   <h2>New leaderboard, new results!</h2>
115   <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a particularly interesting dataset: it tests the capability of models to follow explicit instructions, such as “include keyword x” or “use format y”. Models are scored on how strictly they follow the formatting instructions, rather than on the content they generate, which allows the use of strict and rigorous metrics.</p>
116   <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, chosen because they 1) use objective metrics, 2) are hard (language models did not originally outperform human baselines on them), and 3) contain enough samples to be statistically significant. The tasks cover multistep arithmetic and algorithmic reasoning (understanding boolean expressions, SVG for geometric shapes, etc.), language understanding (sarcasm detection, name disambiguation, etc.), and some world knowledge. Performance on BBH has, on average, correlated very well with human preference. We expect this dataset to provide interesting insights into specific model capabilities.</p>
117
118 + <iframe src="https://open-llm-leaderboard/sample_viewer.hf.space"></iframe>
119
120   <h3>Why did we choose these subsets?</h3>
121   <p>In summary, our criteria were:</p>
172   <p>For the new version of the Open LLM Leaderboard, we have therefore worked together with the amazing EleutherAI team (notably Hailey Schoelkopf; huge kudos!) to update the harness.</p>
173   <p>On the features side, we added support in the harness for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
174   <p>On the task side, we took a couple of weeks to manually and thoroughly check all implementations and generations, and to fix the problems we observed, such as inconsistent few-shot samples and overly restrictive end-of-sentence tokens. We created specific configuration files for the leaderboard task implementations, and we are now adding a test suite to ensure that evaluation results for the leaderboard tasks remain stable over time.</p>
175 +
176 + <iframe src="https://open-llm-leaderboard/GenerationVisualizer.hf.space"></iframe>
177 +
178 + <p>You can explore the visualizer we used here!</p>
179 +
180   <p>This should allow us to keep our version up to date with new features added in the future!</p>
181   <p>Enough said on the leaderboard backend and metrics; now let’s turn to the models and model selection/submission.</p>
182
207
208   <h3>Better and simpler interface</h3>
209   <p>If you’re among our regular users, you may have noticed over the last month that our front end became much faster.</p>
210 + <p>This is thanks to the work of the Gradio team, notably <a href="https://huggingface.co/freddyaboulton">Freddy Boulton</a>, who developed a Leaderboard <code>gradio</code> component! It loads data client side, which makes any column selection or search virtually instantaneous. It’s also a <a href="https://huggingface.co/spaces/freddyaboulton/gradio_leaderboard">component</a> that you can reuse in your own leaderboard!</p>
211   <p>We’ve also decided to move the FAQ and About tabs to their own dedicated documentation page!</p>
212
213   <h2>New leaderboard, new results!</h2>