Commit b7060d2
Clémentine committed
Parent(s): bb264aa

change gradio to component instead of iframe

Files changed:
- dist/index.html +6 -2
- src/index.html +6 -2
dist/index.html
CHANGED
@@ -115,7 +115,7 @@
 <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a fairly interesting dataset, which tests the capability of models to clearly follow explicit instructions, such as “include keyword x” or “use format y”. The models are tested on their ability to strictly follow formatting instructions, rather than the actual contents generated, which allows the use of strict and rigorous metrics.</p>
 <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, which 1) use objective metrics, 2) are hard, measured as language models not originally outperforming human baselines, 3) contain enough samples to be statistically significant. They contain multistep arithmetic and algorithmic reasoning (understanding boolean expressions, svg for geometric shapes, etc), language understanding (sarcasm detection, name disambiguation, etc), and some world knowledge. Performance on BBH has been on average very well correlated with human preference. We expect this dataset to provide interesting insights on specific capabilities which could interest people.</p>
 
-<
+<gradio-app src="https://open-llm-leaderboard-sample_viewer.hf.space"></gradio-app>
 
 <h3>Why did we choose these subsets?</h3>
 <p>In summary, our criterion were: </p>
@@ -173,7 +173,7 @@
 <p>Features side, we added in the harness support for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
 <p>On the task side, we took a couple of weeks to manually check all implementations and generations thoroughly, and fix the problems we observed with inconsistent few shot samples, too restrictive end of sentence tokens, etc. We created specific configuration files for the leaderboard task implementations, and are now working on adding a test suite to make sure that evaluation results stay unchanging through time for the leaderboard tasks.</p>
 
-<
+<gradio-app src="https://open-llm-leaderboard-GenerationVisualizer.hf.space"></gradio-app>
 
 <p>You can explore the visualiser we used here!</p>
 
@@ -461,5 +461,9 @@
 <script>
 includeHTML();
 </script>
+<script
+  type="module"
+  src="https://gradio.s3-us-west-2.amazonaws.com/4.36.0/gradio.js"
+></script>
 </body>
 </html>
src/index.html
CHANGED
@@ -115,7 +115,7 @@
 <p>🤝 <strong>IFEval</strong> (Instruction Following Evaluation, <a href="https://arxiv.org/abs/2311.07911">paper</a>). IFEval is a fairly interesting dataset, which tests the capability of models to clearly follow explicit instructions, such as “include keyword x” or “use format y”. The models are tested on their ability to strictly follow formatting instructions, rather than the actual contents generated, which allows the use of strict and rigorous metrics.</p>
 <p>🧮 🤝 <strong>BBH</strong> (Big Bench Hard, <a href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset, which 1) use objective metrics, 2) are hard, measured as language models not originally outperforming human baselines, 3) contain enough samples to be statistically significant. They contain multistep arithmetic and algorithmic reasoning (understanding boolean expressions, svg for geometric shapes, etc), language understanding (sarcasm detection, name disambiguation, etc), and some world knowledge. Performance on BBH has been on average very well correlated with human preference. We expect this dataset to provide interesting insights on specific capabilities which could interest people.</p>
 
-<
+<gradio-app src="https://open-llm-leaderboard-sample_viewer.hf.space"></gradio-app>
 
 <h3>Why did we choose these subsets?</h3>
 <p>In summary, our criterion were: </p>
@@ -173,7 +173,7 @@
 <p>Features side, we added in the harness support for delta weights (LoRA finetuning/adaptation of models), a logging system compatible with the leaderboard, and the highly requested use of chat templates for evaluation.</p>
 <p>On the task side, we took a couple of weeks to manually check all implementations and generations thoroughly, and fix the problems we observed with inconsistent few shot samples, too restrictive end of sentence tokens, etc. We created specific configuration files for the leaderboard task implementations, and are now working on adding a test suite to make sure that evaluation results stay unchanging through time for the leaderboard tasks.</p>
 
-<
+<gradio-app src="https://open-llm-leaderboard-GenerationVisualizer.hf.space"></gradio-app>
 
 <p>You can explore the visualiser we used here!</p>
 
@@ -461,5 +461,9 @@
 <script>
 includeHTML();
 </script>
+<script
+  type="module"
+  src="https://gradio.s3-us-west-2.amazonaws.com/4.36.0/gradio.js"
+></script>
 </body>
 </html>
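For context, the embedding pattern this commit adopts can be sketched as a minimal standalone page. This is a sketch, not part of the commit: the Space URL and the pinned Gradio 4.36.0 script URL are taken from the diff above, everything else is illustrative.

```html
<!-- Minimal sketch: embedding a Hugging Face Space as a Gradio web
     component instead of an iframe. Loading gradio.js once (as a module)
     registers the <gradio-app> custom element for the whole page. -->
<!DOCTYPE html>
<html>
  <head>
    <script
      type="module"
      src="https://gradio.s3-us-west-2.amazonaws.com/4.36.0/gradio.js"
    ></script>
  </head>
  <body>
    <!-- Each Space is embedded by pointing src at its *.hf.space URL -->
    <gradio-app src="https://open-llm-leaderboard-sample_viewer.hf.space"></gradio-app>
  </body>
</html>
```

Compared with an iframe, the component renders the app inline and sizes itself to its content, which is presumably why the commit swaps one for the other.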