eduardo-alvarez committed on
Commit d050f48
1 Parent(s): 8736d58

Update info/about.py

Files changed (1)
  1. info/about.py +0 -1
info/about.py CHANGED
@@ -25,7 +25,6 @@ Our current benchmarks include:
  - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU (5-shot)</a> - a test measuring a text model's multitask accuracy, covering 57 tasks in fields like elementary mathematics, US history, computer science, law, and more.
  - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA (0-shot)</a> - a test measuring a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
  - <a href="https://arxiv.org/abs/1907.10641" target="_blank"> Winogrande (5-shot)</a> - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- - <a href="https://arxiv.org/abs/2110.14168" target="_blank"> GSM8k (5-shot)</a> - diverse grade school math word problems measuring a model's ability to solve multi-step mathematical reasoning problems.
  For all these evaluations, a higher score is better. We've chosen these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings. In the future, we plan to add domain-specific benchmarks to further evaluate our models.

  We run an adapted version of the benchmark code specifically designed to run the EleutherAI Harness benchmarks on Gaudi processors.
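
For reference, the tasks named in the context lines above can be launched with the upstream EleutherAI lm-evaluation-harness Python API. The sketch below is illustrative only and assumes the public v0.4-style `lm_eval.simple_evaluate` entry point, v0.4 task names, the plain `hf` backend, and a placeholder checkpoint; the Gaudi-adapted runner this repository actually uses is not shown here and may differ.

```python
# Illustrative sketch, not the Gaudi-adapted runner: upstream
# EleutherAI lm-evaluation-harness (v0.4-style API).
import lm_eval

MODEL_ARGS = "pretrained=mistralai/Mistral-7B-v0.1"  # placeholder checkpoint

# MMLU and Winogrande are reported 5-shot in the text above.
five_shot = lm_eval.simple_evaluate(
    model="hf",
    model_args=MODEL_ARGS,
    tasks=["mmlu", "winogrande"],
    num_fewshot=5,
)

# TruthfulQA is reported 0-shot (the Harness itself prepends 6 Q/A pairs).
zero_shot = lm_eval.simple_evaluate(
    model="hf",
    model_args=MODEL_ARGS,
    tasks=["truthfulqa_mc2"],
    num_fewshot=0,
)

print(five_shot["results"])
print(zero_shot["results"])
```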
 