saattrupdan committed · Commit 79f114e · Parent(s): 5d40291

feat: Add citation

app.py CHANGED
@@ -42,13 +42,12 @@ available](https://scandeval.com).
 The generative models are evaluated using in-context learning with few-shot prompts.
 The few-shot examples are sampled randomly from the training split, and we benchmark
 the models 10 times with bootstrapped test sets and different few-shot examples in each
-iteration. This allows us to better measure the uncertainty of the results.
-
-
-percentage of other models that a model beats on a task
-
-
-model significantly beats another model.
+iteration. This allows us to better measure the uncertainty of the results. We use the
+uncertainty in the radial plot when we compute the win ratios (i.e., the percentage of
+other models that a model beats on a task). Namely, we compute the win ratio as the
+percentage of other models that a model _significantly_ beats on a task, where we use a
+paired t-test with a significance level of 0.05 to determine whether a model
+significantly beats another model.
 
 ## The Benchmark Datasets
 
@@ -104,6 +103,22 @@ classification, we use the probabilities of the answer letter (a, b, c or d) to
 the answer. The datasets in this task are machine translated versions of the
 [HellaSwag](https://rowanzellers.com/hellaswag/) dataset. We use the Matthews
 Correlation Coefficient (MCC) as the evaluation metric.
+
+
+## Citation
+
+If you use the ScandEval benchmark in your work, please cite [the
+paper](https://aclanthology.org/2023.nodalida-1.20):
+
+```
+@inproceedings{nielsen2023scandeval,
+title={ScandEval: A Benchmark for Scandinavian Natural Language Processing},
+author={Nielsen, Dan},
+booktitle={Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
+pages={185--201},
+year={2023}
+}
+```
 """
 
 
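As a side note on the win-ratio computation described in the paragraph added by this commit: a minimal sketch of the idea is given below. This is not the code touched by the commit; the `win_ratios` helper and the `scores` layout (one score per bootstrapped evaluation iteration for each model) are hypothetical, and it uses SciPy's two-sided `ttest_rel` together with the sign of the statistic, whereas the benchmark itself may set up the paired t-test differently.

```python
# Sketch only: turn per-iteration scores into win ratios, where a model "wins"
# against another when a paired t-test says it is significantly better at the
# 0.05 level. Names and data layout here are assumptions, not the app's code.
from itertools import combinations

from scipy.stats import ttest_rel


def win_ratios(scores: dict[str, list[float]], alpha: float = 0.05) -> dict[str, float]:
    """Share of other models that each model significantly beats on a task."""
    wins = {model: 0 for model in scores}
    for model_a, model_b in combinations(scores, 2):
        # Paired t-test, since both models are scored on the same bootstrapped
        # test sets and few-shot examples in each iteration.
        result = ttest_rel(scores[model_a], scores[model_b])
        if result.pvalue < alpha:
            winner = model_a if result.statistic > 0 else model_b
            wins[winner] += 1
    num_others = len(scores) - 1
    return {model: wins[model] / num_others for model in scores}


if __name__ == "__main__":
    # Hypothetical scores from 10 bootstrapped iterations per model.
    scores = {
        "model-a": [0.61, 0.63, 0.60, 0.62, 0.64, 0.61, 0.63, 0.62, 0.60, 0.62],
        "model-b": [0.55, 0.58, 0.54, 0.55, 0.57, 0.56, 0.55, 0.57, 0.53, 0.56],
        "model-c": [0.56, 0.62, 0.51, 0.60, 0.66, 0.53, 0.64, 0.58, 0.50, 0.59],
    }
    print(win_ratios(scores))
```

In this toy example each win ratio is simply the fraction of the other two models that are beaten at the 0.05 level, which is the quantity the radial plot is described as using.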