saattrupdan committed · Commit 79f114e · Parent(s): 5d40291

feat: Add citation

app.py CHANGED
@@ -42,13 +42,12 @@ available](https://scandeval.com).
 The generative models are evaluated using in-context learning with few-shot prompts.
 The few-shot examples are sampled randomly from the training split, and we benchmark
 the models 10 times with bootstrapped test sets and different few-shot examples in each
-iteration. This allows us to better measure the uncertainty of the results.
-
-
-percentage of other models that a model beats on a task
-
-
-model significantly beats another model.
+iteration. This allows us to better measure the uncertainty of the results. We use the
+uncertainty in the radial plot when we compute the win ratios (i.e., the percentage of
+other models that a model beats on a task). Namely, we compute the win ratio as the
+percentage of other models that a model _significantly_ beats on a task, where we use a
+paired t-test with a significance level of 0.05 to determine whether a model
+significantly beats another model.
 
 ## The Benchmark Datasets
 
@@ -104,6 +103,22 @@ classification, we use the probabilities of the answer letter (a, b, c or d) to
 the answer. The datasets in this task are machine translated versions of the
 [HellaSwag](https://rowanzellers.com/hellaswag/) dataset. We use the Matthews
 Correlation Coefficient (MCC) as the evaluation metric.
+
+
+## Citation
+
+If you use the ScandEval benchmark in your work, please cite [the
+paper](https://aclanthology.org/2023.nodalida-1.20):
+
+```
+@inproceedings{nielsen2023scandeval,
+title={ScandEval: A Benchmark for Scandinavian Natural Language Processing},
+author={Nielsen, Dan},
+booktitle={Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
+pages={185--201},
+year={2023}
+}
+```
 """
 
 
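As a side note on the win-ratio computation described in the paragraph added by this commit: a minimal sketch of the idea is given below. This is not the code touched by the commit; the `win_ratios` helper and the `scores` layout (one score per bootstrapped evaluation iteration for each model) are hypothetical, and it uses SciPy's two-sided `ttest_rel` together with the sign of the statistic, whereas the benchmark itself may set up the paired t-test differently.

```python
# Sketch only: turn per-iteration scores into win ratios, where a model "wins"
# against another when a paired t-test says it is significantly better at the
# 0.05 level. Names and data layout here are assumptions, not the app's code.
from itertools import combinations

from scipy.stats import ttest_rel


def win_ratios(scores: dict[str, list[float]], alpha: float = 0.05) -> dict[str, float]:
    """Share of other models that each model significantly beats on a task."""
    wins = {model: 0 for model in scores}
    for model_a, model_b in combinations(scores, 2):
        # Paired t-test, since both models are scored on the same bootstrapped
        # test sets and few-shot examples in each iteration.
        result = ttest_rel(scores[model_a], scores[model_b])
        if result.pvalue < alpha:
            winner = model_a if result.statistic > 0 else model_b
            wins[winner] += 1
    num_others = len(scores) - 1
    return {model: wins[model] / num_others for model in scores}


if __name__ == "__main__":
    # Hypothetical scores from 10 bootstrapped iterations per model.
    scores = {
        "model-a": [0.61, 0.63, 0.60, 0.62, 0.64, 0.61, 0.63, 0.62, 0.60, 0.62],
        "model-b": [0.55, 0.58, 0.54, 0.55, 0.57, 0.56, 0.55, 0.57, 0.53, 0.56],
        "model-c": [0.56, 0.62, 0.51, 0.60, 0.66, 0.53, 0.64, 0.58, 0.50, 0.59],
    }
    print(win_ratios(scores))
```

In this toy example each win ratio is simply the fraction of the other two models that are beaten at the 0.05 level, which is the quantity the radial plot is described as using.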