---
title: SyntaxGym
emoji: 🏋️
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 3.0.13
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  Evaluates Huggingface models on SyntaxGym datasets (targeted syntactic evaluations).
---

# Metric Card for SyntaxGym

## Metric Description

[SyntaxGym][syntaxgym] is a framework for targeted syntactic evaluation of language models. This metric can be combined with the [SyntaxGym dataset][syntaxgym-dataset] to evaluate the syntactic capacities of any Huggingface causal language model.

## How to Use

The metric takes a SyntaxGym test suite as input, as well as the name of the model that should be evaluated:

```python
import datasets
import evaluate
import numpy as np

dataset = datasets.load_dataset("cpllab/syntaxgym", "subordination_src-src")
metric = evaluate.load("cpllab/syntaxgym")
result = metric.compute(dataset=dataset["test"], model_id="gpt2")

# Compute suite accuracy. Mean success over items, where "success" is the conjunction
# of all boolean prediction results.
suite_accuracy = result["subordination_src-src"].accuracy
```

### Run the entire SyntaxGym dataset

You can load and evaluate all suites at once by omitting the dataset configuration name (second argument):

```python
import datasets
import evaluate
import numpy as np

dataset = datasets.load_dataset("cpllab/syntaxgym")
metric = evaluate.load("cpllab/syntaxgym")
result = metric.compute(dataset=dataset["test"], model_id="gpt2")

# Compute suite accuracy. Mean success over items, where "success" is the conjunction
# of all boolean prediction results.
suite_accuracies = {suite_name: suite_results.accuracy
                    for suite_name, suite_results in result.items()}
overall_accuracy = np.mean(list(suite_accuracies.values()))
```

```python
>>> suite_accuracies
{'center_embed': 0.9285714285714286,
 'center_embed_mod': 0.8571428571428571,
 'cleft': 1.0,
 'cleft_modifier': 0.925,
 'fgd_hierarchy': 0.0,
 'fgd_object': 0.9583333333333334,
 'fgd_pp': 0.875,
 'fgd_subject': 0.5,
 'mvrr': 0.7857142857142857,
 'mvrr_mod': 0.75,
 'npi_orc_any': 0.9736842105263158,
 'npi_orc_ever': 1.0,
 'npi_src_any': 0.5789473684210527,
 'npi_src_ever': 0.9210526315789473,
 'npz_ambig': 0.9166666666666666,
 'npz_ambig_mod': 0.875,
 'npz_obj': 1.0,
 'npz_obj_mod': 1.0,
 'number_orc': 0.631578947368421,
 'number_prep': 0.7894736842105263,
 'number_src': 0.7894736842105263,
 'reflexive_orc_fem': 0.47368421052631576,
 'reflexive_orc_masc': 0.8421052631578947,
 'reflexive_prep_fem': 0.21052631578947367,
 'reflexive_prep_masc': 0.7894736842105263,
 'reflexive_src_fem': 0.15789473684210525,
 'reflexive_src_masc': 0.631578947368421,
 'subordination': 1.0,
 'subordination_orc-orc': 1.0,
 'subordination_pp-pp': 1.0,
 'subordination_src-src': 1.0}
>>> overall_accuracy
0.7793839437302936
```
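The per-suite accuracies are an ordinary Python dict and can be post-processed however you like. As a small illustration (using a handful of the values above, copied by hand), you might flag the suites where the model falls below some cutoff:

```python
# A few per-suite accuracies, copied from the example output above for
# illustration; in practice this would be the full `suite_accuracies` dict.
suite_accuracies = {
    "fgd_hierarchy": 0.0,
    "fgd_subject": 0.5,
    "reflexive_src_fem": 0.15789473684210525,
    "subordination": 1.0,
}

# Flag suites scoring below an arbitrary cutoff (0.6 here).
THRESHOLD = 0.6
weak_suites = sorted(name for name, acc in suite_accuracies.items()
                     if acc < THRESHOLD)
print(weak_suites)  # -> ['fgd_hierarchy', 'fgd_subject', 'reflexive_src_fem']
```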

### Inputs

- **dataset** (`Dataset`): SyntaxGym test suite, represented as a Huggingface dataset. See the [dataset reference][syntaxgym-dataset].
- **model_id** (`str`): Model used to calculate probabilities of each word. (This is only well defined for causal language models. This includes models such as `gpt2`, causal variants of BERT, causal versions of T5, and more. The full list can be found in the [`AutoModelForCausalLM` documentation][causal].)
- **batch_size** (`int`): Maximum batch size for computations.
- **add_start_token** (`bool`): Whether to add the start token to each sentence. Defaults to `True`.
- **device** (`str`): Device to run on. Defaults to `cuda` when available.

### Output Values

The metric returns a dict of `SyntaxGymMetricSuiteResult` objects, mapping test suite names to test suite performance. Each inner object has three properties:

- **accuracy** (`float`): Model accuracy on this suite. This is the accuracy of the conjunction of all boolean predictions per item in the suite.
- **prediction_results** (`List[List[bool]]`): For each item in the test suite, a list of booleans indicating whether each corresponding prediction came out `True`. Typically these are combined to yield an accuracy score (but you can simply use the `accuracy` property).
- **region_totals** (`List[Dict[Tuple[str, int], float]]`): For each item, a mapping from individual regions (keyed by `(<condition_name>, <region_number>)`) to the float-valued total surprisal of the tokens in that region. This is useful for visualization, or if you'd like to use the aggregate surprisal data for other tasks (e.g. reading-time prediction or neural activity prediction).

```python
>>> print(result["subordination_src-src"]["prediction_results"][0])
[True]
>>> print(result["subordination_src-src"]["region_totals"][0])
{('sub_no-matrix', 1): 14.905603408813477,
 ('sub_no-matrix', 2): 39.063140869140625,
 ('sub_no-matrix', 3): 26.862628936767578,
 ('sub_no-matrix', 4): 50.56561279296875,
 ('sub_no-matrix', 5): 7.470069408416748,
 ('no-sub_no-matrix', 1): 13.15120792388916,
 ('no-sub_no-matrix', 2): 38.50318908691406,
 ('no-sub_no-matrix', 3): 27.623855590820312,
 ('no-sub_no-matrix', 4): 48.8316535949707,
 ('no-sub_no-matrix', 5): 1.8095952272415161,
 ('sub_matrix', 1): 14.905603408813477,
 ('sub_matrix', 2): 39.063140869140625,
 ('sub_matrix', 3): 26.862628936767578,
 ('sub_matrix', 4): 50.56561279296875,
 ('sub_matrix', 5): 26.532146453857422,
 ('no-sub_matrix', 1): 13.15120792388916,
 ('no-sub_matrix', 2): 38.50318908691406,
 ('no-sub_matrix', 3): 27.623855590820312,
 ('no-sub_matrix', 4): 48.8316535949707,
 ('no-sub_matrix', 5): 38.085227966308594}
```
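Because `region_totals` is keyed by `(condition, region)` pairs, it is straightforward to roll the values up into per-condition sentence surprisals. A minimal sketch, using values from the example output above (copied by hand for illustration):

```python
# Per-region surprisals for two conditions, copied from the example output
# above; in practice, iterate over result["<suite>"]["region_totals"].
region_totals = {
    ("sub_no-matrix", 1): 14.905603408813477,
    ("sub_no-matrix", 2): 39.063140869140625,
    ("sub_no-matrix", 3): 26.862628936767578,
    ("sub_no-matrix", 4): 50.56561279296875,
    ("sub_no-matrix", 5): 7.470069408416748,
    ("sub_matrix", 1): 14.905603408813477,
    ("sub_matrix", 2): 39.063140869140625,
    ("sub_matrix", 3): 26.862628936767578,
    ("sub_matrix", 4): 50.56561279296875,
    ("sub_matrix", 5): 26.532146453857422,
}

# Sum region surprisals to get a total surprisal per condition.
condition_totals = {}
for (condition, _region), surprisal in region_totals.items():
    condition_totals[condition] = condition_totals.get(condition, 0.0) + surprisal

print({c: round(t, 2) for c, t in condition_totals.items()})
```

Per-condition and per-region surprisals like these are the quantities that SyntaxGym's success criteria compare across conditions.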

## Limitations and Bias

TODO

## Citation

If you use this metric in your research, please cite:

```bibtex
@inproceedings{gauthier-etal-2020-syntaxgym,
	title = "{S}yntax{G}ym: An Online Platform for Targeted Evaluation of Language Models",
	author = "Gauthier, Jon and Hu, Jennifer and Wilcox, Ethan and Qian, Peng and Levy, Roger",
	booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
	month = jul,
	year = "2020",
	address = "Online",
	publisher = "Association for Computational Linguistics",
	url = "https://www.aclweb.org/anthology/2020.acl-demos.10",
	pages = "70--76",
	abstract = "Targeted syntactic evaluations have yielded insights into the generalizations learned by neural network language models. However, this line of research requires an uncommon confluence of skills: both the theoretical knowledge needed to design controlled psycholinguistic experiments, and the technical proficiency needed to train and deploy large-scale language models. We present SyntaxGym, an online platform designed to make targeted evaluations accessible to both experts in NLP and linguistics, reproducible across computing environments, and standardized following the norms of psycholinguistic experimental design. This paper releases two tools of independent value for the computational linguistics community: 1. A website, syntaxgym.org, which centralizes the process of targeted syntactic evaluation and provides easy tools for analysis and visualization; 2. Two command-line tools, {`}syntaxgym{`} and {`}lm-zoo{`}, which allow any user to reproduce targeted syntactic evaluations and general language model inference on their own machine.",
}
```

If you use the [SyntaxGym dataset][syntaxgym-dataset] in your research, please cite:

```bibtex
@inproceedings{Hu:et-al:2020,
  author = {Hu, Jennifer and Gauthier, Jon and Qian, Peng and Wilcox, Ethan and Levy, Roger},
  title = {A systematic assessment of syntactic generalization in neural language models},
  booktitle = {Proceedings of the Association for Computational Linguistics},
  year = {2020}
}
```

[syntaxgym]: https://syntaxgym.org
[syntaxgym-dataset]: https://huggingface.co/datasets/cpllab/syntaxgym
[causal]: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM