---
title: SyntaxGym
emoji: 🏋️
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 3.0.13
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  Evaluates Huggingface models on SyntaxGym datasets (targeted syntactic
  evaluations).
---

# Metric Card for SyntaxGym

## Metric Description

[SyntaxGym][syntaxgym] is a framework for targeted syntactic evaluation of language models. This metric can be combined with the [SyntaxGym dataset][syntaxgym-dataset] to evaluate the syntactic capacities of any Huggingface causal language model.

## How to Use

The metric takes a SyntaxGym test suite as input, as well as the name of the model that should be evaluated:

```python
import datasets
import evaluate
import numpy as np

dataset = datasets.load_dataset("cpllab/syntaxgym", "subordination_src-src")
metric = evaluate.load("cpllab/syntaxgym")
result = metric.compute(dataset=dataset["test"], model_id="gpt2")

# Compute suite accuracy. Mean success over items, where "success" is the conjunction
# of all boolean prediction results.
suite_accuracy = result["subordination_src-src"].accuracy
```

### Run the entire SyntaxGym dataset

You can load and evaluate all suites at once by omitting the dataset configuration name (the second argument to `load_dataset`):

```python
import datasets
import evaluate
import numpy as np

dataset = datasets.load_dataset("cpllab/syntaxgym")
metric = evaluate.load("cpllab/syntaxgym")
result = metric.compute(dataset=dataset["test"], model_id="gpt2")

# Compute suite accuracy. Mean success over items, where "success" is the conjunction
# of all boolean prediction results.
suite_accuracies = {suite_name: suite_results.accuracy
                    for suite_name, suite_results in result.items()}
overall_accuracy = np.mean(list(suite_accuracies.values()))
```

```python
>>> suite_accuracies
{'center_embed': 0.9285714285714286,
 'center_embed_mod': 0.8571428571428571,
 'cleft': 1.0,
 'cleft_modifier': 0.925,
 'fgd_hierarchy': 0.0,
 'fgd_object': 0.9583333333333334,
 'fgd_pp': 0.875,
 'fgd_subject': 0.5,
 'mvrr': 0.7857142857142857,
 'mvrr_mod': 0.75,
 'npi_orc_any': 0.9736842105263158,
 'npi_orc_ever': 1.0,
 'npi_src_any': 0.5789473684210527,
 'npi_src_ever': 0.9210526315789473,
 'npz_ambig': 0.9166666666666666,
 'npz_ambig_mod': 0.875,
 'npz_obj': 1.0,
 'npz_obj_mod': 1.0,
 'number_orc': 0.631578947368421,
 'number_prep': 0.7894736842105263,
 'number_src': 0.7894736842105263,
 'reflexive_orc_fem': 0.47368421052631576,
 'reflexive_orc_masc': 0.8421052631578947,
 'reflexive_prep_fem': 0.21052631578947367,
 'reflexive_prep_masc': 0.7894736842105263,
 'reflexive_src_fem': 0.15789473684210525,
 'reflexive_src_masc': 0.631578947368421,
 'subordination': 1.0,
 'subordination_orc-orc': 1.0,
 'subordination_pp-pp': 1.0,
 'subordination_src-src': 1.0}
>>> overall_accuracy
0.7793839437302936
```

### Inputs

- **dataset** (`Dataset`): SyntaxGym test suite, represented as a Huggingface dataset. See the [dataset reference][syntaxgym-dataset].
- **model_id** (`str`): Model used to calculate probabilities of each word. (This is only well defined for causal language models. This includes models such as `gpt2`, causal variations of BERT, causal versions of T5, and more. The full list can be found in the [`AutoModelForCausalLM` documentation][causal].)
- **batch_size** (`int`): Maximum batch size for computations.
- **add_start_token** (`bool`): Whether to add the start token to each sentence. Defaults to `True`.
- **device** (`str`): Device to run on. Defaults to `cuda` when available.
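
The optional arguments above can be passed directly to `compute`. As a minimal sketch (the parameter values here are purely illustrative, and `device="cuda"` assumes a GPU is available):

```python
import datasets
import evaluate

dataset = datasets.load_dataset("cpllab/syntaxgym", "subordination_src-src")
metric = evaluate.load("cpllab/syntaxgym")

# Illustrative values; choose batch_size and device to match your hardware.
result = metric.compute(
    dataset=dataset["test"],
    model_id="gpt2",
    batch_size=16,
    add_start_token=True,
    device="cuda",  # omit or use "cpu" if no GPU is available
)
```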
### Output Values

The metric returns a dict of `SyntaxGymMetricSuiteResult` objects, mapping test suite names to test suite performance. Each inner object has three properties:

- **accuracy** (`float`): Model accuracy on this suite. This is the accuracy of the conjunction of all boolean predictions per item in the suite.
- **prediction_results** (`List[List[bool]]`): For each item in the test suite, a list of booleans indicating whether each corresponding prediction came out `True`. Typically these are combined to yield an accuracy score (but you can simply use the `accuracy` property).
- **region_totals** (`List[Dict[Tuple[str, int], float]]`): For each item, a mapping from individual regions (keys `(condition_name, region_number)`) to the float-valued total surprisal of the tokens in that region. This is useful for visualization, or if you'd like to use the aggregate surprisal data for other tasks (e.g. reading time prediction or neural activity prediction).

```python
>>> print(result["subordination_src-src"]["prediction_results"][0])
[True]

>>> print(result["subordination_src-src"]["region_totals"][0])
{('sub_no-matrix', 1): 14.905603408813477,
 ('sub_no-matrix', 2): 39.063140869140625,
 ('sub_no-matrix', 3): 26.862628936767578,
 ('sub_no-matrix', 4): 50.56561279296875,
 ('sub_no-matrix', 5): 7.470069408416748,
 ('no-sub_no-matrix', 1): 13.15120792388916,
 ('no-sub_no-matrix', 2): 38.50318908691406,
 ('no-sub_no-matrix', 3): 27.623855590820312,
 ('no-sub_no-matrix', 4): 48.8316535949707,
 ('no-sub_no-matrix', 5): 1.8095952272415161,
 ('sub_matrix', 1): 14.905603408813477,
 ('sub_matrix', 2): 39.063140869140625,
 ('sub_matrix', 3): 26.862628936767578,
 ('sub_matrix', 4): 50.56561279296875,
 ('sub_matrix', 5): 26.532146453857422,
 ('no-sub_matrix', 1): 13.15120792388916,
 ('no-sub_matrix', 2): 38.50318908691406,
 ('no-sub_matrix', 3): 27.623855590820312,
 ('no-sub_matrix', 4): 48.8316535949707,
 ('no-sub_matrix', 5): 38.085227966308594}
```
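
As a rough sketch of how these fields relate (assuming the `result` object from the single-suite example above), suite accuracy can be reproduced from `prediction_results`, and `region_totals` can be aggregated per condition:

```python
import numpy as np

suite_result = result["subordination_src-src"]

# An item counts as a success only when all of its boolean predictions hold;
# suite accuracy is the mean success rate over items.
item_successes = [all(preds) for preds in suite_result["prediction_results"]]
assert np.isclose(np.mean(item_successes), suite_result.accuracy)

# Sum region surprisals per condition for the first item, e.g. as input to a
# downstream reading time or neural activity analysis.
condition_totals = {}
for (condition, region_number), surprisal in suite_result["region_totals"][0].items():
    condition_totals[condition] = condition_totals.get(condition, 0.0) + surprisal
```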
## Limitations and Bias

TODO

## Citation

If you use this metric in your research, please cite:

```bibtex
@inproceedings{gauthier-etal-2020-syntaxgym,
    title = "{S}yntax{G}ym: An Online Platform for Targeted Evaluation of Language Models",
    author = "Gauthier, Jon and Hu, Jennifer and Wilcox, Ethan and Qian, Peng and Levy, Roger",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.10",
    pages = "70--76",
    abstract = "Targeted syntactic evaluations have yielded insights into the generalizations learned by neural network language models. However, this line of research requires an uncommon confluence of skills: both the theoretical knowledge needed to design controlled psycholinguistic experiments, and the technical proficiency needed to train and deploy large-scale language models. We present SyntaxGym, an online platform designed to make targeted evaluations accessible to both experts in NLP and linguistics, reproducible across computing environments, and standardized following the norms of psycholinguistic experimental design. This paper releases two tools of independent value for the computational linguistics community: 1. A website, syntaxgym.org, which centralizes the process of targeted syntactic evaluation and provides easy tools for analysis and visualization; 2. Two command-line tools, {`}syntaxgym{`} and {`}lm-zoo{`}, which allow any user to reproduce targeted syntactic evaluations and general language model inference on their own machine.",
}
```

If you use the [SyntaxGym dataset][syntaxgym-dataset] in your research, please cite:

```bibtex
@inproceedings{Hu:et-al:2020,
    author = {Hu, Jennifer and Gauthier, Jon and Qian, Peng and Wilcox, Ethan and Levy, Roger},
    title = {A systematic assessment of syntactic generalization in neural language models},
    booktitle = {Proceedings of the Association of Computational Linguistics},
    year = {2020}
}
```

[syntaxgym]: https://syntaxgym.org
[syntaxgym-dataset]: https://huggingface.co/datasets/cpllab/syntaxgym
[causal]: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM