---
title: SyntaxGym
emoji: 🏋️
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 3.0.13
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  Evaluates Huggingface models on SyntaxGym datasets (targeted syntactic
  evaluations).
---

# Metric Card for SyntaxGym

## Metric Description

SyntaxGym is a framework for targeted syntactic evaluation of language models. This metric can be combined with the SyntaxGym dataset to evaluate the syntactic capacities of any Huggingface causal language model.

## How to Use

The metric takes a SyntaxGym test suite as input, as well as the name of the model that should be evaluated:

```python
import datasets
import evaluate
import numpy as np

dataset = datasets.load_dataset("cpllab/syntaxgym", "subordination_src-src")
metric = evaluate.load("cpllab/syntaxgym")
result = metric.compute(dataset=dataset["test"], model_id="gpt2")

# Compute suite accuracy. Mean success over items, where "success" is the
# conjunction of all boolean prediction results.
suite_accuracy = result["subordination_src-src"].accuracy
```
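
This accuracy is simply the mean over items of the conjunction of each item's boolean predictions. A minimal sketch of the equivalent manual computation, using the `prediction_results` property described under Output Values below:

```python
# Equivalent manual computation (sketch): an item "succeeds" only if all of
# its boolean predictions hold; accuracy is the mean success rate over items.
item_successes = [all(preds)
                  for preds in result["subordination_src-src"]["prediction_results"]]
manual_accuracy = np.mean(item_successes)
assert np.isclose(manual_accuracy, suite_accuracy)
```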

### Run the entire SyntaxGym dataset

You can load and evaluate all suites at once by omitting the dataset configuration name (second argument):

```python
import datasets
import evaluate
import numpy as np

dataset = datasets.load_dataset("cpllab/syntaxgym")
metric = evaluate.load("cpllab/syntaxgym")
result = metric.compute(dataset=dataset["test"], model_id="gpt2")

# Compute suite accuracy. Mean success over items, where "success" is the
# conjunction of all boolean prediction results.
suite_accuracies = {suite_name: suite_results.accuracy
                    for suite_name, suite_results in result.items()}
overall_accuracy = np.mean(list(suite_accuracies.values()))
```
```python
>>> suite_accuracies
{'center_embed': 0.9285714285714286,
 'center_embed_mod': 0.8571428571428571,
 'cleft': 1.0,
 'cleft_modifier': 0.925,
 'fgd_hierarchy': 0.0,
 'fgd_object': 0.9583333333333334,
 'fgd_pp': 0.875,
 'fgd_subject': 0.5,
 'mvrr': 0.7857142857142857,
 'mvrr_mod': 0.75,
 'npi_orc_any': 0.9736842105263158,
 'npi_orc_ever': 1.0,
 'npi_src_any': 0.5789473684210527,
 'npi_src_ever': 0.9210526315789473,
 'npz_ambig': 0.9166666666666666,
 'npz_ambig_mod': 0.875,
 'npz_obj': 1.0,
 'npz_obj_mod': 1.0,
 'number_orc': 0.631578947368421,
 'number_prep': 0.7894736842105263,
 'number_src': 0.7894736842105263,
 'reflexive_orc_fem': 0.47368421052631576,
 'reflexive_orc_masc': 0.8421052631578947,
 'reflexive_prep_fem': 0.21052631578947367,
 'reflexive_prep_masc': 0.7894736842105263,
 'reflexive_src_fem': 0.15789473684210525,
 'reflexive_src_masc': 0.631578947368421,
 'subordination': 1.0,
 'subordination_orc-orc': 1.0,
 'subordination_pp-pp': 1.0,
 'subordination_src-src': 1.0}
>>> overall_accuracy
0.7793839437302936
```
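
To see where the model struggles, you can rank suites from weakest to strongest (a small usage sketch, not part of the metric API):

```python
# Rank suites by accuracy, weakest first.
for suite_name, acc in sorted(suite_accuracies.items(), key=lambda kv: kv[1]):
    print(f"{suite_name}: {acc:.3f}")
```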

### Inputs

- **dataset** (Dataset): SyntaxGym test suite, represented as a Huggingface dataset. See the dataset reference.
- **model_id** (str): Model used to calculate probabilities of each word. (This is only well defined for causal language models, such as gpt2, causal variations of BERT, causal versions of T5, and more. The full list can be found in the AutoModelForCausalLM documentation.)
- **batch_size** (int): Maximum batch size for computation.
- **add_start_token** (bool): Whether to add the start token to each sentence. Defaults to `True`.
- **device** (str): Device to run on. Defaults to `cuda` when available. (See the usage sketch after this list.)
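
A sketch of a call passing the optional arguments above (the argument values are illustrative, not documented defaults):

```python
result = metric.compute(
    dataset=dataset["test"],
    model_id="gpt2",
    batch_size=16,           # illustrative; pick what fits your GPU memory
    add_start_token=True,    # the documented default
    device="cuda",           # or "cpu"
)
```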

### Output Values

The metric returns a dict of `SyntaxGymMetricSuiteResult` objects, mapping test suite names to test suite performance. Each inner object has three properties:

- **accuracy** (float): Model accuracy on this suite. This is the accuracy of the conjunction of all boolean predictions per item in the suite.
- **prediction_results** (List[List[bool]]): For each item in the test suite, a list of booleans indicating whether each corresponding prediction came out True. Typically these are combined to yield an accuracy score (but you can simply use the accuracy property).
- **region_totals** (List[Dict[Tuple[str, int], float]]): For each item, a mapping from individual region (keys `(<condition_name>, <region_number>)`) to the float-valued total surprisal for tokens in that region. This is useful for visualization, or if you'd like to use the aggregate surprisal data for other tasks (e.g. reading time prediction or neural activity prediction).

```python
>>> print(result["subordination_src-src"]["prediction_results"][0])
[True]
>>> print(result["subordination_src-src"]["region_totals"][0])
{('sub_no-matrix', 1): 14.905603408813477,
 ('sub_no-matrix', 2): 39.063140869140625,
 ('sub_no-matrix', 3): 26.862628936767578,
 ('sub_no-matrix', 4): 50.56561279296875,
 ('sub_no-matrix', 5): 7.470069408416748,
 ('no-sub_no-matrix', 1): 13.15120792388916,
 ('no-sub_no-matrix', 2): 38.50318908691406,
 ('no-sub_no-matrix', 3): 27.623855590820312,
 ('no-sub_no-matrix', 4): 48.8316535949707,
 ('no-sub_no-matrix', 5): 1.8095952272415161,
 ('sub_matrix', 1): 14.905603408813477,
 ('sub_matrix', 2): 39.063140869140625,
 ('sub_matrix', 3): 26.862628936767578,
 ('sub_matrix', 4): 50.56561279296875,
 ('sub_matrix', 5): 26.532146453857422,
 ('no-sub_matrix', 1): 13.15120792388916,
 ('no-sub_matrix', 2): 38.50318908691406,
 ('no-sub_matrix', 3): 27.623855590820312,
 ('no-sub_matrix', 4): 48.8316535949707,
 ('no-sub_matrix', 5): 38.085227966308594}
```
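
The per-region surprisals are straightforward to reshape for analysis. A minimal sketch (assuming `pandas` is available; the column names here are our own, not part of the metric API):

```python
import pandas as pd

# Flatten region_totals into long format: one row per (item, condition, region).
rows = [
    {"item": item_idx, "condition": condition, "region": region, "surprisal": total}
    for item_idx, item in enumerate(result["subordination_src-src"]["region_totals"])
    for (condition, region), total in item.items()
]
df = pd.DataFrame(rows)

# Mean surprisal per condition and region, averaged over items.
print(df.groupby(["condition", "region"])["surprisal"].mean())
```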

## Limitations and Bias

TODO

## Citation

If you use this metric in your research, please cite:

```bibtex
@inproceedings{gauthier-etal-2020-syntaxgym,
    title = "{S}yntax{G}ym: An Online Platform for Targeted Evaluation of Language Models",
    author = "Gauthier, Jon and Hu, Jennifer and Wilcox, Ethan and Qian, Peng and Levy, Roger",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.10",
    pages = "70--76",
    abstract = "Targeted syntactic evaluations have yielded insights into the generalizations learned by neural network language models. However, this line of research requires an uncommon confluence of skills: both the theoretical knowledge needed to design controlled psycholinguistic experiments, and the technical proficiency needed to train and deploy large-scale language models. We present SyntaxGym, an online platform designed to make targeted evaluations accessible to both experts in NLP and linguistics, reproducible across computing environments, and standardized following the norms of psycholinguistic experimental design. This paper releases two tools of independent value for the computational linguistics community: 1. A website, syntaxgym.org, which centralizes the process of targeted syntactic evaluation and provides easy tools for analysis and visualization; 2. Two command-line tools, {`}syntaxgym{`} and {`}lm-zoo{`}, which allow any user to reproduce targeted syntactic evaluations and general language model inference on their own machine.",
}
```

If you use the SyntaxGym dataset in your research, please cite:

```bibtex
@inproceedings{Hu:et-al:2020,
 author = {Hu, Jennifer and Gauthier, Jon and Qian, Peng and Wilcox, Ethan and Levy, Roger},
 title = {A systematic assessment of syntactic generalization in neural language models},
 booktitle = {Proceedings of the Association for Computational Linguistics},
 year = {2020}
}
```