jgauthier committed
Commit 092c6b1
Parent: 27bb1ab

partial readme draft

Files changed (1): README.md (+112, -3)

README.md:
---
title: SyntaxGym
emoji: 🏋️
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 3.0.13
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  Evaluates Huggingface models on SyntaxGym datasets (targeted syntactic evaluations).
---

# Metric Card for SyntaxGym

## Metric Description

[SyntaxGym][syntaxgym] is a framework for targeted syntactic evaluation of language models. This metric can be combined with the [SyntaxGym dataset][syntaxgym-dataset] to evaluate the syntactic capabilities of any Huggingface causal language model.

## How to Use

The metric takes a SyntaxGym test suite as input, along with the identifier of the model to be evaluated:

```python
import datasets
import evaluate
import numpy as np

dataset = datasets.load_dataset("cpllab/syntaxgym", "subordination_src-src")
metric = evaluate.load("cpllab/syntaxgym")
result = metric.compute(suite=dataset["test"], model_id="gpt2")

# Compute suite accuracy: mean success over items, where an item counts as a
# "success" iff all of its boolean prediction results are True.
prediction_results = np.array(result["prediction_results"])
suite_accuracy = prediction_results.all(axis=1).mean(axis=0)
```

### Run the entire SyntaxGym dataset

Each test suite in the [SyntaxGym dataset][syntaxgym-dataset] is a separate dataset configuration, so evaluating the full dataset amounts to iterating over the configurations and aggregating per-suite accuracies, as sketched below.
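
A minimal sketch, assuming `datasets.get_dataset_config_names` lists the suite configurations and that each configuration is one test suite (the loop body just repeats the single-suite recipe above):

```python
import datasets
import evaluate
import numpy as np

metric = evaluate.load("cpllab/syntaxgym")

# Assumption: every configuration of cpllab/syntaxgym is one test suite.
suite_names = datasets.get_dataset_config_names("cpllab/syntaxgym")

accuracies = {}
for suite_name in suite_names:
    suite = datasets.load_dataset("cpllab/syntaxgym", suite_name)["test"]
    result = metric.compute(suite=suite, model_id="gpt2")
    prediction_results = np.array(result["prediction_results"])
    accuracies[suite_name] = prediction_results.all(axis=1).mean()

# One possible aggregate: the macro-average over suites.
print(np.mean(list(accuracies.values())))
```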

### Inputs

- **suite** (`Dataset`): SyntaxGym test suite, represented as a Huggingface dataset. See the [dataset reference][syntaxgym-dataset].
- **model_id** (`str`): Model used to calculate probabilities of each word. (This is only well defined for causal language models, which include models such as `gpt2`, causal variations of BERT, causal versions of T5, and more. The full list can be found in the [`AutoModelForCausalLM` documentation][causal].)
- **batch_size** (`int`): Maximum batch size for computations.
- **add_start_token** (`bool`): Whether to add the start token to each sentence. Defaults to `True`.
- **device** (`str`): Device to run on. Defaults to `cuda` when available. (All optional arguments are shown together in the example call below.)
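
For example, a `compute` call that sets every optional argument explicitly (the argument values here are illustrative only):

```python
result = metric.compute(
    suite=dataset["test"],
    model_id="gpt2",
    batch_size=16,         # illustrative value
    add_start_token=True,
    device="cuda",
)
```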

### Output Values

The metric returns a dict with two entries:

- **prediction_results** (`List[List[bool]]`): For each item in the test suite, a list of booleans indicating whether each corresponding prediction came out `True`. These are typically combined to yield an accuracy score (see the example usage above).
- **region_totals** (`List[Dict[Tuple[str, int], float]]`): For each item, a mapping from individual regions (keyed `(<condition_name>, <region_number>)`) to the float-valued total surprisal of the tokens in that region. This is useful for visualization, or if you'd like to use the aggregate surprisal data for other tasks (e.g. reading time prediction or neural activity prediction).

```python
>>> print(result["prediction_results"][0])
[True]
>>> print(result["region_totals"][0])
{('sub_no-matrix', 1): 14.905603408813477,
 ('sub_no-matrix', 2): 39.063140869140625,
 ('sub_no-matrix', 3): 26.862628936767578,
 ('sub_no-matrix', 4): 50.56561279296875,
 ('sub_no-matrix', 5): 7.470069408416748,
 ('no-sub_no-matrix', 1): 13.15120792388916,
 ('no-sub_no-matrix', 2): 38.50318908691406,
 ('no-sub_no-matrix', 3): 27.623855590820312,
 ('no-sub_no-matrix', 4): 48.8316535949707,
 ('no-sub_no-matrix', 5): 1.8095952272415161,
 ('sub_matrix', 1): 14.905603408813477,
 ('sub_matrix', 2): 39.063140869140625,
 ('sub_matrix', 3): 26.862628936767578,
 ('sub_matrix', 4): 50.56561279296875,
 ('sub_matrix', 5): 26.532146453857422,
 ('no-sub_matrix', 1): 13.15120792388916,
 ('no-sub_matrix', 2): 38.50318908691406,
 ('no-sub_matrix', 3): 27.623855590820312,
 ('no-sub_matrix', 4): 48.8316535949707,
 ('no-sub_matrix', 5): 38.085227966308594}
```
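
If you want to work with the per-region surprisals directly, they can be reshaped into a long-format table. A minimal sketch using pandas (the reshaping code is an illustration, not part of the metric API):

```python
import pandas as pd

# One row per (item, condition, region), using `result` from the example above.
rows = [
    {"item": item_idx, "condition": condition, "region": region, "surprisal": total}
    for item_idx, region_totals in enumerate(result["region_totals"])
    for (condition, region), total in region_totals.items()
]
df = pd.DataFrame(rows)

# Mean surprisal per condition and region, across items.
print(df.groupby(["condition", "region"])["surprisal"].mean())
```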

## Limitations and Bias

TODO

## Citation

If you use this metric in your research, please cite:

```bibtex
@inproceedings{gauthier-etal-2020-syntaxgym,
    title = "{S}yntax{G}ym: An Online Platform for Targeted Evaluation of Language Models",
    author = "Gauthier, Jon and Hu, Jennifer and Wilcox, Ethan and Qian, Peng and Levy, Roger",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.10",
    pages = "70--76",
    abstract = "Targeted syntactic evaluations have yielded insights into the generalizations learned by neural network language models. However, this line of research requires an uncommon confluence of skills: both the theoretical knowledge needed to design controlled psycholinguistic experiments, and the technical proficiency needed to train and deploy large-scale language models. We present SyntaxGym, an online platform designed to make targeted evaluations accessible to both experts in NLP and linguistics, reproducible across computing environments, and standardized following the norms of psycholinguistic experimental design. This paper releases two tools of independent value for the computational linguistics community: 1. A website, syntaxgym.org, which centralizes the process of targeted syntactic evaluation and provides easy tools for analysis and visualization; 2. Two command-line tools, {`}syntaxgym{`} and {`}lm-zoo{`}, which allow any user to reproduce targeted syntactic evaluations and general language model inference on their own machine.",
}
```

If you use the [SyntaxGym dataset][syntaxgym-dataset] in your research, please cite:

```bibtex
@inproceedings{Hu:et-al:2020,
    author = {Hu, Jennifer and Gauthier, Jon and Qian, Peng and Wilcox, Ethan and Levy, Roger},
    title = {A systematic assessment of syntactic generalization in neural language models},
    booktitle = {Proceedings of the Association for Computational Linguistics},
    year = {2020}
}
```

[syntaxgym]: https://syntaxgym.org
[syntaxgym-dataset]: https://huggingface.co/datasets/cpllab/syntaxgym
[causal]: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM