lvwerra (HF staff) committed
Commit a49ed8a
1 Parent(s): acea538

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +112 -5
  2. app.py +6 -0
  3. glue.py +156 -0
  4. requirements.txt +5 -0
README.md CHANGED
@@ -1,12 +1,119 @@
  ---
- title: Glue
- emoji: 📚
- colorFrom: red
- colorTo: pink
+ title: GLUE
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+ # Metric Card for GLUE
+
+ ## Metric description
+ This metric is used to compute the GLUE evaluation metric associated with each subset of the [GLUE dataset](https://huggingface.co/datasets/glue).
+
+ GLUE, the General Language Understanding Evaluation benchmark, is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
+
+ ## How to use
+
+ There are two steps: (1) loading the GLUE metric relevant to the subset of the GLUE dataset being used for evaluation; and (2) calculating the metric.
+
+ 1. **Loading the relevant GLUE metric**: the subsets of GLUE are the following: `sst2`, `mnli`, `mnli_mismatched`, `mnli_matched`, `qnli`, `rte`, `wnli`, `cola`, `stsb`, `mrpc`, `qqp`, and `hans`.
+
+ More information about the different subsets of the GLUE dataset can be found on the [GLUE dataset page](https://huggingface.co/datasets/glue).
+
+ 2. **Calculating the metric**: the metric takes two inputs: a list of the model's predictions to score and a list of references, one per prediction.
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'sst2')
+ references = [0, 1]
+ predictions = [0, 1]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ ```
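+
+ With the toy predictions and references above, the `sst2` configuration returns accuracy alone:
+
+ ```python
+ print(results)
+ {'accuracy': 1.0}
+ ```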
+ ## Output values
+
+ The output of the metric depends on the GLUE subset chosen: it is a dictionary containing one or several of the following metrics:
+
+ `accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).
+
+ `f1`: the harmonic mean of precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1: its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.
+
+ `pearson`: a measure of the linear relationship between two datasets (see [Pearson correlation](https://huggingface.co/metrics/pearsonr) for more information). Its range is between -1 and +1, with 0 implying no correlation, and -1/+1 implying an exact linear relationship. Positive correlations imply that as x increases, so does y, whereas negative correlations imply that as x increases, y decreases.
+
+ `spearmanr`: a nonparametric measure of the monotonicity of the relationship between two datasets (see [Spearman correlation](https://huggingface.co/metrics/spearmanr) for more information). `spearmanr` has the same range as `pearson`.
+
+ `matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction, and -1 an inverse prediction.
+
+ The `cola` subset returns `matthews_correlation`, the `stsb` subset returns `pearson` and `spearmanr`, the `mrpc` and `qqp` subsets return both `accuracy` and `f1`, and all other subsets of GLUE return only `accuracy`.
+
+ ### Values from popular papers
+ The [original GLUE paper](https://arxiv.org/abs/1804.07461) reported average scores ranging from 58 to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
+
+ For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/glue).
+
+ ## Examples
+
+ Maximal values for the MRPC subset (which outputs `accuracy` and `f1`):
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'mrpc')  # 'mrpc' or 'qqp'
+ references = [0, 1]
+ predictions = [0, 1]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ print(results)
+ {'accuracy': 1.0, 'f1': 1.0}
+ ```
+
+ Minimal values for the STSB subset (which outputs `pearson` and `spearmanr`):
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'stsb')
+ references = [0., 1., 2., 3., 4., 5.]
+ predictions = [-10., -11., -12., -13., -14., -15.]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ print(results)
+ {'pearson': -1.0, 'spearmanr': -1.0}
+ ```
+
+ Partial match for the COLA subset (which outputs `matthews_correlation`):
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'cola')
+ references = [0, 1]
+ predictions = [1, 1]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ print(results)
+ {'matthews_correlation': 0.0}
+ ```
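+
+ As a further sketch in the same style, with toy labels rather than real model outputs: a partial match for an accuracy-only subset such as QNLI (the same pattern applies to `sst2`, `mnli`, `rte`, `wnli`, and the other subsets that return only `accuracy`):
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'qnli')
+ references = [0, 1]
+ predictions = [1, 1]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ print(results)
+ {'accuracy': 0.5}
+ ```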
+
+ ## Limitations and bias
+ This metric works only with datasets that have the same format as the [GLUE dataset](https://huggingface.co/datasets/glue).
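+
+ Concretely, `predictions` and `references` are flat lists of labels: integers for the classification subsets and floats for `stsb`. A minimal sketch of both cases, with made-up toy values:
+
+ ```python
+ from evaluate import load
+
+ # Classification subsets (here 'rte') expect integer class labels.
+ rte_metric = load('glue', 'rte')
+ rte_metric.compute(predictions=[0, 1, 1], references=[0, 1, 0])
+
+ # The 'stsb' subset expects float similarity scores.
+ stsb_metric = load('glue', 'stsb')
+ stsb_metric.compute(predictions=[0.5, 2.7, 4.1], references=[0.0, 3.0, 5.0])
+ ```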
+
+ While the GLUE dataset is meant to represent "General Language Understanding", the tasks it contains are not necessarily representative of language understanding in general, and should not be interpreted as such.
+
+ Also, while the GLUE subtasks were considered challenging when the benchmark was created in 2019, they are no longer considered as such given the impressive progress made since then. A more complex (or "stickier") version of it, called [SuperGLUE](https://huggingface.co/datasets/super_glue), was subsequently created.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{wang2019glue,
+   title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
+   author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
+   note={In the Proceedings of ICLR.},
+   year={2019}
+ }
+ ```
+
+ ## Further References
+
+ - [GLUE benchmark homepage](https://gluebenchmark.com/)
+ - [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("glue")
+ launch_gradio_widget(module)
glue.py ADDED
@@ -0,0 +1,156 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """GLUE benchmark metric."""
+
+ import datasets
+ from scipy.stats import pearsonr, spearmanr
+ from sklearn.metrics import f1_score, matthews_corrcoef
+
+ import evaluate
+
+
+ _CITATION = """\
+ @inproceedings{wang2019glue,
+     title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
+     author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
+     note={In the Proceedings of ICLR.},
+     year={2019}
+ }
+ """
+
+ _DESCRIPTION = """\
+ GLUE, the General Language Understanding Evaluation benchmark
+ (https://gluebenchmark.com/) is a collection of resources for training,
+ evaluating, and analyzing natural language understanding systems.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Compute the GLUE evaluation metric associated with each GLUE dataset.
+ Args:
+     predictions: list of predictions to score.
+         Each prediction is an integer label (or a float for the stsb subset).
+     references: list of references, one per prediction, of the same type.
+ Returns: depending on the GLUE subset, one or several of:
+     "accuracy": Accuracy
+     "f1": F1 score
+     "pearson": Pearson Correlation
+     "spearmanr": Spearman Correlation
+     "matthews_correlation": Matthews Correlation
+ Examples:
+
+     >>> glue_metric = evaluate.load('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
+     >>> references = [0, 1]
+     >>> predictions = [0, 1]
+     >>> results = glue_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'accuracy': 1.0}
+
+     >>> glue_metric = evaluate.load('glue', 'mrpc')  # 'mrpc' or 'qqp'
+     >>> references = [0, 1]
+     >>> predictions = [0, 1]
+     >>> results = glue_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'accuracy': 1.0, 'f1': 1.0}
+
+     >>> glue_metric = evaluate.load('glue', 'stsb')
+     >>> references = [0., 1., 2., 3., 4., 5.]
+     >>> predictions = [0., 1., 2., 3., 4., 5.]
+     >>> results = glue_metric.compute(predictions=predictions, references=references)
+     >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
+     {'pearson': 1.0, 'spearmanr': 1.0}
+
+     >>> glue_metric = evaluate.load('glue', 'cola')
+     >>> references = [0, 1]
+     >>> predictions = [0, 1]
+     >>> results = glue_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'matthews_correlation': 1.0}
+ """
+
+
+ def simple_accuracy(preds, labels):
+     return float((preds == labels).mean())
+
+
+ def acc_and_f1(preds, labels):
+     acc = simple_accuracy(preds, labels)
+     f1 = float(f1_score(y_true=labels, y_pred=preds))
+     return {
+         "accuracy": acc,
+         "f1": f1,
+     }
+
+
+ def pearson_and_spearman(preds, labels):
+     pearson_corr = float(pearsonr(preds, labels)[0])
+     spearman_corr = float(spearmanr(preds, labels)[0])
+     return {
+         "pearson": pearson_corr,
+         "spearmanr": spearman_corr,
+     }
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Glue(evaluate.EvaluationModule):
+     def _info(self):
+         if self.config_name not in [
+             "sst2",
+             "mnli",
+             "mnli_mismatched",
+             "mnli_matched",
+             "cola",
+             "stsb",
+             "mrpc",
+             "qqp",
+             "qnli",
+             "rte",
+             "wnli",
+             "hans",
+         ]:
+             raise KeyError(
+                 "You should supply a configuration name selected in "
+                 '["sst2", "mnli", "mnli_mismatched", "mnli_matched", '
+                 '"cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans"]'
+             )
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("int64" if self.config_name != "stsb" else "float32"),
+                     "references": datasets.Value("int64" if self.config_name != "stsb" else "float32"),
+                 }
+             ),
+             codebase_urls=[],
+             reference_urls=[],
+             format="numpy",
+         )
+
+     def _compute(self, predictions, references):
+         if self.config_name == "cola":
+             return {"matthews_correlation": matthews_corrcoef(references, predictions)}
+         elif self.config_name == "stsb":
+             return pearson_and_spearman(predictions, references)
+         elif self.config_name in ["mrpc", "qqp"]:
+             return acc_and_f1(predictions, references)
+         elif self.config_name in ["sst2", "mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]:
+             return {"accuracy": simple_accuracy(predictions, references)}
+         else:
+             raise KeyError(
+                 "You should supply a configuration name selected in "
+                 '["sst2", "mnli", "mnli_mismatched", "mnli_matched", '
+                 '"cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans"]'
+             )
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ scipy
+ sklearn