lvwerra (HF staff) committed
Commit a49ed8a
1 Parent(s): acea538

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +112 -5
  2. app.py +6 -0
  3. glue.py +156 -0
  4. requirements.txt +5 -0
README.md CHANGED
@@ -1,12 +1,119 @@
  ---
- title: Glue
- emoji: 📚
- colorFrom: red
- colorTo: pink
+ title: GLUE
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
+ # Metric Card for GLUE
+
+ ## Metric description
+ This metric is used to compute the GLUE evaluation metric associated with each subset of the [GLUE dataset](https://huggingface.co/datasets/glue).
+
+ GLUE, the General Language Understanding Evaluation benchmark, is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
+
+ ## How to use
+
+ There are two steps: (1) loading the GLUE metric relevant to the subset of the GLUE dataset being used for evaluation; and (2) calculating the metric.
+
+ 1. **Loading the relevant GLUE metric**: the subsets of GLUE are the following: `sst2`, `mnli`, `mnli_mismatched`, `mnli_matched`, `qnli`, `rte`, `wnli`, `cola`, `stsb`, `mrpc`, `qqp`, and `hans`.
+
+ More information about the different subsets of the GLUE dataset can be found on the [GLUE dataset page](https://huggingface.co/datasets/glue).
+
+ 2. **Calculating the metric**: the metric takes two inputs: a list of the model's predictions to score and a list of references, one per prediction.
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'sst2')
+ references = [0, 1]
+ predictions = [0, 1]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ ```
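+
+ With the toy predictions and references above, the `sst2` configuration returns accuracy alone:
+
+ ```python
+ print(results)
+ {'accuracy': 1.0}
+ ```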
+ ## Output values
+
+ The output of the metric depends on the GLUE subset chosen: it is a dictionary containing one or several of the following metrics:
+
+ `accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).
+
+ `f1`: the harmonic mean of precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1: its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.
+
+ `pearson`: a measure of the linear relationship between two datasets (see [Pearson correlation](https://huggingface.co/metrics/pearsonr) for more information). Its range is between -1 and +1, with 0 implying no correlation, and -1/+1 implying an exact linear relationship. Positive correlations imply that as x increases, so does y, whereas negative correlations imply that as x increases, y decreases.
+
+ `spearmanr`: a nonparametric measure of the monotonicity of the relationship between two datasets (see [Spearman correlation](https://huggingface.co/metrics/spearmanr) for more information). `spearmanr` has the same range as `pearson`.
+
+ `matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction, and -1 an inverse prediction.
+
+ The `cola` subset returns `matthews_correlation`, the `stsb` subset returns `pearson` and `spearmanr`, the `mrpc` and `qqp` subsets return both `accuracy` and `f1`, and all other subsets of GLUE return only `accuracy`.
+
+ ### Values from popular papers
+ The [original GLUE paper](https://arxiv.org/abs/1804.07461) reported average scores ranging from 58 to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
+
+ For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/glue).
+
+ ## Examples
+
+ Maximal values for the MRPC subset (which outputs `accuracy` and `f1`):
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'mrpc')  # 'mrpc' or 'qqp'
+ references = [0, 1]
+ predictions = [0, 1]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ print(results)
+ {'accuracy': 1.0, 'f1': 1.0}
+ ```
+
+ Minimal values for the STSB subset (which outputs `pearson` and `spearmanr`):
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'stsb')
+ references = [0., 1., 2., 3., 4., 5.]
+ predictions = [-10., -11., -12., -13., -14., -15.]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ print(results)
+ {'pearson': -1.0, 'spearmanr': -1.0}
+ ```
+
+ Partial match for the COLA subset (which outputs `matthews_correlation`):
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'cola')
+ references = [0, 1]
+ predictions = [1, 1]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ print(results)
+ {'matthews_correlation': 0.0}
+ ```
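+
+ As a further sketch in the same style, with toy labels rather than real model outputs: a partial match for an accuracy-only subset such as QNLI (the same pattern applies to `sst2`, `mnli`, `rte`, `wnli`, and the other subsets that return only `accuracy`):
+
+ ```python
+ from evaluate import load
+ glue_metric = load('glue', 'qnli')
+ references = [0, 1]
+ predictions = [1, 1]
+ results = glue_metric.compute(predictions=predictions, references=references)
+ print(results)
+ {'accuracy': 0.5}
+ ```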
+
+ ## Limitations and bias
+ This metric works only with datasets that have the same format as the [GLUE dataset](https://huggingface.co/datasets/glue).
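+
+ Concretely, `predictions` and `references` are flat lists of labels: integers for the classification subsets and floats for `stsb`. A minimal sketch of both cases, with made-up toy values:
+
+ ```python
+ from evaluate import load
+
+ # Classification subsets (here 'rte') expect integer class labels.
+ rte_metric = load('glue', 'rte')
+ rte_metric.compute(predictions=[0, 1, 1], references=[0, 1, 0])
+
+ # The 'stsb' subset expects float similarity scores.
+ stsb_metric = load('glue', 'stsb')
+ stsb_metric.compute(predictions=[0.5, 2.7, 4.1], references=[0.0, 3.0, 5.0])
+ ```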
+
+ While the GLUE dataset is meant to represent "General Language Understanding", the tasks it contains are not necessarily representative of language understanding in general, and should not be interpreted as such.
+
+ Also, while the GLUE subtasks were considered challenging when the benchmark was created in 2019, they are no longer considered as such given the impressive progress made since then. A more complex (or "stickier") version of it, called [SuperGLUE](https://huggingface.co/datasets/super_glue), was subsequently created.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{wang2019glue,
+   title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
+   author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
+   note={In the Proceedings of ICLR.},
+   year={2019}
+ }
+ ```
+
+ ## Further References
+
+ - [GLUE benchmark homepage](https://gluebenchmark.com/)
+ - [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("glue")
+ launch_gradio_widget(module)
glue.py ADDED
@@ -0,0 +1,156 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """GLUE benchmark metric."""
+
+ import datasets
+ from scipy.stats import pearsonr, spearmanr
+ from sklearn.metrics import f1_score, matthews_corrcoef
+
+ import evaluate
+
+
+ _CITATION = """\
+ @inproceedings{wang2019glue,
+     title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
+     author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
+     note={In the Proceedings of ICLR.},
+     year={2019}
+ }
+ """
+
+ _DESCRIPTION = """\
+ GLUE, the General Language Understanding Evaluation benchmark
+ (https://gluebenchmark.com/) is a collection of resources for training,
+ evaluating, and analyzing natural language understanding systems.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Compute the GLUE evaluation metric associated with each GLUE dataset.
+ Args:
+     predictions: list of predictions to score.
+         Each prediction is an integer label (or a float for the stsb subset).
+     references: list of references, one per prediction, of the same type.
+ Returns: depending on the GLUE subset, one or several of:
+     "accuracy": Accuracy
+     "f1": F1 score
+     "pearson": Pearson Correlation
+     "spearmanr": Spearman Correlation
+     "matthews_correlation": Matthews Correlation
+ Examples:
+
+     >>> glue_metric = evaluate.load('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
+     >>> references = [0, 1]
+     >>> predictions = [0, 1]
+     >>> results = glue_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'accuracy': 1.0}
+
+     >>> glue_metric = evaluate.load('glue', 'mrpc')  # 'mrpc' or 'qqp'
+     >>> references = [0, 1]
+     >>> predictions = [0, 1]
+     >>> results = glue_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'accuracy': 1.0, 'f1': 1.0}
+
+     >>> glue_metric = evaluate.load('glue', 'stsb')
+     >>> references = [0., 1., 2., 3., 4., 5.]
+     >>> predictions = [0., 1., 2., 3., 4., 5.]
+     >>> results = glue_metric.compute(predictions=predictions, references=references)
+     >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
+     {'pearson': 1.0, 'spearmanr': 1.0}
+
+     >>> glue_metric = evaluate.load('glue', 'cola')
+     >>> references = [0, 1]
+     >>> predictions = [0, 1]
+     >>> results = glue_metric.compute(predictions=predictions, references=references)
+     >>> print(results)
+     {'matthews_correlation': 1.0}
+ """
+
+
+ def simple_accuracy(preds, labels):
+     return float((preds == labels).mean())
+
+
+ def acc_and_f1(preds, labels):
+     acc = simple_accuracy(preds, labels)
+     f1 = float(f1_score(y_true=labels, y_pred=preds))
+     return {
+         "accuracy": acc,
+         "f1": f1,
+     }
+
+
+ def pearson_and_spearman(preds, labels):
+     pearson_corr = float(pearsonr(preds, labels)[0])
+     spearman_corr = float(spearmanr(preds, labels)[0])
+     return {
+         "pearson": pearson_corr,
+         "spearmanr": spearman_corr,
+     }
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Glue(evaluate.EvaluationModule):
+     def _info(self):
+         if self.config_name not in [
+             "sst2",
+             "mnli",
+             "mnli_mismatched",
+             "mnli_matched",
+             "cola",
+             "stsb",
+             "mrpc",
+             "qqp",
+             "qnli",
+             "rte",
+             "wnli",
+             "hans",
+         ]:
+             raise KeyError(
+                 "You should supply a configuration name selected in "
+                 '["sst2", "mnli", "mnli_mismatched", "mnli_matched", '
+                 '"cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans"]'
+             )
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("int64" if self.config_name != "stsb" else "float32"),
+                     "references": datasets.Value("int64" if self.config_name != "stsb" else "float32"),
+                 }
+             ),
+             codebase_urls=[],
+             reference_urls=[],
+             format="numpy",
+         )
+
+     def _compute(self, predictions, references):
+         if self.config_name == "cola":
+             return {"matthews_correlation": matthews_corrcoef(references, predictions)}
+         elif self.config_name == "stsb":
+             return pearson_and_spearman(predictions, references)
+         elif self.config_name in ["mrpc", "qqp"]:
+             return acc_and_f1(predictions, references)
+         elif self.config_name in ["sst2", "mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]:
+             return {"accuracy": simple_accuracy(predictions, references)}
+         else:
+             raise KeyError(
+                 "You should supply a configuration name selected in "
+                 '["sst2", "mnli", "mnli_mismatched", "mnli_matched", '
+                 '"cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans"]'
+             )
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ scipy
+ sklearn