--- title: GLUE emoji: 🤗 colorFrom: blue colorTo: red sdk: gradio sdk_version: 3.19.1 app_file: app.py pinned: false tags: - evaluate - metric description: >- GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. --- # Metric Card for GLUE ## Metric description This metric is used to compute the GLUE evaluation metric associated to each [GLUE dataset](https://huggingface.co/datasets/glue). GLUE, the General Language Understanding Evaluation benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. ## How to use There are two steps: (1) loading the GLUE metric relevant to the subset of the GLUE dataset being used for evaluation; and (2) calculating the metric. 1. **Loading the relevant GLUE metric** : the subsets of GLUE are the following: `sst2`, `mnli`, `mnli_mismatched`, `mnli_matched`, `qnli`, `rte`, `wnli`, `cola`,`stsb`, `mrpc`, `qqp`, and `hans`. More information about the different subsets of the GLUE dataset can be found on the [GLUE dataset page](https://huggingface.co/datasets/glue). 2. **Calculating the metric**: the metric takes two inputs : one list with the predictions of the model to score and one lists of references for each translation. ```python from evaluate import load glue_metric = load('glue', 'sst2') references = [0, 1] predictions = [0, 1] results = glue_metric.compute(predictions=predictions, references=references) ``` ## Output values The output of the metric depends on the GLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics: `accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information). `f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0-1 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall. `pearson`: a measure of the linear relationship between two datasets (see [Pearson correlation](https://huggingface.co/metrics/pearsonr) for more information). Its range is between -1 and +1, with 0 implying no correlation, and -1/+1 implying an exact linear relationship. Positive correlations imply that as x increases, so does y, whereas negative correlations imply that as x increases, y decreases. `spearmanr`: a nonparametric measure of the monotonicity of the relationship between two datasets(see [Spearman Correlation](https://huggingface.co/metrics/spearmanr) for more information). `spearmanr` has the same range as `pearson`. `matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The `cola` subset returns `matthews_correlation`, the `stsb` subset returns `pearson` and `spearmanr`, the `mrpc` and `qqp` subsets return both `accuracy` and `f1`, and all other subsets of GLUE return only accuracy. ### Values from popular papers The [original GLUE paper](https://huggingface.co/datasets/glue) reported average scores ranging from 58 to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible). For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/glue). ## Examples Maximal values for the MRPC subset (which outputs `accuracy` and `f1`): ```python from evaluate import load glue_metric = load('glue', 'mrpc') # 'mrpc' or 'qqp' references = [0, 1] predictions = [0, 1] results = glue_metric.compute(predictions=predictions, references=references) print(results) {'accuracy': 1.0, 'f1': 1.0} ``` Minimal values for the STSB subset (which outputs `pearson` and `spearmanr`): ```python from evaluate import load glue_metric = load('glue', 'stsb') references = [0., 1., 2., 3., 4., 5.] predictions = [-10., -11., -12., -13., -14., -15.] results = glue_metric.compute(predictions=predictions, references=references) print(results) {'pearson': -1.0, 'spearmanr': -1.0} ``` Partial match for the COLA subset (which outputs `matthews_correlation`) ```python from evaluate import load glue_metric = load('glue', 'cola') references = [0, 1] predictions = [1, 1] results = glue_metric.compute(predictions=predictions, references=references) results {'matthews_correlation': 0.0} ``` ## Limitations and bias This metric works only with datasets that have the same format as the [GLUE dataset](https://huggingface.co/datasets/glue). While the GLUE dataset is meant to represent "General Language Understanding", the tasks represented in it are not necessarily representative of language understanding, and should not be interpreted as such. Also, while the GLUE subtasks were considered challenging during its creation in 2019, they are no longer considered as such given the impressive progress made since then. A more complex (or "stickier") version of it, called [SuperGLUE](https://huggingface.co/datasets/super_glue), was subsequently created. ## Citation ```bibtex @inproceedings{wang2019glue, title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding}, author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.}, note={In the Proceedings of ICLR.}, year={2019} } ``` ## Further References - [GLUE benchmark homepage](https://gluebenchmark.com/) - [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3?)