lvwerra committed

Commit 38343fb
Parent: 3832f2c

Update Space (evaluate main: 828c6327)

Files changed (4):
  1. README.md +125 -5
  2. app.py +6 -0
  3. mauve.py +150 -0
  4. requirements.txt +6 -0
README.md CHANGED
@@ -1,12 +1,132 @@
  ---
- title: Mauve
- emoji: 👀
- colorFrom: yellow
- colorTo: gray
+ title: MAUVE
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference

# Metric Card for MAUVE

## Metric description

MAUVE is a measure of the gap between neural (machine-generated) text and human text. It summarizes both Type I and Type II errors, measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), and is computed with the eponymous library built on PyTorch and HuggingFace Transformers.

This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.

For more details, consult the [MAUVE paper](https://arxiv.org/abs/2102.01454).

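For intuition, here is a brief sketch of the definition from the paper (notation follows the paper; `P` and `Q` denote the two quantized text distributions being compared, and `c` is the `mauve_scaling_factor` listed under the optional arguments below):

```latex
% Mixtures of the two distributions, indexed by lambda in (0, 1):
R_\lambda = \lambda P + (1 - \lambda) Q

% Divergence curve traced out by soft (exponentiated, scaled) KL divergences:
\mathcal{C}(P, Q) = \left\{ \left( e^{-c \, \mathrm{KL}(Q \,\|\, R_\lambda)}, \; e^{-c \, \mathrm{KL}(P \,\|\, R_\lambda)} \right) : \lambda \in (0, 1) \right\}

% MAUVE is the area under this curve.
\mathrm{MAUVE}(P, Q) = \mathrm{AUC}\big(\mathcal{C}(P, Q)\big)
```
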
## How to use

The metric takes two lists of strings, each string containing tokens separated by spaces: `predictions` (the text generated by the model) and `references` (a reference text for each prediction):

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
```

It also has several optional arguments (see the example after this list):

`num_buckets`: the size of the histogram to quantize P and Q. Options: `auto` (default) or an integer.

`pca_max_data`: the number of data points to use for PCA dimensionality reduction prior to clustering. If `-1`, use all the data. The default is `-1`.

`kmeans_explained_var`: amount of variance of the data to keep in dimensionality reduction by PCA. The default is `0.9`.

`kmeans_num_redo`: number of times to redo k-means clustering (the best objective is kept). The default is `5`.

`kmeans_max_iter`: maximum number of k-means iterations. The default is `500`.

`featurize_model_name`: name of the model from which features are obtained, one of `gpt2`, `gpt2-medium`, `gpt2-large`, or `gpt2-xl`. The default is `gpt2-large`.

`device_id`: device for featurization. Supply a GPU id (e.g. `0` or `3`) to use GPU. If no GPU with this id is found, the metric will use CPU.

`max_text_length`: maximum number of tokens to consider. The default is `1024`.

`divergence_curve_discretization_size`: number of points to consider on the divergence curve. The default is `25`.

`mauve_scaling_factor`: hyperparameter for scaling. The default is `5`.

`verbose`: if `True` (default), running the metric will print running time updates.

`seed`: random seed to initialize k-means cluster assignments, randomly assigned by default.

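For illustration, a minimal sketch of passing some of these arguments to `compute`; the toy strings and the specific values chosen here (the fixed seed, the shortened `max_text_length`) are illustrative, not recommended settings:

```python
from evaluate import load

mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]

# Fix the k-means seed for reproducibility, cap the number of tokens per
# text, and silence the progress output; all other arguments keep their
# defaults (including the gpt2-large featurization model).
mauve_results = mauve.compute(
    predictions=predictions,
    references=references,
    max_text_length=256,
    seed=42,
    verbose=False,
)
print(mauve_results.mauve)
```
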
## Output values

This metric outputs an object with 5 main fields, accessed as attributes (e.g. `mauve_results.mauve`):

`mauve`: MAUVE score, which ranges between 0 and 1. **Larger** values indicate that P and Q are closer.

`frontier_integral`: Frontier Integral, which ranges between 0 and 1. **Smaller** values indicate that P and Q are closer.

`divergence_curve`: a `numpy.ndarray` of shape (m, 2); plot it with `matplotlib` to view the divergence curve.

`p_hist`: a discrete distribution, which is a quantized version of the text distribution `p_text`.

`q_hist`: same as above, but for `q_text`.

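Continuing from the sketch above (so `mauve_results` exists), the divergence curve can be visualized with `matplotlib`, which is assumed to be installed separately:

```python
import matplotlib.pyplot as plt

curve = mauve_results.divergence_curve  # numpy array of shape (m, 2)
plt.plot(curve[:, 0], curve[:, 1])      # one curve point per mixture weight
plt.title("MAUVE divergence curve")
plt.show()
```
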
### Values from popular papers

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.

## Examples

Perfect match between prediction and reference:

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello world", "goodnight moon"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
1.0
```

Partial match between prediction and reference:

```python
from evaluate import load
mauve = load('mauve')
predictions = ["hello world", "goodnight moon"]
references = ["hello there", "general kenobi"]
mauve_results = mauve.compute(predictions=predictions, references=references)
print(mauve_results.mauve)
0.27811372536724027
```

## Limitations and bias

The [original MAUVE paper](https://arxiv.org/abs/2102.01454) did not analyze the inductive biases present in different embedding models, but related work has shown that different kinds of biases exist in many popular generative language models, including GPT-2 (see [Kirk et al., 2021](https://arxiv.org/pdf/2102.04130.pdf), [Abid et al., 2021](https://arxiv.org/abs/2101.05783)). The extent to which these biases can impact the MAUVE score has not been quantified.

Also, calculating the MAUVE metric involves downloading the model from which features are obtained -- the default model, `gpt2-large`, takes over 3GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `gpt2` is 523MB.

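For example, a minimal sketch of switching to the smaller `gpt2` featurization model (the input strings are toy examples; note that scores obtained with different featurization models are not directly comparable):

```python
from evaluate import load

mauve = load('mauve')

# gpt2 is roughly a 500MB download, versus over 3GB for the default gpt2-large.
mauve_results = mauve.compute(
    predictions=["hello world", "goodnight moon"],
    references=["hello there", "general kenobi"],
    featurize_model_name="gpt2",
)
```
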
## Citation

```bibtex
@inproceedings{pillutla-etal:mauve:neurips2021,
title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
booktitle = {NeurIPS},
year = {2021}
}
```

## Further References
- [Official MAUVE implementation](https://github.com/krishnap25/mauve)
- [Hugging Face Tasks - Text Generation](https://huggingface.co/tasks/text-generation)
app.py ADDED
@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("mauve")
launch_gradio_widget(module)
mauve.py ADDED
@@ -0,0 +1,150 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" MAUVE metric from https://github.com/krishnap25/mauve. """

import datasets
import faiss  # Here to have a nice missing dependency error message early on
import numpy  # Here to have a nice missing dependency error message early on
import requests  # Here to have a nice missing dependency error message early on
import sklearn  # Here to have a nice missing dependency error message early on
import tqdm  # Here to have a nice missing dependency error message early on
from mauve import compute_mauve  # From: mauve-text

import evaluate


_CITATION = """\
@inproceedings{pillutla-etal:mauve:neurips2021,
title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
booktitle = {NeurIPS},
year = {2021}
}

"""

_DESCRIPTION = """\
MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure.

MAUVE summarizes both Type I and Type II errors measured softly using Kullback–Leibler (KL) divergences.

For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (Neurips, 2021).

This metric is a wrapper around the official implementation of MAUVE:
https://github.com/krishnap25/mauve
"""

_KWARGS_DESCRIPTION = """
Calculates MAUVE scores between two lists of generated text and reference text.
Args:
    predictions: list of generated text to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of references, one for each prediction. Each
        reference should be a string with tokens separated by spaces.
Optional Args:
    num_buckets: the size of the histogram to quantize P and Q. Options: 'auto' (default) or an integer
    pca_max_data: the number of data points to use for PCA dimensionality reduction prior to clustering. If -1, use all the data. Default -1
    kmeans_explained_var: amount of variance of the data to keep in dimensionality reduction by PCA. Default 0.9
    kmeans_num_redo: number of times to redo k-means clustering (the best objective is kept). Default 5
    kmeans_max_iter: maximum number of k-means iterations. Default 500
    featurize_model_name: name of the model from which features are obtained. Default 'gpt2-large'. Use one of ['gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'].
    device_id: Device for featurization. Supply a GPU id (e.g. 0 or 3) to use GPU. If no GPU with this id is found, use CPU
    max_text_length: maximum number of tokens to consider. Default 1024
    divergence_curve_discretization_size: Number of points to consider on the divergence curve. Default 25
    mauve_scaling_factor: "c" from the paper. Default 5.
    verbose: If True (default), print running time updates
    seed: random seed to initialize k-means cluster assignments.
Returns:
    mauve: MAUVE score, a number between 0 and 1. Larger values indicate that P and Q are closer,
    frontier_integral: Frontier Integral, a number between 0 and 1. Smaller values indicate that P and Q are closer,
    divergence_curve: a numpy.ndarray of shape (m, 2); plot it with matplotlib to view the divergence curve,
    p_hist: a discrete distribution, which is a quantized version of the text distribution p_text,
    q_hist: same as above, but with q_text.
Examples:

    >>> # faiss segfaults in doctest for some reason, so the .compute call is not tested with doctest
    >>> import evaluate
    >>> mauve = evaluate.load('mauve')
    >>> predictions = ["hello there", "general kenobi"]
    >>> references = ["hello there", "general kenobi"]
    >>> out = mauve.compute(predictions=predictions, references=references)  # doctest: +SKIP
    >>> print(out.mauve)  # doctest: +SKIP
    1.0
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Mauve(evaluate.EvaluationModule):
    def _info(self):
        return evaluate.EvaluationModuleInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            homepage="https://github.com/krishnap25/mauve",
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string", id="sequence"),
                    "references": datasets.Value("string", id="sequence"),
                }
            ),
            codebase_urls=["https://github.com/krishnap25/mauve"],
            reference_urls=[
                "https://arxiv.org/abs/2102.01454",
                "https://github.com/krishnap25/mauve",
            ],
        )

    def _compute(
        self,
        predictions,
        references,
        p_features=None,
        q_features=None,
        p_tokens=None,
        q_tokens=None,
        num_buckets="auto",
        pca_max_data=-1,
        kmeans_explained_var=0.9,
        kmeans_num_redo=5,
        kmeans_max_iter=500,
        featurize_model_name="gpt2-large",
        device_id=-1,
        max_text_length=1024,
        divergence_curve_discretization_size=25,
        mauve_scaling_factor=5,
        verbose=True,
        seed=25,
    ):
        out = compute_mauve(
            p_text=predictions,
            q_text=references,
            p_features=p_features,
            q_features=q_features,
            p_tokens=p_tokens,
            q_tokens=q_tokens,
            num_buckets=num_buckets,
            pca_max_data=pca_max_data,
            kmeans_explained_var=kmeans_explained_var,
            kmeans_num_redo=kmeans_num_redo,
            kmeans_max_iter=kmeans_max_iter,
            featurize_model_name=featurize_model_name,
            device_id=device_id,
            max_text_length=max_text_length,
            divergence_curve_discretization_size=divergence_curve_discretization_size,
            mauve_scaling_factor=mauve_scaling_factor,
            verbose=verbose,
            seed=seed,
        )
        return out
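As context for the `_compute` wrapper above: `compute_mauve` from the `mauve-text` package can also be called directly with precomputed feature vectors, skipping the GPT-2 featurization step. A minimal sketch, not part of the committed module, using random arrays with illustrative shapes in place of real hidden states:

```python
import numpy as np
from mauve import compute_mauve

rng = np.random.default_rng(0)
# Stand-ins for featurized generated (p) and human (q) text; in practice
# these would be hidden states from a model such as gpt2-large.
p_features = rng.normal(size=(100, 1024))
q_features = rng.normal(size=(100, 1024))

out = compute_mauve(p_features=p_features, q_features=q_features, verbose=False)
print(out.mauve, out.frontier_integral)
```
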
requirements.txt ADDED
@@ -0,0 +1,6 @@
# TODO: fix github to release
git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
datasets~=2.0
faiss-cpu
sklearn
mauve-text