Update the description in README

#2
by krishnap25 - opened
Files changed (1)
  1. README.md +15 -6
README.md CHANGED
@@ -11,11 +11,13 @@ tags:
  - evaluate
  - metric
  description: >-
- MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure.
+ MAUVE is a measure of the gap between two text distributions, e.g., how far the text written by a model is from the distribution of human text, using samples from both distributions.

- MAUVE summarizes both Type I and Type II errors measured softly using Kullback–Leibler (KL) divergences.
+ MAUVE takes values between 0 (completely different distributions) and 1 (identical distributions).

- For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (Neurips, 2021).
+ MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the two distributions in a quantized embedding space of a large language model. It can quantify differences in the quality of generated text based on the size of the model, the decoding algorithm, and the length of the generated text. MAUVE was found to correlate more strongly with human evaluations than baseline metrics for open-ended text generation.
+
+ For details, see the MAUVE paper: https://arxiv.org/abs/2102.01454 (NeurIPS, 2021).

  This metric is a wrapper around the official implementation of MAUVE:
  https://github.com/krishnap25/mauve
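To make the updated description concrete, here is a minimal usage sketch of the metric this README documents, using the standard `evaluate` loading pattern; accessing the score via the `.mauve` attribute follows the official implementation's output object, but treat the exact output fields as an assumption here:

```python
# Minimal sketch: scoring model generations against human references
# with the MAUVE wrapper from the `evaluate` library.
import evaluate

mauve = evaluate.load("mauve")

# Toy inputs; real use should supply hundreds of samples per side
# (see the "Best practices" section added later in this diff).
predictions = ["The quick brown fox jumps over the lazy dog."]
references = ["A quick brown fox leaps over a sleeping dog."]

results = mauve.compute(predictions=predictions, references=references)

# The score lies between 0 (completely different distributions)
# and 1 (identical distributions).
print(results.mauve)
```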
@@ -25,7 +27,7 @@ description: >-
  ## Metric description

- MAUVE is a library built on PyTorch and HuggingFace Transformers to measure the gap between neural text and human text with the eponymous MAUVE measure. It summarizes both Type I and Type II errors measured softly using [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
+ MAUVE is a measure of the gap between neural text and human text. It is computed using the [Kullback–Leibler (KL) divergences](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between the two distributions of text in a quantized embedding space of a large language model. MAUVE can identify differences in quality arising from model sizes and decoding algorithms.

  This metric is a wrapper around the [official implementation](https://github.com/krishnap25/mauve) of MAUVE.
@@ -69,7 +71,6 @@ It also has several optional arguments:
  `verbose`: If `True` (default), running the metric will print running time updates.

  `seed`: random seed to initialize k-means cluster assignments, randomly assigned by default.
-

  ## Output values
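The optional arguments described in the hunk above can be passed as keyword arguments to `compute`; this is a hedged sketch assuming the wrapper forwards `verbose` and `seed` to the official implementation, as the argument list suggests:

```python
import evaluate

mauve = evaluate.load("mauve")

# Placeholder corpora for illustration only.
model_texts = ["Generated continuation one.", "Generated continuation two."]
human_texts = ["Human-written continuation one.", "Human-written continuation two."]

results = mauve.compute(
    predictions=model_texts,
    references=human_texts,
    verbose=False,  # silence the running-time updates
    seed=25,        # fix the k-means cluster-assignment seed for reproducibility
)
print(results.mauve)
```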
@@ -89,7 +90,15 @@ This metric outputs a dictionary with 5 key-value pairs:
  ### Values from popular papers

- The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain. The authors found that bigger models resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.
+ The [original MAUVE paper](https://arxiv.org/abs/2102.01454) reported values ranging from 0.88 to 0.94 for open-ended text generation using a text completion task in the web text domain (computed using 5000 continuations of 1024 tokens each with default hyperparameters). The authors found that bigger models generally resulted in higher MAUVE scores, and that MAUVE is correlated with human judgments.
+
+ ### Best practices
+
+ It is a good idea to use at least 500-1000 samples for each distribution to compute MAUVE.
+
+ MAUVE is unable to identify very small differences between generation settings (e.g., between top-p sampling with p=0.95 versus p=0.96). It is important, therefore, to account for the randomness in the generation (e.g., due to sampling) and within the MAUVE estimation procedure (see the `seed` parameter above).
+
+ It is thus a good idea to obtain generations using multiple random seeds and/or to rerun MAUVE with multiple values of the `seed` parameter.

  ## Examples
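Following the best practices added above, one way to account for the randomness of the MAUVE estimation is to rerun it with several values of `seed` and report the spread; the seed loop and the mean/standard-deviation reporting below are illustrative, not part of the README:

```python
import statistics

import evaluate

mauve = evaluate.load("mauve")

# Placeholder corpora; per the best practices, aim for at least
# 500-1000 samples from each distribution.
model_texts = ["Generated sample one.", "Generated sample two."]
human_texts = ["Human sample one.", "Human sample two."]

# Rerun the estimation with several k-means seeds so that small
# score differences are not over-interpreted.
scores = [
    mauve.compute(predictions=model_texts, references=human_texts, seed=s).mauve
    for s in range(5)
]

print(f"MAUVE: {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
```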
 