lvwerra committed caa57cc (1 parent: 3ea084c)

Update Space (evaluate main: 940d6dee)

Files changed (3):
  1. README.md +18 -16
  2. perplexity.py +6 -6
  3. requirements.txt +1 -1
README.md CHANGED
@@ -11,10 +11,10 @@ tags:
 - evaluate
 - measurement
 description: >-
-  Perplexity (PPL) can be used for evaluating to what extent a dataset is similar to the distribution of text that a given model was trained on.
-  It is defined as the exponentiated average negative log-likelihood of a sequence.
+  Perplexity (PPL) can be used to evaluate the extent to which a dataset is similar to the distribution of text that a given model was trained on.
+  It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
 
-  For more information, see https://huggingface.co/docs/transformers/perplexity
+  For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).
 ---
 
 # Measurement Card for Perplexity
@@ -22,8 +22,10 @@ description: >-
 ## Measurement Description
 Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence.
 
-As a measurement, it can be used to to evaluate how well a selection of texts matches the distribution of text that the input model was trained on.
-In this case, the model input should be a trained model, and the input texts should be the text to be evaluated.
+As a measurement, it can be used to evaluate how well text matches the distribution of text that the input model was trained on.
+In this case, `model_id` should be the trained model, and `data` should be the text to be evaluated.
+
+This implementation of perplexity is calculated with log base `e`, as in `perplexity = e**(sum(losses) / num_tokenized_tokens)`, following recent convention in deep learning frameworks.
 
 ## Intended Uses
 Dataset analysis or exploration.
@@ -35,16 +37,16 @@ The measurement takes a list of texts as input, as well as the name of the model
 ```python
 from evaluate import load
 perplexity = load("perplexity", module_type= "measurement")
-results = perplexity.compute(input_texts=input_texts, model_id='gpt2')
+results = perplexity.compute(data=input_texts, model_id='gpt2')
 ```
 
 ### Inputs
 - **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
     - This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation here: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
-- **input_texts** (list of str): input text, each separate text snippet is one list entry.
+- **data** (list of str): input text, where each separate text snippet is one list entry.
 - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
 - **add_start_token** (bool): whether to add the start token to the texts, so the perplexity can include the probability of the first word. Defaults to True.
-- **device** (str): device to run on, defaults to 'cuda' when available
+- **device** (str): device to run on, defaults to `cuda` when available
 
 ### Output Values
 This metric outputs a dictionary with the perplexity scores for the text input in the list, and the average perplexity.
@@ -54,7 +56,7 @@ If one of the input texts is longer than the max input length of the model, then
 {'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
 ```
 
-This metric's range is 0 and up. A lower score is better.
+The range of this metric is [0, inf). A lower score is better.
 
 #### Values from Popular Papers
 
@@ -62,17 +64,17 @@ This metric's range is 0 and up. A lower score is better.
 ### Examples
 Calculating perplexity on input_texts defined here:
 ```python
-perplexity = evaluate.load("perplexity", module_type= "measurement")
+perplexity = evaluate.load("perplexity", module_type="measurement")
 input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
 results = perplexity.compute(model_id='gpt2',
                              add_start_token=False,
-                             input_texts=input_texts)
+                             data=input_texts)
 print(list(results.keys()))
 >>>['perplexities', 'mean_perplexity']
 print(round(results["mean_perplexity"], 2))
->>>78.22
+>>>646.74
 print(round(results["perplexities"][0], 2))
->>>11.11
+>>>32.25
 ```
 Calculating perplexity on input_texts loaded in from a dataset:
 ```python
@@ -82,13 +84,13 @@ input_texts = datasets.load_dataset("wikitext",
                                     split="test")["text"][:50]
 input_texts = [s for s in input_texts if s!='']
 results = perplexity.compute(model_id='gpt2',
-                             input_texts=input_texts)
+                             data=input_texts)
 print(list(results.keys()))
 >>>['perplexities', 'mean_perplexity']
 print(round(results["mean_perplexity"], 2))
->>>60.35
+>>>576.76
 print(round(results["perplexities"][0], 2))
->>>81.12
+>>>889.28
 ```
 
 ## Limitations and Bias
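The updated card pins the computation to base `e`, as in `perplexity = e**(sum(losses) / num_tokenized_tokens)`. A rough illustration of that formula (the per-token loss values below are made up for the sketch, not taken from this commit):

```python
import math

# Sketch of the formula stated in the updated README:
#   perplexity = e**(sum(losses) / num_tokenized_tokens)
# The per-token negative log-likelihoods (in nats) are assumed values.
losses = [2.1, 3.4, 0.8, 1.9]
num_tokenized_tokens = len(losses)

perplexity = math.e ** (sum(losses) / num_tokenized_tokens)
print(round(perplexity, 2))  # e**(8.2 / 4) = e**2.05, roughly 7.77
```

With base 2 the same mean loss gives `2**2.05`, roughly 4.14, which is why the example outputs in the card are larger after this change.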
perplexity.py CHANGED
@@ -29,7 +29,7 @@ _CITATION = """\
 
 _DESCRIPTION = """
 Perplexity (PPL) can be used for evaluating to what extent a dataset is similar to the distribution of text that a given model was trained on.
-It is defined as the exponentiated average negative log-likelihood of a sequence.
+It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
 
 For more information, see https://huggingface.co/docs/transformers/perplexity
 """
@@ -64,9 +64,9 @@ Examples:
 >>> print(list(results.keys()))
 ['perplexities', 'mean_perplexity']
 >>> print(round(results["mean_perplexity"], 2))
-78.22
+646.74
 >>> print(round(results["perplexities"][0], 2))
-11.11
+32.25
 
 Example 2:
 >>> from datasets import load_dataset
@@ -78,9 +78,9 @@ Examples:
 >>> print(list(results.keys()))
 ['perplexities', 'mean_perplexity']
 >>> print(round(results["mean_perplexity"], 2)) # doctest: +SKIP
-60.35
+576.76
 >>> print(round(results["perplexities"][0], 2)) # doctest: +SKIP
-81.12
+889.28
 """
 
 
@@ -180,7 +180,7 @@ class Perplexity(evaluate.Measurement):
             shift_labels = labels[..., 1:].contiguous()
             shift_attention_mask_batch = attn_mask[..., 1:].contiguous()
 
-            perplexity_batch = torch.exp2(
+            perplexity_batch = torch.exp(
                 (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)
                 / shift_attention_mask_batch.sum(1)
             )
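The functional change in `perplexity.py` is the switch from `torch.exp2` to `torch.exp`, so the exponent base now matches the natural-log losses that `CrossEntropyLoss` returns. A minimal, self-contained sketch of that expression on dummy tensors (the shapes, seed, and mask below are assumptions for illustration, not the module's real inputs):

```python
import torch
from torch.nn import CrossEntropyLoss

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 6, 10

# Dummy stand-ins for the shifted model outputs and targets used in the module.
shift_logits = torch.randn(batch_size, seq_len, vocab_size)
shift_labels = torch.randint(vocab_size, (batch_size, seq_len))
shift_attention_mask_batch = torch.ones(batch_size, seq_len)  # 1 = real token, 0 = padding

loss_fct = CrossEntropyLoss(reduction="none")  # per-token NLL, in nats

# Mask-weighted mean loss per sequence, as in the diffed expression.
mean_nll = (loss_fct(shift_logits.transpose(1, 2), shift_labels)
            * shift_attention_mask_batch).sum(1) / shift_attention_mask_batch.sum(1)

perplexity_old = torch.exp2(mean_nll)  # previous behaviour: base-2 exponent
perplexity_new = torch.exp(mean_nll)   # this commit: base e, matching the nat-based loss

print(perplexity_old, perplexity_new)
```

For any positive mean loss, `e**loss` is larger than `2**loss`, which is why the doctest values above increased.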
requirements.txt CHANGED
@@ -1,3 +1,3 @@
-git+https://github.com/huggingface/evaluate@288e417936483b1b48ac2dc64a9d9c80ae0ed7e6
+git+https://github.com/huggingface/evaluate@940d6dee3b4a23eabb0c81e4117c9533cd7c458a
 torch
 transformers
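The requirements change pins `evaluate` to the revision named in the commit message. A quick smoke test of the renamed keyword, assuming an environment with that pinned revision plus `torch` and `transformers` installed (it will download `gpt2` on first run):

```python
import evaluate

# Smoke test (assumed environment): the measurement should load and accept the
# renamed `data` keyword used throughout this update.
perplexity = evaluate.load("perplexity", module_type="measurement")
results = perplexity.compute(model_id="gpt2", data=["Hello world", "Perplexity check"])
print(sorted(results.keys()))  # ['mean_perplexity', 'perplexities']
```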