lvwerra HF staff committed on
Commit
726b567
·
1 Parent(s): 9dc5134

Update Space (evaluate main: 940d6dee)

Files changed (3)
  1. README.md +12 -11
  2. perplexity.py +4 -4
  3. requirements.txt +1 -1
README.md CHANGED
@@ -12,9 +12,9 @@ tags:
12
  - metric
13
  description: >-
14
  Perplexity (PPL) is one of the most common metrics for evaluating language models.
15
- It is defined as the exponentiated average negative log-likelihood of a sequence.
16
 
17
- For more information, see https://huggingface.co/docs/transformers/perplexity
18
  ---
19
 
20
  # Metric Card for Perplexity
@@ -22,10 +22,11 @@ description: >-
22
  ## Metric Description
23
  Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence.
24
 
25
- As a metric, it can be used to evaluate how well the model has learned the distribution of the text it was trained on
26
 
 
27
 
28
- In this case, the model input should be the trained model to be evaluated, and the input texts should be the text that the model was trained on.
29
 
30
  ## Intended Uses
31
  Any language generation task.
@@ -43,10 +44,10 @@ results = perplexity.compute(predictions=predictions, model_id='gpt2')
43
  ### Inputs
44
  - **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
45
  - This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation here: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
46
- - **predictions** (list of str): input text, each separate text snippet is one list entry.
47
  - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
48
  - **add_start_token** (bool): whether to add the start token to the texts, so the perplexity can include the probability of the first word. Defaults to True.
49
- - **device** (str): device to run on, defaults to 'cuda' when available
50
 
51
  ### Output Values
52
  This metric outputs a dictionary with the perplexity scores for the text input in the list, and the average perplexity.
@@ -56,7 +57,7 @@ If one of the input texts is longer than the max input length of the model, then
56
  {'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
57
  ```
58
 
59
- This metric's range is 0 and up. A lower score is better.
60
 
61
  #### Values from Popular Papers
62
 
@@ -72,9 +73,9 @@ results = perplexity.compute(model_id='gpt2',
72
  print(list(results.keys()))
73
  >>>['perplexities', 'mean_perplexity']
74
  print(round(results["mean_perplexity"], 2))
75
- >>>78.22
76
  print(round(results["perplexities"][0], 2))
77
- >>>11.11
78
  ```
79
  Calculating perplexity on predictions loaded in from a dataset:
80
  ```python
@@ -88,9 +89,9 @@ results = perplexity.compute(model_id='gpt2',
88
  print(list(results.keys()))
89
  >>>['perplexities', 'mean_perplexity']
90
  print(round(results["mean_perplexity"], 2))
91
- >>>60.35
92
  print(round(results["perplexities"][0], 2))
93
- >>>81.12
94
  ```
95
 
96
  ## Limitations and Bias
 
12
  - metric
13
  description: >-
14
  Perplexity (PPL) is one of the most common metrics for evaluating language models.
15
+ It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
16
 
17
+ For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).
18
  ---
19
 
20
  # Metric Card for Perplexity
 
22
  ## Metric Description
23
  Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence.
24
 
25
+ As a metric, it can be used to evaluate how well the model has learned the distribution of the text it was trained on.
26
 
27
+ In this case, `model_id` should be the trained model to be evaluated, and the input texts should be the text that the model was trained on.
28
 
29
+ This implementation of perplexity is calculated with log base `e`, as in `perplexity = e**(sum(losses) / num_tokenized_tokens)`, following recent convention in deep learning frameworks.
30
 
31
  ## Intended Uses
32
  Any language generation task.
 
44
  ### Inputs
45
  - **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
46
  - This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation here: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
47
+ - **predictions** (list of str): input text, where each separate text snippet is one list entry.
48
  - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
49
  - **add_start_token** (bool): whether to add the start token to the texts, so the perplexity can include the probability of the first word. Defaults to True.
50
+ - **device** (str): device to run on, defaults to `cuda` when available
51
 
52
  ### Output Values
53
  This metric outputs a dictionary with the perplexity scores for the text input in the list, and the average perplexity.
 
57
  {'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
58
  ```
59
 
60
+ The range of this metric is [0, inf). A lower score is better.
61
 
62
  #### Values from Popular Papers
63
 
 
73
  print(list(results.keys()))
74
  >>>['perplexities', 'mean_perplexity']
75
  print(round(results["mean_perplexity"], 2))
76
+ >>>646.74
77
  print(round(results["perplexities"][0], 2))
78
+ >>>32.25
79
  ```
80
  Calculating perplexity on predictions loaded in from a dataset:
81
  ```python
 
89
  print(list(results.keys()))
90
  >>>['perplexities', 'mean_perplexity']
91
  print(round(results["mean_perplexity"], 2))
92
+ >>>576.76
93
  print(round(results["perplexities"][0], 2))
94
+ >>>889.28
95
  ```
96
 
97
  ## Limitations and Bias
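
Reviewer note on the README change: the formula quoted in the new text, `perplexity = e**(sum(losses) / num_tokenized_tokens)`, can be checked by hand on a toy input. The sketch below is plain Python rather than the metric's own code. The `perplexity` helper and the token probabilities are hypothetical, chosen only so the arithmetic is easy to follow. It also shows that the base does not matter as long as the logarithm and the exponent use the same one.

```python
import math

def perplexity(token_probs, base=math.e):
    """Perplexity from per-token probabilities p(x_i | x_<i).

    Hypothetical helper for illustration; not part of the evaluate library.
    """
    # Negative log-likelihood of each token, in the chosen base.
    losses = [-math.log(p, base) for p in token_probs]
    # Exponentiate the average loss with the SAME base.
    return base ** (sum(losses) / len(losses))

# Made-up probabilities a model might assign to four tokens.
probs = [0.5, 0.25, 0.125, 0.25]

ppl_e = perplexity(probs)          # natural log, base e
ppl_2 = perplexity(probs, base=2)  # log base 2, exponent base 2

print(ppl_e, ppl_2)  # both are approximately 4.0
```

Both calls give the inverse geometric mean of the token probabilities, so a consistent choice of base leaves the score unchanged.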
perplexity.py CHANGED
@@ -29,7 +29,7 @@ _CITATION = """\
29
 
30
  _DESCRIPTION = """
31
  Perplexity (PPL) is one of the most common metrics for evaluating language models.
32
- It is defined as the exponentiated average negative log-likelihood of a sequence.
33
 
34
  For more information, see https://huggingface.co/docs/transformers/perplexity
35
  """
@@ -78,9 +78,9 @@ Examples:
78
  >>> print(list(results.keys()))
79
  ['perplexities', 'mean_perplexity']
80
  >>> print(round(results["mean_perplexity"], 2)) # doctest: +SKIP
81
- 60.35
82
  >>> print(round(results["perplexities"][0], 2)) # doctest: +SKIP
83
- 81.12
84
  """
85
 
86
 
@@ -180,7 +180,7 @@ class Perplexity(evaluate.Metric):
180
  shift_labels = labels[..., 1:].contiguous()
181
  shift_attention_mask_batch = attn_mask[..., 1:].contiguous()
182
 
183
- perplexity_batch = torch.exp2(
184
  (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)
185
  / shift_attention_mask_batch.sum(1)
186
  )
 
29
 
30
  _DESCRIPTION = """
31
  Perplexity (PPL) is one of the most common metrics for evaluating language models.
32
+ It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.
33
 
34
  For more information, see https://huggingface.co/docs/transformers/perplexity
35
  """
 
78
  >>> print(list(results.keys()))
79
  ['perplexities', 'mean_perplexity']
80
  >>> print(round(results["mean_perplexity"], 2)) # doctest: +SKIP
81
+ 576.76
82
  >>> print(round(results["perplexities"][0], 2)) # doctest: +SKIP
83
+ 889.28
84
  """
85
 
86
 
 
180
  shift_labels = labels[..., 1:].contiguous()
181
  shift_attention_mask_batch = attn_mask[..., 1:].contiguous()
182
 
183
+ perplexity_batch = torch.exp(
184
  (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)
185
  / shift_attention_mask_batch.sum(1)
186
  )
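
Reviewer note on the `torch.exp2` to `torch.exp` change: assuming `loss_fct` is a cross-entropy loss that returns per-token values in nats (natural log), as `torch.nn.CrossEntropyLoss` does, exponentiating with base 2 mixes two bases and shrinks every score to `ppl ** ln(2)`. The sketch below mirrors the masked average from the hunk in plain Python so it runs without `torch`. The `masked_perplexities` helper and the loss values are hypothetical.

```python
import math

def masked_perplexities(token_losses, attention_mask, exp_fn=math.exp):
    """Per-sequence perplexity from per-token losses, ignoring padded positions.

    Hypothetical stand-in for the tensor expression in the diff:
    exp((loss * mask).sum(1) / mask.sum(1)).
    """
    scores = []
    for losses, mask in zip(token_losses, attention_mask):
        total = sum(l * m for l, m in zip(losses, mask))  # padded tokens contribute 0
        scores.append(exp_fn(total / sum(mask)))          # average over real tokens only
    return scores

# Made-up per-token losses in nats; the last position of row 0 is padding.
losses = [
    [math.log(4), math.log(4), 9.9],
    [math.log(2), math.log(32), math.log(8)],
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]

new_scores = masked_perplexities(losses, mask)                             # base e, as after this commit
old_scores = masked_perplexities(losses, mask, exp_fn=lambda x: 2.0 ** x)  # base 2, as before

print(new_scores)  # approximately [4.0, 8.0]
print(old_scores)  # each entry equals the matching new score raised to ln(2)
```

The same relation fits the first README example: the new per-text value 32.25 raised to the power ln(2) is about 11.11, the value shown before the change. The `mean_perplexity` figures are averages over several texts, so they do not follow this relation directly.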
requirements.txt CHANGED
@@ -1,4 +1,4 @@
1
- git+https://github.com/huggingface/evaluate@288e417936483b1b48ac2dc64a9d9c80ae0ed7e6
2
  torch
3
  torch
4
  transformers
 
1
+ git+https://github.com/huggingface/evaluate@940d6dee3b4a23eabb0c81e4117c9533cd7c458a
2
  torch
3
  torch
4
  transformers