lvwerra (HF staff) committed
Commit 8afb9b6
1 Parent(s): 50f7fee

Update Space (evaluate main: 1bb5f431)

Files changed (4)
  1. README.md +139 -5
  2. app.py +6 -0
  3. regard.py +180 -0
  4. requirements.txt +2 -0
README.md CHANGED
@@ -1,12 +1,146 @@
  ---
  title: Regard
- emoji: 🦀
- colorFrom: indigo
- colorTo: gray
+ emoji: 🤗
+ colorFrom: green
+ colorTo: purple
  sdk: gradio
- sdk_version: 3.3.1
+ sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - measurement
+ description: >-
+   Regard aims to measure language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Measurement Card for Regard
+
+ ## Measurement Description
+
+ The `regard` measurement returns the estimated language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
+
+ It uses a model trained on labelled data from the paper ["The Woman Worked as a Babysitter: On Biases in Language Generation" (EMNLP 2019)](https://arxiv.org/abs/1909.01326).
+
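+ Under the hood, the measurement scores each sentence with the `sasha/regardv3` classifier, which `regard.py` below loads through a 🤗 Transformers pipeline. As a rough sketch (not part of the official API), the classifier can also be queried directly:
+
+ ```python
+ # Minimal sketch: call the underlying regard classifier directly.
+ # Mirrors the pipeline configuration used in regard.py (top_k=4, truncation=True).
+ from transformers import pipeline
+
+ classifier = pipeline("text-classification", model="sasha/regardv3", top_k=4, truncation=True)
+ for sentence_scores in classifier(["xyz are described as mean"]):
+     print({s["label"]: round(s["score"], 2) for s in sentence_scores})
+ ```
+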
+ ## How to Use
+
+ This measurement takes a list of strings as input, plus an optional second list, enabling a comparison of the estimated polarity between the two groups.
+
+ ```python
+ >>> regard = evaluate.load("regard", module_type="measurement")
+ >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+ >>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+ >>> regard.compute(data = group1, references = group2)
+ ```
+
+ ### Inputs
+ - **data** (list of `str`): prediction/candidate sentences, e.g. sentences describing a given demographic group.
+ - **references** (list of `str`) (optional): reference/comparison sentences, e.g. sentences describing a different demographic group to compare against.
+ - **aggregation** (`str`) (optional): determines the type of aggregation performed.
+   If set to `None`, the per-sentence regard scores are returned (or, when `references` are provided, the difference in average regard between the two groups).
+   Otherwise:
+   - `average`: returns the average regard for each category (negative, positive, neutral, other) for each group
+   - `maximum`: returns the maximum regard for each group
+
+ ### Output Values
+
+ **With a single input**:
+
+ `regard`: the regard scores of each string in the input list (if no aggregation is specified)
+ ```python
+ {'neutral': 0.95, 'positive': 0.02, 'negative': 0.02, 'other': 0.01}
+ {'negative': 0.97, 'other': 0.02, 'neutral': 0.01, 'positive': 0.0}
+ ```
+
+ `average_regard`: the average regard for each category (negative, positive, neutral, other) (if `aggregation` = `average`)
+ ```python
+ {'neutral': 0.48, 'positive': 0.01, 'negative': 0.5, 'other': 0.01}
+ ```
+
+ `max_regard`: the maximum regard across all input strings (if `aggregation` = `maximum`)
+ ```python
+ {'neutral': 0.95, 'positive': 0.024, 'negative': 0.972, 'other': 0.019}
+ ```
+
+ **With two lists of inputs**:
+
+ By default, this measurement outputs a dictionary with one value per category (negative, positive, neutral, other), representing the difference in average regard between the two groups.
+
+ ```python
+ {'neutral': 0.35, 'negative': -0.36, 'other': 0.01, 'positive': 0.01}
+ ```
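+
+ For reference, this difference is computed per category as the average regard over the `data` sentences minus the average over the `references` sentences (see `_compute` in `regard.py`). A minimal sketch of that arithmetic, with purely illustrative numbers:
+
+ ```python
+ # Illustrative per-sentence scores (not real model output).
+ data_scores = [{"negative": 0.02, "positive": 0.02, "neutral": 0.95, "other": 0.01},
+                {"negative": 0.97, "positive": 0.00, "neutral": 0.01, "other": 0.02}]
+ ref_scores = [{"negative": 0.85, "positive": 0.01, "neutral": 0.12, "other": 0.02},
+               {"negative": 0.98, "positive": 0.00, "neutral": 0.01, "other": 0.01}]
+
+ def mean_per_label(scores):
+     return {label: sum(s[label] for s in scores) / len(scores) for label in scores[0]}
+
+ data_mean, ref_mean = mean_per_label(data_scores), mean_per_label(ref_scores)
+ regard_difference = {label: data_mean[label] - ref_mean[label] for label in data_mean}
+ ```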
+
+ With the `aggregation='maximum'` option, this measurement will output the maximum regard for each group:
+
+ ```python
+ {'negative': 0.98, 'other': 0.04, 'neutral': 0.03, 'positive': 0.0}
+ ```
+
+ With the `aggregation='average'` option, this measurement will output the average regard for each category (negative, positive, neutral, other):
+
+ ```python
+ {'neutral': 0.37, 'negative': 0.57, 'other': 0.05, 'positive': 0.01}
+ ```
+
+ ### Examples
+
+ Example 1 (single input):
+
+ ```python
+ >>> regard = evaluate.load("regard")
+ >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+ >>> results = regard.compute(data = group1)
+ >>> for d in results['regard']:
+ ...     print({l['label']: round(l['score'],2) for l in d})
+ {'neutral': 0.95, 'positive': 0.02, 'negative': 0.02, 'other': 0.01}
+ {'negative': 0.97, 'other': 0.02, 'neutral': 0.01, 'positive': 0.0}
+ ```
+
+ Example 2 (comparison mode):
+ ```python
+ >>> regard = evaluate.load("regard", "compare")
+ >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+ >>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+ >>> results = regard.compute(data = group1, references = group2)
+ >>> print({k: round(v, 2) for k, v in results['regard_difference'].items()})
+ {'neutral': 0.46, 'positive': 0.01, 'negative': -0.46, 'other': -0.01}
+ ```
+
+ Example 3 (returns the maximum regard score):
+ ```python
+ >>> regard = evaluate.load("regard", "compare")
+ >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+ >>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+ >>> results = regard.compute(data = group1, references = group2, aggregation = "maximum")
+ >>> print({k: round(v, 2) for k, v in results['max_data_regard'].items()})
+ {'neutral': 0.95, 'positive': 0.02, 'negative': 0.97, 'other': 0.02}
+ >>> print({k: round(v, 2) for k, v in results['max_references_regard'].items()})
+ {'negative': 0.98, 'other': 0.04, 'neutral': 0.03, 'positive': 0.0}
+ ```
+
+ Example 4 (returns the average regard score):
+ ```python
+ >>> regard = evaluate.load("regard", "compare")
+ >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+ >>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+ >>> results = regard.compute(data = group1, references = group2, aggregation = "average")
+ >>> print({k: round(v, 2) for k, v in results['average_data_regard'].items()})
+ {'neutral': 0.48, 'positive': 0.01, 'negative': 0.5, 'other': 0.01}
+ >>> print({k: round(v, 2) for k, v in results['average_references_regard'].items()})
+ {'negative': 0.96, 'other': 0.02, 'neutral': 0.02, 'positive': 0.0}
+ ```
+
+ ## Citation(s)
+ @article{https://doi.org/10.48550/arxiv.1909.01326,
+   doi = {10.48550/ARXIV.1909.01326},
+   url = {https://arxiv.org/abs/1909.01326},
+   author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
+   title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
+   publisher = {arXiv},
+   year = {2019}
+ }
+
+ ## Further References
+ - [`nlg-bias` library](https://github.com/ewsheng/nlg-bias/)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("regard")
+ launch_gradio_widget(module)
regard.py ADDED
@@ -0,0 +1,180 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ """ Regard measurement. """
+
+ from collections import defaultdict
+ from operator import itemgetter
+ from statistics import mean
+
+ import datasets
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
+ import evaluate
+
+
+ logger = evaluate.logging.get_logger(__name__)
+
+
+ _CITATION = """
+ @article{https://doi.org/10.48550/arxiv.1909.01326,
+   doi = {10.48550/ARXIV.1909.01326},
+   url = {https://arxiv.org/abs/1909.01326},
+   author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
+   title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
+   publisher = {arXiv},
+   year = {2019}
+ }
+
+ """
+
+ _DESCRIPTION = """\
+ Regard aims to measure language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Compute the regard of the input sentences.
+
+ Args:
+     `data` (list of str): prediction/candidate sentences, e.g. sentences describing a given demographic group.
+     `references` (list of str) (optional): reference/comparison sentences, e.g. sentences describing a different demographic group to compare against.
+     `aggregation` (str) (optional): determines the type of aggregation performed.
+         If set to `None`, the per-sentence regard scores are returned (or, in the `compare` config, the difference in average regard between the two groups).
+         Otherwise:
+             - 'average': returns the average regard for each category (negative, positive, neutral, other) for each group
+             - 'maximum': returns the maximum regard for each group
+
+ Returns:
+     With only `data` as input (default config):
+         `regard`: the regard scores of each string in the input list (if no aggregation is specified)
+         `average_regard`: the average regard for each category (negative, positive, neutral, other) (if `aggregation` = `average`)
+         `max_regard`: the maximum regard across all input strings (if `aggregation` = `maximum`)
+     With `data` and `references` as input (`compare` config):
+         `regard_difference`: the difference between the regard scores for the two groups (if no aggregation is specified)
+         `average_data_regard` and `average_references_regard`: the average regard for each category (negative, positive, neutral, other) (if `aggregation` = `average`)
+         `max_data_regard` and `max_references_regard`: the maximum regard for each group (if `aggregation` = `maximum`)
+
+ Examples:
+
+ Example 1 (single input):
+     >>> regard = evaluate.load("regard")
+     >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+     >>> results = regard.compute(data = group1)
+     >>> for d in results['regard']:
+     ...     print({l['label']: round(l['score'],2) for l in d})
+     {'neutral': 0.95, 'positive': 0.02, 'negative': 0.02, 'other': 0.01}
+     {'negative': 0.97, 'other': 0.02, 'neutral': 0.01, 'positive': 0.0}
+
+ Example 2 (comparison mode):
+     >>> regard = evaluate.load("regard", "compare")
+     >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+     >>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+     >>> results = regard.compute(data = group1, references = group2)
+     >>> print({k: round(v, 2) for k, v in results['regard_difference'].items()})
+     {'neutral': 0.46, 'positive': 0.01, 'negative': -0.46, 'other': -0.01}
+
+ Example 3 (returns the maximum regard score per category):
+     >>> regard = evaluate.load("regard", "compare")
+     >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+     >>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+     >>> results = regard.compute(data = group1, references = group2, aggregation = "maximum")
+     >>> print({k: round(v, 2) for k, v in results['max_data_regard'].items()})
+     {'neutral': 0.95, 'positive': 0.02, 'negative': 0.97, 'other': 0.02}
+     >>> print({k: round(v, 2) for k, v in results['max_references_regard'].items()})
+     {'negative': 0.98, 'other': 0.04, 'neutral': 0.03, 'positive': 0.0}
+
+ Example 4 (returns the average regard score):
+     >>> regard = evaluate.load("regard", "compare")
+     >>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
+     >>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
+     >>> results = regard.compute(data = group1, references = group2, aggregation = "average")
+     >>> print({k: round(v, 2) for k, v in results['average_data_regard'].items()})
+     {'neutral': 0.48, 'positive': 0.01, 'negative': 0.5, 'other': 0.01}
+     >>> print({k: round(v, 2) for k, v in results['average_references_regard'].items()})
+     {'negative': 0.96, 'other': 0.02, 'neutral': 0.02, 'positive': 0.0}
+ """
+
+
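+ # Helper: run the regard classifier over a list of sentences and return both the raw
+ # per-sentence predictions and, for each label (negative/positive/neutral/other), the
+ # list of scores across sentences, used by the average/maximum aggregations below.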
+ def regard(group, regard_classifier):
+     group_scores = defaultdict(list)
+     group_regard = regard_classifier(group)
+     for pred in group_regard:
+         for pred_score in pred:
+             group_scores[pred_score["label"]].append(pred_score["score"])
+     return group_regard, dict(group_scores)
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Regard(evaluate.Measurement):
+     def _info(self):
+         if self.config_name not in ["compare", "default"]:
+             raise KeyError("You should supply a configuration name selected in " '["compare", "default"]')
+         return evaluate.MeasurementInfo(
+             module_type="measurement",
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             # The "compare" config expects both `data` and `references`; the default config only `data`.
+             features=datasets.Features(
+                 {
+                     "data": datasets.Value("string", id="sequence"),
+                     "references": datasets.Value("string", id="sequence"),
+                 }
+                 if self.config_name == "compare"
+                 else {
+                     "data": datasets.Value("string", id="sequence"),
+                 }
+             ),
+             codebase_urls=[],
+             reference_urls=[],
+         )
+
+     def _download_and_prepare(self, dl_manager):
+         # Regard classifier trained on the labelled data from Sheng et al. (2019);
+         # top_k=4 makes the pipeline return all four label scores per sentence.
+         regard_tokenizer = AutoTokenizer.from_pretrained("sasha/regardv3")
+         regard_model = AutoModelForSequenceClassification.from_pretrained("sasha/regardv3")
+         self.regard_classifier = pipeline(
+             "text-classification", model=regard_model, top_k=4, tokenizer=regard_tokenizer, truncation=True
+         )
+
+     def _compute(
+         self,
+         data,
+         references=None,
+         aggregation=None,
+     ):
+         if self.config_name == "compare":
+             # regard() returns (raw per-sentence predictions, per-label score lists)
+             pred_scores, pred_regard = regard(data, self.regard_classifier)
+             ref_scores, ref_regard = regard(references, self.regard_classifier)
+             pred_mean = {k: mean(v) for k, v in pred_regard.items()}
+             pred_max = {k: max(v) for k, v in pred_regard.items()}
+             ref_mean = {k: mean(v) for k, v in ref_regard.items()}
+             ref_max = {k: max(v) for k, v in ref_regard.items()}
+             if aggregation == "maximum":
+                 return {
+                     "max_data_regard": pred_max,
+                     "max_references_regard": ref_max,
+                 }
+             elif aggregation == "average":
+                 return {"average_data_regard": pred_mean, "average_references_regard": ref_mean}
+             else:
+                 # Default for the compare config: per-category difference of the average regard.
+                 return {"regard_difference": {key: pred_mean[key] - ref_mean.get(key, 0) for key in pred_mean}}
+         else:
+             pred_scores, pred_regard = regard(data, self.regard_classifier)
+             pred_mean = {k: mean(v) for k, v in pred_regard.items()}
+             pred_max = {k: max(v) for k, v in pred_regard.items()}
+             if aggregation == "maximum":
+                 return {"max_regard": pred_max}
+             elif aggregation == "average":
+                 return {"average_regard": pred_mean}
+             else:
+                 # Default config without aggregation: return the raw per-sentence scores.
+                 return {"regard": pred_scores}
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ git+https://github.com/huggingface/evaluate.git@1bb5f431d16a789950784660b26c650e1ab0e3cc
+ transformers