---
title: Toxicity
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
  The toxicity measurement aims to quantify the toxicity of the input texts using a pretrained hate speech classification model.
---

# Measurement Card for Toxicity

## Measurement description
The toxicity measurement aims to quantify the toxicity of the input texts using a pretrained hate speech classification model.

## How to use

The default model used is [roberta-hate-speech-dynabench-r4](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target). In this model, ‘hate’ is defined as “abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation.” Definitions used by other classifiers may vary.
When loading the measurement, you can also specify another model:
```python
toxicity = evaluate.load("toxicity", 'DaNLP/da-electra-hatespeech-detection', module_type="measurement")
```
The model should be compatible with the `AutoModelForSequenceClassification` class.
For more information, see [the AutoModelForSequenceClassification documentation](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForSequenceClassification).
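If you are unsure whether a checkpoint will work, a quick sanity check (a minimal sketch, using the same DaNLP checkpoint as the examples below) is to load it with the Auto classes directly:
```python
>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer
>>> # If both of these load without error, the checkpoint exposes a sequence-classification
>>> # head and a tokenizer, so it can be passed to evaluate.load("toxicity", ...)
>>> model = AutoModelForSequenceClassification.from_pretrained("DaNLP/da-electra-hatespeech-detection")
>>> tokenizer = AutoTokenizer.from_pretrained("DaNLP/da-electra-hatespeech-detection")
```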

Args:
    `predictions` (list of str): prediction/candidate sentences
    `toxic_label` (str) (optional): the toxic label that you want to detect, depending on the labels that the model has been trained on.
        This can be found by inspecting the `id2label` mapping of the model config, e.g.:
        ```python
        >>> from transformers import AutoModelForSequenceClassification
        >>> model = AutoModelForSequenceClassification.from_pretrained("DaNLP/da-electra-hatespeech-detection")
        >>> model.config.id2label
        {0: 'not offensive', 1: 'offensive'}
        ```
        In this case, the `toxic_label` would be `offensive`.
    `aggregation` (optional): determines the type of aggregation performed on the data. If set to `None`, the scores for each prediction are returned.
    Otherwise:
      - 'maximum': returns the maximum toxicity over all predictions
      - 'ratio': returns the percentage of predictions with a toxicity score above a certain threshold (see the sketch after this argument list).
    `threshold` (float) (optional): the toxicity score threshold used for calculating the 'ratio' aggregation, described above. The default threshold is 0.5, based on the one established by [RealToxicityPrompts](https://arxiv.org/abs/2009.11462).
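As a quick sketch of how `aggregation` and `threshold` interact (the 0.75 cutoff below is purely illustrative; given the scores shown in Example 1 further down, 0.0002 and 0.8564, one of the two sentences exceeds it):
```python
>>> toxicity = evaluate.load("toxicity", module_type="measurement")
>>> input_texts = ["she went to the library", "he is a douchebag"]
>>> # 'ratio' counts the fraction of predictions whose toxicity score is above `threshold`
>>> results = toxicity.compute(predictions=input_texts, aggregation="ratio", threshold=0.75)
>>> print(results["toxicity_ratio"])
0.5
```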

## Output values

    `toxicity`: a list of toxicity scores, one for each sentence in `predictions` (default behavior)

    `max_toxicity`: the maximum toxicity over all scores (if `aggregation` = `maximum`)

    `toxicity_ratio`: the percentage of predictions with a toxicity score >= `threshold` (0.5 by default) (if `aggregation` = `ratio`)


### Values from popular papers


## Examples
    Example 1 (default behavior):
```python
>>> toxicity = evaluate.load("toxicity", module_type="measurement")
>>> input_texts = ["she went to the library", "he is a douchebag"]
>>> results = toxicity.compute(predictions=input_texts)
>>> print([round(s, 4) for s in results["toxicity"]])
[0.0002, 0.8564]
```
    Example 2 (returns ratio of toxic sentences):
```python
>>> toxicity = evaluate.load("toxicity", module_type="measurement")
>>> input_texts = ["she went to the library", "he is a douchebag"]
>>> results = toxicity.compute(predictions=input_texts, aggregation="ratio")
>>> print(results['toxicity_ratio'])
0.5
```
    Example 3 (returns the maximum toxicity score):
```python
>>> toxicity = evaluate.load("toxicity", module_type="measurement")
>>> input_texts = ["she went to the library", "he is a douchebag"]
>>> results = toxicity.compute(predictions=input_texts, aggregation="maximum")
>>> print(round(results['max_toxicity'], 4))
0.8564
```
    Example 4 (uses a custom model):
```python
>>> toxicity = evaluate.load("toxicity", 'DaNLP/da-electra-hatespeech-detection')
>>> input_texts = ["she went to the library", "he is a douchebag"]
>>> results = toxicity.compute(predictions=input_texts, toxic_label='offensive')
>>> print([round(s, 4) for s in results["toxicity"]])
[0.0176, 0.0203]
```



## Citation

```bibtex
@inproceedings{vidgen2021lftw,
  title={Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection},
  author={Bertie Vidgen and Tristan Thrush and Zeerak Waseem and Douwe Kiela},
  booktitle={ACL},
  year={2021}
}
```

```bibtex
@article{gehman2020realtoxicityprompts,
  title={RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models},
  author={Gehman, Samuel and Gururangan, Suchin and Sap, Maarten and Choi, Yejin and Smith, Noah A},
  journal={arXiv preprint arXiv:2009.11462},
  year={2020}
}
```

## Further References