---
license: mit
datasets:
  - skg/toxigen-data
language:
  - en
---

# Model Card for ToxiGen-ConPrompt

## Model Details

### Model Description

ToxiGen-ConPrompt is a pre-trained language model for implicit hate speech detection. Starting from BERT-base-uncased, it is pre-trained on the machine-generated ToxiGen dataset with ConPrompt, a contrastive pre-training approach that uses example statements from a generated statement's origin prompt as positive samples.

- **Model type:** Feature Extraction (see the usage sketch below)
- **Base model:** BERT-base-uncased
- **Pre-training source:** ToxiGen
- **Pre-training approach:** ConPrompt
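
A minimal usage sketch for feature extraction with the Hugging Face `transformers` library follows. The repository id `youngggggg/ToxiGen-ConPrompt` is assumed from this card's location, and mean pooling is just one common choice of sentence representation, not necessarily the one used in the paper.

```python
# Hedged sketch: extract sentence-level features from ToxiGen-ConPrompt.
# Assumptions: repository id "youngggggg/ToxiGen-ConPrompt"; mean pooling
# over non-padding tokens as the sentence representation.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "youngggggg/ToxiGen-ConPrompt"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sentences = ["first example statement", "second example statement"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (batch, seq_len, 768)

# Mean-pool token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
features = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(features.shape)                                 # torch.Size([2, 768])
```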

## Ethical Considerations

### Privacy Issue

Before pre-training, we found that private information, such as URLs, exists in the machine-generated statements in ToxiGen. We anonymized such private information before pre-training to prevent potential harm. The anonymization code we used is available in preprocess_toxigen.ipynb, and we strongly encourage anonymizing private information before using machine-generated data for pre-training.
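
As an illustration of this kind of preprocessing, a minimal sketch is given below. It is not the actual code from preprocess_toxigen.ipynb; the URL regex and the `[URL]` placeholder are illustrative assumptions.

```python
# Hedged sketch: replace URL-like spans with a neutral placeholder before
# pre-training. The pattern and placeholder are assumptions, not the exact
# rules used in preprocess_toxigen.ipynb.
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def anonymize_urls(text: str) -> str:
    """Return `text` with every URL-like span replaced by "[URL]"."""
    return URL_PATTERN.sub("[URL]", text)

print(anonymize_urls("see https://example.com/u/123 for details"))
# -> "see [URL] for details"
```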

### Potential Misuse

The pre-training source of ToxiGen-ConPrompt includes toxic statements. While we deliberately use such statements to pre-train a better model for implicit hate speech detection, the pre-trained model requires careful handling. Here, we describe behaviors that could lead to potential misuse, so that our model serves social good rather than being misused unintentionally or maliciously:

- Since our model was pre-trained with the MLM objective, it may generate toxic statements through its MLM head.
- Since our model learned representations of implicit hate speech, it may retrieve toxic statements similar to a given toxic statement.

While these behaviors can serve social good, e.g., constructing training data for hate speech classifiers (see the retrieval sketch below), they can also be misused.
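
To make the retrieval behavior concrete for the social-good use case, here is a hedged sketch that ranks a pool of statements by cosine similarity to a query embedding. It reuses the assumed repository id and mean pooling from the feature-extraction sketch above.

```python
# Hedged sketch: retrieve the statement most similar to a query, using
# mean-pooled embeddings and cosine similarity. Repository id and pooling
# are assumptions carried over from the earlier example.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "youngggggg/ToxiGen-ConPrompt"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

pool_texts = ["candidate statement A", "candidate statement B"]
scores = F.cosine_similarity(embed(["query statement"]), embed(pool_texts))
print(pool_texts[int(scores.argmax())])  # most similar candidate
```

Such retrieval could, for example, help surface candidate statements when curating training data for hate speech classifiers.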

We strongly emphasize the need for careful handling to prevent unintentional misuse and warn against malicious exploitation of such behaviors.

## Citation

**BibTeX:**

```bibtex
@inproceedings{kim-etal-2023-conprompt,
    title = "{C}on{P}rompt: Pre-training a Language Model with Machine-Generated Data for Implicit Hate Speech Detection",
    author = "Kim, Youngwook and Park, Shinwoo and Namgoong, Youngsoo and Han, Yo-Sub",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.731",
    doi = "10.18653/v1/2023.findings-emnlp.731",
    pages = "10964--10980",
    abstract = "Implicit hate speech detection is a challenging task in text classification since no explicit cues (e.g., swear words) exist in the text. While some pre-trained language models have been developed for hate speech detection, they are not specialized in implicit hate speech. Recently, an implicit hate speech dataset with a massive number of samples has been proposed by controlling machine generation. We propose a pre-training approach, ConPrompt, to fully leverage such machine-generated data. Specifically, given a machine-generated statement, we use example statements of its origin prompt as positive samples for contrastive learning. Through pre-training with ConPrompt, we present ToxiGen-ConPrompt, a pre-trained language model for implicit hate speech detection. We conduct extensive experiments on several implicit hate speech datasets and show the superior generalization ability of ToxiGen-ConPrompt compared to other pre-trained models. Additionally, we empirically show that ConPrompt is effective in mitigating identity term bias, demonstrating that it not only makes a model more generalizable but also reduces unintended bias. We analyze the representation quality of ToxiGen-ConPrompt and show its ability to consider target group and toxicity, which are desirable features in terms of implicit hate speeches.",
}
```