# Model Card for ToxiGen-ConPrompt
ToxiGen-ConPrompt is a pre-trained language model for implicit hate speech detection. The model is pre-trained on ToxiGen, a machine-generated dataset for implicit hate speech detection, using our proposed pre-training approach, ConPrompt.
## Model Details
- Base Model: BERT-base-uncased
- Pre-training Source: ToxiGen (https://aclanthology.org/2022.acl-long.234/)
- Pre-training Approach: ConPrompt
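The encoder can be loaded with the Hugging Face transformers library. A minimal sketch follows; the model ID `ToxiGen-ConPrompt` is a hypothetical placeholder, not necessarily the actual repository name.

```python
# Minimal loading sketch. The model ID below is hypothetical; replace it
# with the actual repository name of ToxiGen-ConPrompt.
from transformers import AutoTokenizer, AutoModel

model_id = "ToxiGen-ConPrompt"  # hypothetical model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("A statement to encode.", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] representation
```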
## Ethical Considerations
### Privacy Issue
Before pre-training, we found that some private information, such as URLs, exists in the machine-generated statements in ToxiGen. We anonymized such private information before pre-training to prevent harm. You can refer to the anonymization code we used in preprocess_toxigen.ipynb, and we strongly recommend anonymizing private information before using machine-generated data for pre-training.
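The exact steps we applied are in preprocess_toxigen.ipynb; purely as an illustration, URLs can be masked with a regular-expression pass like the sketch below. The `[URL]` placeholder token and the regex are assumptions for this example, not necessarily what our preprocessing uses.

```python
import re

# Illustrative sketch only: mask URLs with a placeholder token before
# pre-training. The [URL] token and the pattern are assumptions; see
# preprocess_toxigen.ipynb for the anonymization we actually applied.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def anonymize_urls(text: str) -> str:
    return URL_PATTERN.sub("[URL]", text)

print(anonymize_urls("Check www.example.com or https://example.org/page"))
# -> "Check [URL] or [URL]"
```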
### Potential Misuse
The pre-training source of ToxiGen-ConPrompt includes toxic statements. While we use such toxic statements deliberately to pre-train a better model for implicit hate speech detection, the pre-trained model needs careful handling. Here, we describe some behaviors that can lead to potential misuse so that our model is used for social good rather than misused, unintentionally or maliciously.
- As our model was trained with the MLM objective, it might generate toxic statements with its MLM head.
- As our model learned representations of implicit hate speech, it might retrieve toxic statements similar to a given toxic statement.
While these behaviors can serve social good, e.g., constructing training data for hate speech classifiers (see the sketch after this section), they can also be misused.
We strongly emphasize the need for careful handling to prevent unintentional misuse and warn against malicious exploitation of such behaviors.
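As an example of a socially beneficial use of the retrieval behavior, the sketch below mines candidate statements similar to a query, e.g., when constructing training data for a hate speech classifier. It is a minimal sketch under stated assumptions: the model ID is hypothetical, and mean pooling over the last hidden states is an assumed representation choice, not necessarily what ConPrompt uses.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Sketch: retrieve statements similar to a query with the encoder.
# Mean pooling is an assumed representation choice for illustration.
model_id = "ToxiGen-ConPrompt"  # hypothetical model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens
    return F.normalize(pooled, dim=-1)

query_emb = embed(["a query statement"])
corpus_emb = embed(["candidate one", "candidate two"])
scores = query_emb @ corpus_emb.T  # cosine similarity after normalization
ranked = scores.argsort(descending=True)  # most similar candidates first
```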
## Acknowledgements
- We use the ToxiGen dataset as the pre-training source for our model. You can refer to the paper at https://aclanthology.org/2022.acl-long.234/.
- We anonymize private information in the pre-training source following the code from https://github.com/dhfbk/hate-speech-artifacts.
- Our pre-training code is based on the code from https://github.com/princeton-nlp/SimCSE with some modifications.
- We use the code from https://github.com/youngwook06/ImpCon to fine-tune and evaluate our model.