---
license: mit
datasets:
- skg/toxigen-data
language:
- en
---
# Model Card for ToxiGen-ConPrompt
## Model Details
### Model Description
ToxiGen-ConPrompt is a pre-trained language model for implicit hate speech detection. It is obtained by pre-training BERT-base-uncased on the machine-generated ToxiGen dataset with ConPrompt, a contrastive pre-training approach that uses the example statements of a generation prompt as positive samples for the statements generated from that prompt.
- **Model type:** Feature Extraction
- **Base Model:** BERT-base-uncased
- **Pre-training Source:** ToxiGen
- **Pre-training Approach:** ConPrompt
- **ConPrompt Repository:** https://github.com/youngwook06/ConPrompt
- **ConPrompt Paper:** https://aclanthology.org/2023.findings-emnlp.731/
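
Since the base model is BERT-base-uncased, the checkpoint should load with the standard `transformers` feature-extraction workflow. A minimal sketch follows; the hub model ID is an assumption based on this repository's name:

```python
from transformers import AutoModel, AutoTokenizer

# Model ID assumed from this repository's name; adjust if the hub path differs.
model_id = "youngggggg/ToxiGen-ConPrompt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Some input sentence.", return_tensors="pt")
outputs = model(**inputs)

# Sentence-level feature vector from the [CLS] token (shape: [1, 768]).
cls_embedding = outputs.last_hidden_state[:, 0]
```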
## Ethical Considerations
### Privacy Issue
Before pre-training, we found that private information such as URLs exists in the machine-generated statements in ToxiGen.
We anonymized such private information before pre-training to prevent potential harm.
The anonymization code we used is available in preprocess_toxigen.ipynb; we strongly encourage anonymizing private information before using any machine-generated data for pre-training.
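
preprocess_toxigen.ipynb remains the authoritative reference for the exact procedure we applied; purely as an illustration, URL anonymization can be as simple as a regex substitution along these lines (the pattern and the `[URL]` placeholder are assumptions, not the notebook's code):

```python
import re

# Assumed pattern and placeholder for illustration; see preprocess_toxigen.ipynb
# for the anonymization actually applied before pre-training.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def anonymize_urls(text: str) -> str:
    """Replace URLs with a placeholder so no live links enter the corpus."""
    return URL_PATTERN.sub("[URL]", text)

print(anonymize_urls("see https://example.com/some/page for details"))
# -> "see [URL] for details"
```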
### Potential Misuse
The pre-training source of ToxiGen-ConPrompt includes toxic statements.
While we deliberately use these toxic statements to pre-train a better model for implicit hate speech detection, the resulting model needs careful handling.
Here, we state some behaviors that can lead to potential misuse, so that our model is used for social good rather than misused unintentionally or maliciously:
- As our model was trained with the MLM objective, it might generate toxic statements with its MLM head.
- As our model learned representations of implicit hate speech, it might retrieve similar toxic statements given a toxic query statement.
While these behaviors can serve the social good, e.g., constructing training data for hate speech classifiers, they can also be misused.
**We strongly emphasize the need for careful handling to prevent unintentional misuse and warn against malicious exploitation of such behaviors.**
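
For the beneficial direction above, e.g., retrieving statements similar to known hate speech when curating classifier training data, a similarity search over the model's representations might look like the following sketch. Mean pooling is an assumption here (one common choice), not necessarily the pooling used in the ConPrompt paper:

```python
import torch
import torch.nn.functional as F

# tokenizer and model as loaded in the sketch under "Model Details" above.

def embed(texts, tokenizer, model):
    """Mean-pool token embeddings (ignoring padding) into sentence vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # [B, T, H]
    mask = batch["attention_mask"].unsqueeze(-1).float() # [B, T, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # [B, H]

# Rank candidate statements by cosine similarity to a query statement.
query = "a statement suspected to be implicit hate speech"
candidates = ["candidate statement one", "candidate statement two"]
vectors = embed([query] + candidates, tokenizer, model)
scores = F.cosine_similarity(vectors[:1], vectors[1:])
```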
## Citation
**BibTeX:**
```bibtex
@inproceedings{kim-etal-2023-conprompt,
title = "{C}on{P}rompt: Pre-training a Language Model with Machine-Generated Data for Implicit Hate Speech Detection",
author = "Kim, Youngwook and
Park, Shinwoo and
Namgoong, Youngsoo and
Han, Yo-Sub",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.731",
doi = "10.18653/v1/2023.findings-emnlp.731",
pages = "10964--10980",
abstract = "Implicit hate speech detection is a challenging task in text classification since no explicit cues (e.g., swear words) exist in the text. While some pre-trained language models have been developed for hate speech detection, they are not specialized in implicit hate speech. Recently, an implicit hate speech dataset with a massive number of samples has been proposed by controlling machine generation. We propose a pre-training approach, ConPrompt, to fully leverage such machine-generated data. Specifically, given a machine-generated statement, we use example statements of its origin prompt as positive samples for contrastive learning. Through pre-training with ConPrompt, we present ToxiGen-ConPrompt, a pre-trained language model for implicit hate speech detection. We conduct extensive experiments on several implicit hate speech datasets and show the superior generalization ability of ToxiGen-ConPrompt compared to other pre-trained models. Additionally, we empirically show that ConPrompt is effective in mitigating identity term bias, demonstrating that it not only makes a model more generalizable but also reduces unintended bias. We analyze the representation quality of ToxiGen-ConPrompt and show its ability to consider target group and toxicity, which are desirable features in terms of implicit hate speeches.",
}
```