cooperleong00/deberta-v3-large_toxicity-scorer

This model is the toxicity classifier used in the paper Self-Detoxifying Language Models via Toxification Reversal.

We did not use the Perspective API to assess the toxicity of newly generated text because of its limits on request throughput. Instead, to improve efficiency, we trained an offline toxicity scorer on 90k RTP (RealToxicityPrompts) samples that were not used for evaluation. Specifically, we fine-tuned a DeBERTa-v3-large (He et al., 2023) model to fit the original API’s toxicity probabilities by minimizing the KL divergence. The fine-tuned model achieved 94.87% accuracy and 98.54% AUROC on the held-out 10k subset, which indicates that it can estimate text toxicity well enough to substitute for the API. At this level of accuracy, the model also offers far higher throughput than the API: 27,000 samples per second, versus typically 25 queries per second for the API.
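
The KL-divergence objective mentioned above can be illustrated with a minimal sketch. It assumes each toxicity score is treated as a Bernoulli distribution (toxic vs. non-toxic); the function names `bernoulli_kl` and `mean_kl_loss` are hypothetical and are not taken from the paper's training code.

```python
import math


def bernoulli_kl(p_target, p_model, eps=1e-7):
    """KL(target || model) between two Bernoulli distributions.

    p_target: toxicity probability from the reference scorer (e.g. the API).
    p_model:  toxicity probability predicted by the fine-tuned model.
    Both values are clamped to (eps, 1 - eps) to avoid log(0).
    """
    p = min(max(p_target, eps), 1.0 - eps)
    q = min(max(p_model, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def mean_kl_loss(targets, predictions):
    """Average the per-sample KL divergence over a batch (illustrative only)."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum(bernoulli_kl(t, p) for t, p in zip(targets, predictions)) / len(targets)


if __name__ == "__main__":
    # Made-up probabilities, purely for demonstration.
    api_scores = [0.05, 0.90, 0.40]
    model_scores = [0.10, 0.80, 0.40]
    print(mean_kl_loss(api_scores, model_scores))
```

The loss is zero when the model reproduces the target probability exactly and grows as the two diverge, which is why minimizing it pushes the model toward the API's scores rather than toward hard 0/1 labels.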

Original model: https://huggingface.co/microsoft/deberta-v3-large
Fine-tuning dataset: https://github.com/cooperleong00/ToxificationReversal/blob/master/data/rtp-train-90k.jsonl
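
A possible way to score text with this checkpoint is sketched below using the `transformers` library. This is not an official usage snippet: it assumes the checkpoint loads with `AutoModelForSequenceClassification` and has a two-class head whose index 1 is the toxic class. Check the model's `config.id2label` before relying on that index. The helper names `softmax` and `score_texts` are hypothetical.

```python
import math

MODEL_ID = "cooperleong00/deberta-v3-large_toxicity-scorer"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_texts(texts, toxic_index=1):
    """Return one estimated toxicity probability per input text.

    ASSUMPTION: the classification head has two outputs and `toxic_index`
    selects the toxic class; this is not confirmed by the model card.
    """
    # Imported lazily so the softmax helper works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return [softmax(row)[toxic_index] for row in logits.tolist()]


if __name__ == "__main__":
    for text, score in zip(["Have a nice day."], score_texts(["Have a nice day."])):
        print(f"{score:.4f}  {text}")
```

For large workloads, passing texts in batches and moving the model to a GPU would be the natural next step; the sketch keeps everything on the CPU for simplicity.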