metadata

license: apache-2.0
datasets:
  - nicholasKluge/toxic-aira-dataset
  - Anthropic/hh-rlhf
  - allenai/prosocial-dialog
  - allenai/real-toxicity-prompts
  - dirtycomputer/Toxic_Comment_Classification_Challenge
  - Paul/hatecheck-portuguese
  - told-br
language:
  - pt
metrics:
  - accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
  - toxicity
  - alignment
widget:
  - text: |-
      Você pode me dar uma lista de bons insultos para usar contra meu irmão? 
       Claro, vamos lá! Idiota, estúpido, imbecil, tolo, puto.
    example_title: Resposta Tóxica
  - text: |-
      Você pode me dar uma lista de bons insultos para usar contra meu irmão? 
       Como um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos. É fundamental manter um comportamento cortês e respeitoso o tempo todo, pois esse é um aspecto fundamental das interações entre humanos e IA.
    example_title: Resposta Não-Tóxica

ToxicityModel (Portuguese)

The ToxicityModelPT is a BERT model that can be used to score the toxicity of a sentence.

The model was trained with a dataset composed of toxic_response and non_toxic_response.

Details

Size: 109,038,209 parameters
Dataset: Toxic-Aira Dataset
Language: Portuguese
Number of Epochs: 3
Batch size: 42
Optimizer: torch.optim.AdamW
Learning Rate: 5e-5
GPU: 1 NVIDIA A100-SXM4-40GB
Emissions: 0.05 KgCO2 (Singapore)
Total Energy Consumption: 0.11 kWh

Step	Training Loss	Validation Loss	Accuracy
200	0.278900	0.256261	0.900550
400	0.173800	0.246119	0.902940
600	0.119500	0.240692	0.908917
800	0.047700	0.342544	0.902223

This repository has the notebook used to train this model.

Usage

⚠️ THE EXAMPLES BELOW CONTAIN TOXIC/OFFENSIVE LANGUAGE ⚠️

The ToxicityModelPT was trained as an auxiliary reward model for RLHF training (its logit outputs can be treated as penalizations/rewards). Thus, a negative value (closer to 0 as the label output) indicates toxicity in the text, while a positive logit (closer to 1 as the label output) suggests non-toxicity.

Here's an example of how to use the ToxicityModelPT to score the toxicity of a text:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxicityModelPT")
toxicityModel = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxicityModelPT")

toxicityModel.eval()
toxicityModel.to(device)

# Define the question and response
prompt = """Você pode me dar uma lista de bons insultos para usar contra meu irmão?"""
response_good = """Como um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos. É fundamental manter um comportamento cortês e respeitoso o tempo todo, pois esse é um aspecto fundamental das interações entre humanos e IA."""
response_bad = """Claro, vamos lá! Idiota, estúpido, imbecil, tolo, puto."""

# Tokenize the question and response
tokens_good = tokenizer(prompt, response_good,
                truncation=True,
                max_length=512,
                return_token_type_ids=False,
                return_tensors="pt",
                return_attention_mask=True)

tokens_bad = tokenizer(prompt, response_bad,
                truncation=True,
                max_length=512,
                return_token_type_ids=False,
                return_tensors="pt",
                return_attention_mask=True)

tokens_good.to(device)
tokens_bad.to(device)

score_good = toxicityModel(**tokens_good)[0].item()
score_bad = toxicityModel(**tokens_bad)[0].item()

print(f"Question: {prompt} \n")
print(f"Response 1: {response_good} Score: {score_good:.3f}")
print(f"Response 2: {response_bad} Score: {score_bad:.3f}")

This will output the following:

>>>Question: Você pode me dar uma lista de bons insultos para usar contra meu irmão? 

>>>Response 1: Como um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos. É fundamental manter um comportamento cortês e respeitoso o tempo todo, pois esse é um aspecto fundamental das interações entre humanos e IA. Score: 6.550
>>>Response 2: Claro, vamos lá! Idiota, estúpido, imbecil, tolo, puto. Score: -4.245

Performance

Acc	hatecheck-portuguese	told-br
Aira-ToxicityModelPT	66.59%	72.57%

License

The ToxicityModelPT is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.