This is the model card for HateBERTimbau-Twitter. You may be interested in some of the other models from the kNOwHATE project.

HateBERTimbau-Twitter

HateBERTimbau-Twitter is a transformer-based encoder model for identifying Hate Speech in Portuguese social media text. It is a fine-tuned version of the HateBERTimbau model, further trained on a dataset of 21,546 tweets focused on Hate Speech.

Model Description

Developed by: kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate
Funded by: European Union
Model type: Transformer-based text classification model fine-tuned for Hate Speech detection in Portuguese social media text
Language: Portuguese
Fine-tuned from model: knowhate/HateBERTimbau

Uses

You can use this model directly with a pipeline for text classification:

from transformers import pipeline
classifier = pipeline('text-classification', model='knowhate/HateBERTimbau-twitter')

classifier("Isso pulhiticos merdosos, continuem a importar lixo, até Portugal deixar de ser Portugal.")

[{'label': 'Hate Speech', 'score': 0.86262047290802}]
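
The pipeline returns a list of dictionaries, each holding a label and a confidence score. A minimal sketch of turning that output into a yes/no decision follows. The helper name is_hate_speech and the 0.5 default threshold are illustrative assumptions, not part of the model or the transformers library, and the example output is pasted in so the sketch runs without downloading the model.

```python
from typing import Dict, List

# Label string copied from the example output above. The names of any other
# labels are not stated in this card, so only this one is checked.
HATE_LABEL = "Hate Speech"


def is_hate_speech(predictions: List[Dict], threshold: float = 0.5) -> bool:
    """Return True if the top prediction is HATE_LABEL with score >= threshold.

    Hypothetical helper: the threshold is an assumption to tune per use case.
    """
    if not predictions:
        return False
    top = max(predictions, key=lambda p: p["score"])
    return top["label"] == HATE_LABEL and top["score"] >= threshold


# Output of the example call above, reused here as plain data.
example_output = [{"label": "Hate Speech", "score": 0.86262047290802}]

print(is_hate_speech(example_output))                 # default threshold
print(is_hate_speech(example_output, threshold=0.9))  # stricter threshold
```

Raising the threshold trades recall for precision, which may matter given the test results reported further down.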

Alternatively, you can fine-tune the model for a specific task or dataset:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-twitter")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-twitter")
dataset = load_dataset("knowhate/youtube-train")

def tokenize_function(examples):
    # Adjust the column names below to match the text column(s) of your dataset.
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

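# Evaluate at the end of every epoch. Recent transformers releases name this argument eval_strategy.
training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")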
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()

Training

Data

21,546 tweets associated with offensive content were used to fine-tune the base model.

Training Hyperparameters

Batch Size: 32
Epochs: 3
Learning Rate: 2e-5 with Adam optimizer
Maximum Sequence Length: 350 tokens
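
As a rough sanity check, the hyperparameters above imply a fairly short training run. The sketch below estimates the number of optimizer steps under assumptions that this card does not state: all 21,546 tweets are in the training split, training runs on a single device without gradient accumulation, and the last partial batch of each epoch is kept.

```python
import math

# Values taken from this card.
num_tweets = 21_546
batch_size = 32
epochs = 3

# Assumption: every tweet is a training example and the final partial
# batch of each epoch is not dropped.
steps_per_epoch = math.ceil(num_tweets / batch_size)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```

If part of the data was held out for validation, the real step count would be lower.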

Testing

Data

The dataset used to test this model was: knowhate/twitter-test

Results

Dataset	Precision	Recall	F1-score
knowhate/twitter-test	0.443	0.470	0.456
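
The reported F1-score is consistent with the harmonic mean of the reported precision and recall, as the short check below illustrates. The card does not say whether the metrics refer to the Hate Speech class alone or are averaged across classes, so treat this as a plausibility check rather than a description of the evaluation procedure.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the results table above.
precision, recall, reported_f1 = 0.443, 0.470, 0.456

computed_f1 = f1_from_precision_recall(precision, recall)
print(round(computed_f1, 3))
print(abs(computed_f1 - reported_f1) < 0.001)
```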

BibTeX Citation

Currently in Peer Review

@article{

}

Acknowledgements

This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306). However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or Knowhate Project. Neither the European Union nor the Knowhate Project can be held responsible for them.
