---
license: cc
language:
- pt
tags:
- Hate Speech
- kNOwHATE
- not-for-all-audiences
widget:
- text: Os [MASK] são todos uns animais, deviam voltar para a sua terra.
---
---
<img align="left" width="140" height="140" src="https://ilga-portugal.pt/files/uploads/2023/06/logo_HATE_cores_page-0001-1024x539.jpg">
<p style="text-align: center;"> This is the model card for HateBERTimbau.
You may be interested in some of the other models from the <a href="https://huggingface.co/knowhate">kNOwHATE project</a>.
</p>
---
# HateBERTimbau
**HateBERTimbau** is a foundation large language model for European **Portuguese**, targeted at hate speech content.
It is an **encoder** of the BERT family, based on the Transformer neural architecture and
developed over the [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) model, retrained on a dataset of 229,103 tweets specifically focused on potential hate speech.
## Model Description
- **Developed by:** [kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate](https://knowhate.eu)
- **Funded by:** [European Union](https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/cerv-2021-equal)
- **Model type:** Transformer-based model retrained for Hate Speech in Portuguese social media text
- **Language:** Portuguese
- **Retrained from model:** [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased)
Several models were developed by fine-tuning the base HateBERTimbau for hate speech detection; they are listed in the table below:
| HateBERTimbau's Family of Models |
|---------------------------------------------------------------------------------------------------------|
| [**HateBERTimbau YouTube**](https://huggingface.co/knowhate/HateBERTimbau-youtube) |
| [**HateBERTimbau Twitter**](https://huggingface.co/knowhate/HateBERTimbau-twitter) |
| [**HateBERTimbau YouTube+Twitter**](https://huggingface.co/knowhate/HateBERTimbau-yt-tt)|
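Any of these fine-tuned checkpoints can be loaded directly for classification. A minimal sketch, assuming they expose a standard sequence-classification head (the example text and the returned label scheme are illustrative assumptions):

```python
from transformers import pipeline

# Label names depend on the checkpoint; inspect the output to map them
classifier = pipeline("text-classification", model="knowhate/HateBERTimbau-youtube")
print(classifier("Um comentário de exemplo."))
```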
# Uses
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='knowhate/HateBERTimbau')
unmasker("Os [MASK] são todos uns animais, deviam voltar para a sua terra.")
[{'score': 0.6771652698516846,
'token': 12714,
'token_str': 'africanos',
'sequence': 'Os africanos são todos uns animais, deviam voltar para a sua terra.'},
{'score': 0.08679857850074768,
'token': 15389,
'token_str': 'homossexuais',
'sequence': 'Os homossexuais são todos uns animais, deviam voltar para a sua terra.'},
{'score': 0.03806231543421745,
'token': 4966,
'token_str': 'portugueses',
'sequence': 'Os portugueses são todos uns animais, deviam voltar para a sua terra.'},
{'score': 0.035253893584012985,
'token': 16773,
'token_str': 'Portugueses',
'sequence': 'Os Portugueses são todos uns animais, deviam voltar para a sua terra.'},
{'score': 0.023521048948168755,
'token': 8618,
'token_str': 'brancos',
'sequence': 'Os brancos são todos uns animais, deviam voltar para a sua terra.'}]
```
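The encoder can also be used to extract contextual embeddings for downstream use. A minimal sketch of the generic Transformers feature-extraction pattern (an assumption, not a workflow documented by this card):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau")
model = AutoModel.from_pretrained("knowhate/HateBERTimbau")

inputs = tokenizer("Um exemplo de frase em português.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# [CLS] token vector as a sentence-level representation
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # (1, hidden_size)
```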
Alternatively, the model can be fine-tuned for a specific task or dataset:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau")

dataset = load_dataset("knowhate/youtube-train")

def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()
```
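Once training finishes, the fine-tuned model can be applied to new examples. A short sketch reusing the tokenizer and model from the block above (the example pair and the label mapping are illustrative assumptions):

```python
import torch

# The training data was tokenized as sentence pairs, so inference mirrors that
inputs = tokenizer("comentário de exemplo", "contexto de exemplo",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted class index
```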
# Training
## Data
A corpus of 229,103 tweets associated with offensive content was used to retrain the base model.
## Training Hyperparameters
- Batch Size: 4 samples
- Epochs: 100
- Learning Rate: 5e-5 with Adam optimizer
- Maximum Sequence Length: 512 tokens
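A minimal sketch of how these hyperparameters map onto a Hugging Face masked-language-modeling setup; the collator, the 15% masking probability, and the dataset wiring are assumptions, since the original retraining script is not part of this card:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-large-portuguese-cased")
model = AutoModelForMaskedLM.from_pretrained("neuralmind/bert-large-portuguese-cased")

# Random masking applied on the fly; 15% is the standard BERT default (assumption)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="hatebertimbau-retrain",
    per_device_train_batch_size=4,  # batch size: 4 samples
    num_train_epochs=100,           # epochs: 100
    learning_rate=5e-5,             # 5e-5 (Trainer's default optimizer is AdamW)
)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=...)  # tweet corpus tokenized with max_length=512
# trainer.train()
```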
# Testing
## Data
We used two different datasets for testing: one of [YouTube comments](https://huggingface.co/datasets/knowhate/youtube-test) and another of [tweets](https://huggingface.co/datasets/knowhate/twitter-test).
## Hate Speech Classification Results (with no fine-tuning)
| Dataset | Precision | Recall | F1-score |
|:----------------|:-----------|:----------|:-------------|
| **YouTube** | 0.928 | 0.108 | **0.193** |
| **Twitter** | 0.686 | 0.211 | **0.323** |
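As a sanity check, each F1-score above is the harmonic mean of its precision and recall:

```python
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.928, 0.108), 3))  # 0.193 (YouTube)
print(round(f1(0.686, 0.211), 3))  # 0.323 (Twitter)
```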
# BibTeX Citation
```bibtex
@inproceedings{DBLP:conf/slate/MatosS00B22,
author = {Bernardo Cunha Matos and
Raquel Bento Santos and
Paula Carvalho and
Ricardo Ribeiro and
Fernando Batista},
editor = {Jo{\~{a}}o Cordeiro and
Maria Jo{\~{a}}o Pereira and
Nuno F. Rodrigues and
Sebasti{\~{a}}o Pais},
title = {Comparing Different Approaches for Detecting Hate Speech in Online
Portuguese Comments},
booktitle = {11th Symposium on Languages, Applications and Technologies, {SLATE}
2022, July 14-15, 2022, Universidade da Beira Interior, Covilh{\~{a}},
Portugal},
series = {OASIcs},
volume = {104},
pages = {10:1--10:12},
publisher = {Schloss Dagstuhl - Leibniz-Zentrum f{\"{u}}r Informatik},
year = {2022},
url = {https://doi.org/10.4230/OASIcs.SLATE.2022.10},
doi = {10.4230/OASICS.SLATE.2022.10},
}
```
# Acknowledgements
This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306).
However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the kNOwHATE project.
Neither the European Union nor the kNOwHATE project can be held responsible for them.