---
language:
- es
metrics:
- f1
pipeline_tag: text-classification
---
## Contextualized, fine-grained hate speech detection
Try our [demo](https://huggingface.co/spaces/piubamas/discurso-de-odio).
Model trained to detect hate speech comments in news articles. The base model is BETO, a Spanish BERT pre-trained model. The model was trained on a multilabel classification task, where each input has a label for each of the considered groups:

| Label | Description |
| :--------- | :-------------------------------------- |
| WOMEN | Against women |
| LGBTI | Against LGBTI |
| RACISM | Racist |
| CLASS | Classist |
| POLITICS | Because of politics |
| DISABLED | Against disabled |
| APPEARANCE | Against people because of their appearance |
| CRIMINAL | Against criminals |

There is an extra label `CALLS`, which represents whether a comment is a call to violent action or not.
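Because the task is multilabel, each label is scored independently (a sigmoid per label, not a softmax over labels). A minimal sketch of how the model's logits map to labels; the label order and the helper name `logits_to_labels` are illustrative, and a threshold of 0.5 on the probability is equivalent to a threshold of 0 on the logit:

```python
import torch

# Labels from the table above, plus the extra CALLS label (order is illustrative)
LABELS = ["WOMEN", "LGBTI", "RACISM", "CLASS", "POLITICS",
          "DISABLED", "APPEARANCE", "CRIMINAL", "CALLS"]

def logits_to_labels(logits, threshold=0.5):
    """Independently threshold each label's probability (multilabel, not softmax)."""
    probs = torch.sigmoid(logits)
    return [label for label, p in zip(LABELS, probs.tolist()) if p > threshold]

# A logit > 0 corresponds to a sigmoid probability > 0.5
example_logits = torch.tensor([2.0, -1.0, 3.5, -0.2, -4.0, -1.0, -0.5, -2.0, 1.0])
# logits_to_labels(example_logits) → ["WOMEN", "RACISM", "CALLS"]
```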
## Input
The model was trained taking into account both the comment and the context. To feed this model, use the template
```
TEXT [SEP] CONTEXT
```
where `[SEP]` is the special token used to separate the comment from the context.
### Example
If we want to analyze
```
Comment: Hay que matarlos a todos!!! Nos infectaron con su virus!
Context: China prohibió la venta de perros y gatos para consumo humano
```
The input should be
```
Hay que matarlos a todos!!! Nos infectaron con su virus! [SEP] China prohibió la venta de perros y gatos para consumo humano
```
## Usage
The `transformers` text-classification pipeline does not support multilabel classification, so this model cannot be tested directly in the inference widget. You can try our [demo](https://huggingface.co/spaces/piubamas/discurso-de-odio) instead. To use the model in your own code, use the following snippet:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "piubamas/beto-contextualized-hate-speech"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Label names ordered by index
id2label = [model.config.id2label[k] for k in range(len(model.config.id2label))]

def predict(*args):
    # Encode comment and context as a sentence pair
    encoding = tokenizer.encode_plus(*args)
    inputs = {
        k: torch.LongTensor(encoding[k]).reshape(1, -1)
        for k in ("input_ids", "attention_mask", "token_type_ids")
    }
    with torch.no_grad():
        output = model(**inputs)
    # A positive logit corresponds to a sigmoid probability above 0.5
    labels = zip(id2label, output.logits[0].cpu().numpy() > 0)
    return [label for label, pred in labels if pred]

context = "China prohíbe la cría de perros para consumo humano"
text = "Chinos hdrmp hay que matarlos a todos"

prediction = predict(text, context)
```
## Citation
```bibtex
@article{perez2023assessing,
title={Assessing the impact of contextual information in hate speech detection},
author={P{\'e}rez, Juan Manuel and Luque, Franco M and Zayat, Demian and Kondratzky, Mart{\'\i}n and Moro, Agust{\'\i}n and Serrati, Pablo Santiago and Zajac, Joaqu{\'\i}n and Miguel, Paula and Debandi, Natalia and Gravano, Agust{\'\i}n and others},
journal={IEEE Access},
volume={11},
pages={30575--30590},
year={2023},
publisher={IEEE}
}
```