Hate Speech Severity Predictor: BERT

Model Description

This is a fine-tuned BERT model (bert-base-uncased) for hate speech severity prediction, developed as part of an MSc research project at the University of Moratuwa, Sri Lanka.

The model predicts hate speech severity across three levels:

  • Level 0: Non-hate Speech
  • Level 1: Mild / Offensive
  • Level 2: Severe Hate Speech

It also produces a continuous severity score S in [0, 1], computed as the probability-weighted sum of the level weights 0.0, 0.5, and 1.0:

S = 0.0 × P(Level 0) + 0.5 × P(Level 1) + 1.0 × P(Level 2)
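As a worked illustration of the weighted sum above, the following minimal sketch computes S from a three-element probability vector. The function name `severity_score` and the example probabilities are made up for illustration and are not part of the released model.

```python
# Level weights taken from the formula above: Level 0 -> 0.0, Level 1 -> 0.5, Level 2 -> 1.0.
LEVEL_WEIGHTS = (0.0, 0.5, 1.0)


def severity_score(probs):
    """Return the probability-weighted severity S for [P(L0), P(L1), P(L2)]."""
    if len(probs) != len(LEVEL_WEIGHTS):
        raise ValueError("expected exactly three class probabilities")
    return sum(w * p for w, p in zip(LEVEL_WEIGHTS, probs))


# Illustrative probabilities (not real model output):
# 0.0 * 0.2 + 0.5 * 0.5 + 1.0 * 0.3 is approximately 0.55.
print(round(severity_score([0.2, 0.5, 0.3]), 3))
```

Because the weights lie in [0, 1] and the probabilities sum to 1, S stays in [0, 1]: it is 0 when all mass is on Level 0 and 1 when all mass is on Level 2.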

Model Details

  • Developed by: J.A.U.S. Jayakody (239817M), University of Moratuwa
  • Supervised by: Dr. Supunmali Ahangama
  • Base model: bert-base-uncased
  • Language: English
  • License: MIT

Dataset

Fine-tuned on HateXplain (Mathew et al., 2021):

  • 20,148 posts from Twitter and Gab
  • Stratified 70-15-15 train-validation-test split
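The card does not say which tool or random seed produced the split, so the following is only a sketch of what a stratified 70-15-15 split means: each class is divided in the same proportions, so all three subsets keep roughly the original label distribution. The function name `stratified_split`, the seed, and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split example indices into train/val/test, preserving label proportions."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for reproducibility
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Toy label list (not the real dataset): 100 / 60 / 40 examples per level.
toy_labels = [0] * 100 + [1] * 60 + [2] * 40
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```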

Training Details

  • Epochs: 3 (best checkpoint: Epoch 2)
  • Batch size: 16
  • Learning rate: 2e-5
  • Max sequence length: 128 tokens
  • Class weighting: Balanced
  • Hardware: Tesla T4 GPU
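The card lists class weighting as "Balanced" without giving the formula. One common reading, assumed here, is the heuristic behind scikit-learn's `class_weight="balanced"`: each class gets the weight n_samples / (n_classes × class_count), so rarer classes contribute more to the loss. The function name and toy label counts below are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Toy label distribution (not the real training set).
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
for level, weight in balanced_class_weights(toy_labels).items():
    print(level, round(weight, 3))
```

Weights of this kind are typically passed to the loss function (for example as the `weight` argument of a cross-entropy loss) in class-index order.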

Evaluation Results

Metric      SVM     BERT
Accuracy    0.629   0.684
Macro F1    0.615   0.679

Severity Prediction Metrics:

  • Spearman Correlation: 0.714
  • Pearson Correlation: 0.720
  • Mean Absolute Error: 0.212
  • RMSE: 0.292
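For readers who want to reproduce metrics of this kind, the sketch below implements the four measures in plain Python. The card does not state how the reference severity values were derived, so the reference and predicted scores used here are made-up toy values, and the helper names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy values only: reference severities vs. predicted scores S.
reference = [0.0, 0.5, 1.0, 0.5]
predicted = [0.1, 0.4, 0.9, 0.7]
print(spearman(reference, predicted), pearson(reference, predicted))
print(mae(reference, predicted), rmse(reference, predicted))
```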

How to Use

from transformers import BertForSequenceClassification, BertTokenizer
import torch
import torch.nn.functional as F
import numpy as np

# The tokenizer comes from the base model; the fine-tuned weights come from the Hub repository.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('UdaniSJ/hate-speech-severity-bert')
model.eval()  # inference mode: disables dropout

def predict_severity(text):
    # Truncate to the 128-token limit used during training.
    inputs = tokenizer(text, return_tensors='pt',
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        # Softmax over the three class logits gives P(Level 0), P(Level 1), P(Level 2).
        probs = F.softmax(outputs.logits, dim=1).numpy()[0]
    # Severity score: probability-weighted sum with weights 0.0, 0.5, 1.0.
    score = 0.0*probs[0] + 0.5*probs[1] + 1.0*probs[2]
    level = int(np.argmax(probs))
    names = {0:'Non-hate', 1:'Mild', 2:'Severe'}
    return {'level': names[level], 'score': round(float(score),3)}

print(predict_severity("I love all people regardless of background"))

Live Demo

https://huggingface.co/spaces/UdaniSJ/hate-speech-severity-predictor

Limitations

  • Trained on English social media content only
  • May exhibit lexical over-reliance on identity terms
  • Reclaimed language can still be misclassified; context-aware adjustment mitigates this only partially

References

  • Mathew et al. (2021). HateXplain. AAAI 2021.
  • Devlin et al. (2019). BERT. NAACL 2019.
  • Ribeiro et al. (2016). LIME. KDD 2016.
  • Lundberg and Lee (2017). SHAP. NeurIPS 2017.