Hate Speech Severity Predictor — BERT

Model Description

This is a fine-tuned BERT model (bert-base-uncased) for hate speech severity prediction, developed as part of an MSc research project at the University of Moratuwa, Sri Lanka.

The model predicts hate speech severity across three levels:

Level 0 — Non-hate Speech
Level 1 — Mild / Offensive
Level 2 — Severe Hate Speech

It also produces a continuous severity score S in [0, 1], the probability-weighted average of the level weights 0.0, 0.5 and 1.0:

S = 0.0 × P(Level 0) + 0.5 × P(Level 1) + 1.0 × P(Level 2)
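
As a worked example of the severity formula, the sketch below computes S from a three-class probability vector. The function name `severity_score` and the sample probabilities are illustrative assumptions, not part of the released code.

```python
def severity_score(probs):
    """Expected severity S = 0.0*P(Level 0) + 0.5*P(Level 1) + 1.0*P(Level 2)."""
    if len(probs) != 3:
        raise ValueError("expected three class probabilities")
    weights = (0.0, 0.5, 1.0)  # level weights taken from the formula
    return sum(w * p for w, p in zip(weights, probs))

# Hypothetical softmax output: mostly "Mild", some "Severe".
example = [0.1, 0.6, 0.3]
print(round(severity_score(example), 3))  # 0.6
```

A post classified with certainty as Level 0 scores 0.0 and one classified with certainty as Level 2 scores 1.0; mixed predictions fall in between.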

Model Details

Developed by: J.A.U.S. Jayakody (239817M), University of Moratuwa
Supervised by: Dr. Supunmali Ahangama
Base model: bert-base-uncased
Language: English
License: MIT

Dataset

Fine-tuned on HateXplain (Mathew et al., 2021):

20,148 posts from Twitter and Gab
Stratified 70-15-15 train-validation-test split
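
The card does not say which tool or random seed produced the split. As a minimal sketch of what a stratified 70-15-15 split does, the standard-library example below partitions indices per class so that each subset keeps roughly the same label proportions. The function name `stratified_split`, the seed, and the per-class rounding are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    # Group example indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # assumed seed, for reproducibility only
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels standing in for the three severity levels.
toy_labels = [0] * 40 + [1] * 40 + [2] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```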

Training Details

Epochs: 3 (best checkpoint: Epoch 2)
Batch size: 16
Learning rate: 2e-5
Max sequence length: 128 tokens
Class weighting: Balanced
Hardware: Tesla T4 GPU
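
The card lists class weighting as "Balanced" without giving the formula. Assuming this refers to the common inverse-frequency heuristic (weight = n_samples / (n_classes × class count), the rule scikit-learn uses for `class_weight="balanced"`), the weights could be computed as in the sketch below. The function name and toy label counts are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of the class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Toy, imbalanced label distribution (not the HateXplain counts).
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rarer classes receive larger weights, so errors on them contribute more to a weighted cross-entropy loss.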

Evaluation Results

Metric	SVM	BERT
Accuracy	0.629	0.684
Macro F1	0.615	0.679
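
For readers unfamiliar with the two metrics in the table, the sketch below shows how accuracy and macro F1 are computed for a three-class problem. Macro F1 is the unweighted mean of per-class F1 scores, so each severity level counts equally regardless of its frequency. The function name and toy predictions are illustrative and do not reproduce the reported results.

```python
def accuracy_and_macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Return (accuracy, macro-averaged F1) for integer class labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)

# Toy labels and predictions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(round(acc, 3), round(macro_f1, 3))
```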

Severity Prediction Metrics:

Spearman Correlation: 0.714
Pearson Correlation: 0.720
Mean Absolute Error: 0.212
RMSE: 0.292
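
The card does not state how the reference severity values were derived. Assuming they use the same 0.0 / 0.5 / 1.0 mapping as the score S, the four metrics above could be computed as in this standard-library sketch. All function names and the toy data are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy reference scores (assumed 0 / 0.5 / 1 mapping) and predicted scores.
target = [0.0, 0.5, 1.0, 0.5, 0.0]
predicted = [0.1, 0.4, 0.9, 0.7, 0.2]
print(spearman(target, predicted), pearson(target, predicted))
print(mae(target, predicted), rmse(target, predicted))
```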

How to Use

from transformers import BertForSequenceClassification, BertTokenizer
import torch
import torch.nn.functional as F
import numpy as np

# Tokenizer comes from the base checkpoint; fine-tuned weights come from the model repo.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('UdaniSJ/hate-speech-severity-bert')
model.eval()  # inference mode: disables dropout

def predict_severity(text):
    """Return the predicted severity level and the continuous score S for one post."""
    # Truncate to the 128-token limit used during fine-tuning.
    inputs = tokenizer(text, return_tensors='pt',
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
        # Class probabilities for Level 0, Level 1, Level 2.
        probs = F.softmax(outputs.logits, dim=1).numpy()[0]
    # Probability-weighted severity score in [0, 1].
    score = 0.0*probs[0] + 0.5*probs[1] + 1.0*probs[2]
    level = int(np.argmax(probs))
    names = {0:'Non-hate', 1:'Mild', 2:'Severe'}
    return {'level': names[level], 'score': round(float(score),3)}

print(predict_severity("I love all people regardless of background"))

Live Demo

https://huggingface.co/spaces/UdaniSJ/hate-speech-severity-predictor

Limitations

Trained on English social media content only
May over-rely on identity terms, so posts that merely mention a group can be scored as hateful
Context-aware adjustment only partially mitigates the misclassification of reclaimed language

References

Mathew et al. (2021). HateXplain. AAAI 2021.
Devlin et al. (2019). BERT. NAACL 2019.
Ribeiro et al. (2016). LIME. KDD 2016.
Lundberg and Lee (2017). SHAP. NeurIPS 2017.

Model size: 0.1B params
Tensor type: F32
Weights format: Safetensors
