ModerationBERT-ML-En

ModerationBERT-ML-En is a text moderation model based on bert-base-multilingual-cased. It scores input text against 18 moderation categories. Although the base model is multilingual, this model currently works only with English text.

A newer version of this model is available and is described as more accurate.

Dataset

The model was trained and fine-tuned using the text-moderation-410K dataset. This dataset contains a wide variety of text samples labeled with different moderation categories.

Model Description

ModerationBERT-ML-En uses the BERT architecture to classify text into the following categories:

harassment
harassment_threatening
hate
hate_threatening
self_harm
self_harm_instructions
self_harm_intent
sexual
sexual_minors
violence
violence_graphic
self-harm
sexual/minors
hate/threatening
violence/graphic
self-harm/intent
self-harm/instructions
harassment/threatening
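
Seven of the 18 labels above look like hyphen- or slash-style spellings of underscore-style labels (for example, self-harm/intent and self_harm_intent). The model still produces a separate score for each of the 18 outputs. The sketch below is only an illustration of how the labels could be grouped under one canonical name; the helpers canonical_name and group_labels are hypothetical and not part of the model's own code.

```python
from collections import defaultdict

# The 18 output labels, in the order listed above.
CATEGORIES = [
    "harassment", "harassment_threatening", "hate", "hate_threatening",
    "self_harm", "self_harm_instructions", "self_harm_intent", "sexual",
    "sexual_minors", "violence", "violence_graphic", "self-harm",
    "sexual/minors", "hate/threatening", "violence/graphic",
    "self-harm/intent", "self-harm/instructions", "harassment/threatening",
]

def canonical_name(label):
    """Rewrite 'self-harm/intent' style spellings as 'self_harm_intent' (hypothetical helper)."""
    return label.replace("-", "_").replace("/", "_")

def group_labels(labels):
    """Group label spellings by canonical name, keeping the original order (hypothetical helper)."""
    groups = defaultdict(list)
    for label in labels:
        groups[canonical_name(label)].append(label)
    return dict(groups)

groups = group_labels(CATEGORIES)
print(len(groups))                 # 11
print(groups["self_harm_intent"])  # ['self_harm_intent', 'self-harm/intent']
```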

Training and Fine-Tuning

The model was trained on 95% of the dataset, with the remaining 5% held out for evaluation. Training was performed in two stages:

Initial Training: Only the classifier layer was trained; all BERT layers were frozen.
Fine-Tuning: The top two layers of the BERT model were unfrozen and the model was fine-tuned further.
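
The two stages above can be sketched as a small helper that toggles requires_grad by parameter name. This is a minimal sketch and not the author's training code: the helper set_trainable and the prefix lists are assumptions, based on the parameter naming used by BertForSequenceClassification in transformers (classifier.* for the head, bert.encoder.layer.0 through bert.encoder.layer.11 for the 12 encoder layers of a BERT-base model).

```python
def set_trainable(model, trainable_prefixes):
    """Enable gradients only for parameters whose name starts with a given prefix.

    Works with any object exposing named_parameters(), such as a torch.nn.Module.
    Returns the names of the parameters left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Stage 1 (assumed prefixes): train only the classification head.
STAGE_1 = ["classifier."]

# Stage 2 (assumed prefixes): also unfreeze the top two of the 12 encoder layers.
STAGE_2 = ["classifier.", "bert.encoder.layer.10.", "bert.encoder.layer.11."]
```

With a loaded model, set_trainable(model, STAGE_1) would be called before the first stage and set_trainable(model, STAGE_2) before the second. Learning rates, epoch counts, and whether the pooler was also unfrozen are not stated here, so they are left out of the sketch.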

Installation

To use ModerationBERT-ML-En, install the Hugging Face transformers library and PyTorch (torch):

pip install transformers torch

Usage

Here is an example of how to use ModerationBERT-ML-En to predict the moderation categories for a given text:

import json
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the tokenizer and model
model_name = "ifmain/ModerationBERT-ML-En"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=18)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def predict(text, model, tokenizer):
    # Call the tokenizer directly; encode_plus is deprecated in favor of __call__
    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=128,
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    predictions = torch.sigmoid(outputs.logits)  # Convert logits to probabilities
    return predictions

# Example usage
new_text = "Fuck off stuped trash"
predictions = predict(new_text, model, tokenizer)

# Define the categories
categories = ['harassment', 'harassment_threatening', 'hate', 'hate_threatening', 
              'self_harm', 'self_harm_instructions', 'self_harm_intent', 'sexual', 
              'sexual_minors', 'violence', 'violence_graphic', 'self-harm', 
              'sexual/minors', 'hate/threatening', 'violence/graphic', 
              'self-harm/intent', 'self-harm/instructions', 'harassment/threatening']

# Convert predictions to a dictionary
category_scores = {categories[i]: predictions[0][i].item() for i in range(len(categories))}

output = {
    "text": new_text,
    "category_scores": category_scores
}

# Print the result as a JSON with indentation
print(json.dumps(output, indent=4, ensure_ascii=False))

Output:

{
    "text": "Fuck off stuped trash",
    "category_scores": {
        "harassment": 0.9272650480270386,
        "harassment_threatening": 0.0013139015063643456,
        "hate": 0.011709265410900116,
        "hate_threatening": 1.1083522622357123e-05,
        "self_harm": 0.00039102151640690863,
        "self_harm_instructions": 0.0002464024000801146,
        "self_harm_intent": 0.00031603744719177485,
        "sexual": 0.020730027928948402,
        "sexual_minors": 0.00018848323088604957,
        "violence": 0.008375612087547779,
        "violence_graphic": 2.8763401132891886e-05,
        "self-harm": 0.00043840022408403456,
        "sexual/minors": 0.00018241720681544393,
        "hate/threatening": 1.1130881830467843e-05,
        "violence/graphic": 2.7211604901822284e-05,
        "self-harm/intent": 0.00026327319210395217,
        "self-harm/instructions": 0.00023905260604806244,
        "harassment/threatening": 0.0012845908058807254
    }
}
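
Because the usage code applies a sigmoid to each logit, every category gets its own independent score and the scores do not sum to one. One simple way to turn them into moderation decisions is a per-category cutoff, sketched below. The helper flag_categories and the default threshold of 0.5 are assumptions for illustration, not values recommended by the model's author.

```python
def flag_categories(category_scores, threshold=0.5):
    """Return (category, score) pairs at or above the threshold, highest score first.

    The 0.5 default is an illustrative assumption; tune it per category as needed.
    """
    flagged = [
        (name, score)
        for name, score in category_scores.items()
        if score >= threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# A few scores copied from the output above.
example_scores = {
    "harassment": 0.9272650480270386,
    "hate": 0.011709265410900116,
    "sexual": 0.020730027928948402,
    "violence": 0.008375612087547779,
}

print(flag_categories(example_scores))  # [('harassment', 0.9272650480270386)]
```

The same function can be given the full category_scores dictionary built in the usage example.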

Notes

This model is currently configured to work only with English text.
Future updates may include support for additional languages.
