---
license: apache-2.0
datasets:
- ifmain/text-moderation-410K
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
---
# ModerationBERT-ML-En

**ModerationBERT-ML-En** is a text moderation model based on `bert-base-multilingual-cased`. It scores text against 18 moderation categories and currently works only with English text.

[Check out the newer, more accurate version of this model.](https://huggingface.co/ifmain/open-text-moderation-7)

## Dataset

The model was trained and fine-tuned using the [text-moderation-410K](https://huggingface.co/datasets/ifmain/text-moderation-410K) dataset. This dataset contains a wide variety of text samples labeled with different moderation categories.

## Model Description

ModerationBERT-ML-En uses the BERT architecture to classify text into the following categories:
- harassment
- harassment_threatening
- hate
- hate_threatening
- self_harm
- self_harm_instructions
- self_harm_intent
- sexual
- sexual_minors
- violence
- violence_graphic
- self-harm
- sexual/minors
- hate/threatening
- violence/graphic
- self-harm/intent
- self-harm/instructions
- harassment/threatening
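
The list above appears to contain each compound category twice, once spelled with underscores and once with hyphens and slashes (for example `self_harm_intent` and `self-harm/intent`). As a minimal sketch, assuming those spellings refer to the same underlying category, the 18 output indices can be grouped under 11 canonical names. The helper `canonical` and the `groups` mapping are illustrative names, not part of the model's API.

```python
from collections import defaultdict

# The 18 labels in the order listed in this model card.
CATEGORIES = [
    'harassment', 'harassment_threatening', 'hate', 'hate_threatening',
    'self_harm', 'self_harm_instructions', 'self_harm_intent', 'sexual',
    'sexual_minors', 'violence', 'violence_graphic', 'self-harm',
    'sexual/minors', 'hate/threatening', 'violence/graphic',
    'self-harm/intent', 'self-harm/instructions', 'harassment/threatening',
]

def canonical(name):
    """Normalize a label to its underscore spelling (assumed equivalence)."""
    return name.replace('/', '_').replace('-', '_')

# Map each canonical name to the output indices that share it.
groups = defaultdict(list)
for index, name in enumerate(CATEGORIES):
    groups[canonical(name)].append(index)

for name, indices in groups.items():
    print(name, indices)
```

If the two spellings are treated as one category, a downstream consumer could, for instance, take the maximum score across the indices in each group.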

## Training and Fine-Tuning

The dataset was split into 95% for training and 5% for evaluation. Training was performed in two stages:

1. **Initial Training**: The classifier layer was trained with frozen BERT layers.
2. **Fine-Tuning**: The top two layers of the BERT model were unfrozen and fine-tuned together with the classifier layer.
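
The two stages above can be sketched as toggling `requires_grad` on groups of parameters. This is only an illustration under stated assumptions, not the author's training script: it uses a tiny, randomly initialized BERT (the small `BertConfig` sizes are arbitrary) so that it runs without downloading weights, and the helper `set_trainable` is a hypothetical name. The optimizer, loss, and training loop are omitted.

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny randomly initialized BERT (sizes are arbitrary, for illustration only).
config = BertConfig(
    hidden_size=32,
    num_hidden_layers=4,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=18,
)
model = BertForSequenceClassification(config)

def set_trainable(module, flag):
    """Enable or disable gradient updates for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = flag

# Stage 1: freeze the whole BERT encoder and train only the classifier layer.
set_trainable(model.bert, False)
set_trainable(model.classifier, True)
stage1 = [n for n, p in model.named_parameters() if p.requires_grad]

# Stage 2: additionally unfreeze the top two encoder layers.
for layer in model.bert.encoder.layer[-2:]:
    set_trainable(layer, True)
stage2 = [n for n, p in model.named_parameters() if p.requires_grad]

print(len(stage1), "trainable tensors in stage 1")
print(len(stage2), "trainable tensors in stage 2")
```

Only parameters with `requires_grad=True` receive gradient updates, so an optimizer built for each stage would typically be given just those parameters.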

## Installation

To use ModerationBERT-ML-En, you will need to install the `transformers` library from Hugging Face and `torch`.

```bash
pip install transformers torch
```

## Usage

Here is an example of how to use ModerationBERT-ML-En to predict the moderation categories for a given text:

```python
import json
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the tokenizer and model
model_name = "ifmain/ModerationBERT-ML-En"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=18)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def predict(text, model, tokenizer):
    # Tokenize to a fixed length of 128 tokens (padded or truncated as needed)
    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=128,
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    # Move inputs to the same device as the model (uses the global `device`)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    model.eval()  # Disable dropout for inference
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    predictions = torch.sigmoid(outputs.logits)  # Convert logits to per-category probabilities
    return predictions

# Example usage
new_text = "Fuck off stuped trash"
predictions = predict(new_text, model, tokenizer)

# Define the categories
categories = ['harassment', 'harassment_threatening', 'hate', 'hate_threatening', 
              'self_harm', 'self_harm_instructions', 'self_harm_intent', 'sexual', 
              'sexual_minors', 'violence', 'violence_graphic', 'self-harm', 
              'sexual/minors', 'hate/threatening', 'violence/graphic', 
              'self-harm/intent', 'self-harm/instructions', 'harassment/threatening']

# Convert predictions to a dictionary
category_scores = {categories[i]: predictions[0][i].item() for i in range(len(categories))}

output = {
    "text": new_text,
    "category_scores": category_scores
}

# Print the result as a JSON with indentation
print(json.dumps(output, indent=4, ensure_ascii=False))
```

Output:

```json
{
    "text": "Fuck off stuped trash",
    "category_scores": {
        "harassment": 0.9272650480270386,
        "harassment_threatening": 0.0013139015063643456,
        "hate": 0.011709265410900116,
        "hate_threatening": 1.1083522622357123e-05,
        "self_harm": 0.00039102151640690863,
        "self_harm_instructions": 0.0002464024000801146,
        "self_harm_intent": 0.00031603744719177485,
        "sexual": 0.020730027928948402,
        "sexual_minors": 0.00018848323088604957,
        "violence": 0.008375612087547779,
        "violence_graphic": 2.8763401132891886e-05,
        "self-harm": 0.00043840022408403456,
        "sexual/minors": 0.00018241720681544393,
        "hate/threatening": 1.1130881830467843e-05,
        "violence/graphic": 2.7211604901822284e-05,
        "self-harm/intent": 0.00026327319210395217,
        "self-harm/instructions": 0.00023905260604806244,
        "harassment/threatening": 0.0012845908058807254
    }
}
```
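
The scores are independent per-category probabilities, so a moderation decision still requires a cutoff. As a minimal sketch, the function below flags every category whose score meets a threshold. The name `flag_categories` and the default threshold of 0.5 are assumptions for illustration; the model card does not specify a recommended threshold, and a suitable value would normally be tuned on held-out data.

```python
def flag_categories(category_scores, threshold=0.5):
    """Return the categories scoring at or above the threshold, highest first."""
    flagged = [name for name, score in category_scores.items() if score >= threshold]
    return sorted(flagged, key=lambda name: category_scores[name], reverse=True)

# A few scores copied from the example output above.
scores = {
    "harassment": 0.9272650480270386,
    "hate": 0.011709265410900116,
    "sexual": 0.020730027928948402,
    "violence": 0.008375612087547779,
}

flagged = flag_categories(scores)
print(flagged)
```

Lowering the threshold flags more categories at the cost of more false positives.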

## Notes

- This model is currently configured to work only with English text.
- Future updates may include support for additional languages.