---
language:
- bg
metrics:
- f1
- accuracy
- precision
- recall
base_model:
- rmihaylov/bert-base-bg
pipeline_tag: text-classification
license: apache-2.0
datasets:
- sofia-uni/toxic-data-bg
- wikimedia/wikipedia
- oscar-corpus/oscar
- petkopetkov/chitanka
tags:
- bert
- not-for-all-audiences
- medical
---
A toxic-language classification model for Bulgarian, based on the [bert-base-bg](https://huggingface.co/rmihaylov/bert-base-bg) model.

The model classifies text into four classes: Toxic, MedicalTerminology, NonToxic, MinorityGroup.

Classification report: 

| Accuracy | Precision | Recall | F1 Score | Loss |
|----------|-----------|--------|----------|------|
| 0.85     | 0.86      | 0.85   | 0.85     | 0.43 |

More information [in the paper](https://www.researchgate.net/publication/388842558_Detecting_Toxic_Language_Ontology_and_BERT-based_Approaches_for_Bulgarian_Text).


# Code and usage
For the training files and information on how to use the model, refer to the [GitHub repository of the project](https://github.com/TsvetoslavVasev/toxic-language-classification).
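A minimal inference sketch with the 🤗 Transformers `pipeline` API. The card does not state the exact Hub model id of the fine-tuned checkpoint, so `MODEL_ID` below is a placeholder, and `top_label` is a hypothetical helper for mapping raw per-class scores to the four class names listed above:

```python
# The four classes reported on this card.
LABELS = ["Toxic", "MedicalTerminology", "NonToxic", "MinorityGroup"]

def top_label(scores, labels=LABELS):
    """Map raw per-class scores to the highest-scoring label name."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]

if __name__ == "__main__":
    # Placeholder id -- replace with the actual fine-tuned checkpoint path.
    MODEL_ID = "rmihaylov/bert-base-bg"
    from transformers import pipeline  # imported lazily; requires `transformers`

    clf = pipeline("text-classification", model=MODEL_ID)
    print(clf("Примерен текст на български."))
```

Whether the checkpoint's `id2label` mapping matches the order in `LABELS` should be verified against the model's `config.json` before relying on `top_label`.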


# Reference

If you use this pipeline in your academic work, please cite:

```bibtex
@article{berbatova2025detecting,
  title={Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text},
  author={Berbatova, Melania and Vasev, Tsvetoslav},
  year={2025}
}
```