RuBERT-Toxic

RuBERT-Toxic is a RuBERT model fine-tuned on the Kaggle Russian Language Toxic Comments Dataset. The data and the fine-tuning process are described in detail in this article; the same information is also available on GitHub.
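As a rough usage sketch, a fine-tuned checkpoint like this one can usually be loaded through the Hugging Face `transformers` text-classification pipeline. The hub identifier is not given in this text, so `model_id` is left as a parameter, and the value in the usage comment is a hypothetical placeholder. The label names the model returns are likewise not documented here and should be checked against the model's configuration. The helper names `classify` and `pair_with_text` are illustrative only.

```python
def classify(texts, model_id):
    """Run a text-classification pipeline over `texts`.

    Assumes the `transformers` package is installed and that `model_id`
    points to a sequence-classification checkpoint on the Hugging Face Hub.
    Returns one {"label": ..., "score": ...} dict per input text.
    """
    from transformers import pipeline  # imported lazily; third-party dependency

    clf = pipeline("text-classification", model=model_id)
    # truncation=True cuts inputs that exceed the model's maximum length
    return clf(list(texts), truncation=True)


def pair_with_text(texts, predictions):
    """Attach each prediction to its input text for easier reading."""
    return [
        (text, pred["label"], round(pred["score"], 4))
        for text, pred in zip(texts, predictions)
    ]


# Usage (the identifier is a hypothetical placeholder, not the real hub ID):
#   texts = ["Спасибо, очень полезная статья!"]
#   preds = classify(texts, model_id="your-namespace/rubert-toxic")
#   print(pair_with_text(texts, preds))
```

Comments in the dataset can be as long as 7403 characters, which is why the sketch enables truncation instead of assuming every input fits the model's context window.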

System             Precision   Recall    F1
MNB-Toxic          87.01%      81.22%    83.21%
M-BERTBase-Toxic   91.19%      91.10%    91.15%
RuBERT-Toxic       91.91%      92.51%    92.20%
M-USECNN-Toxic     89.69%      90.14%    89.91%
M-USETrans-Toxic   90.85%      91.92%    91.35%

We fine-tuned two versions of the Multilingual Universal Sentence Encoder (M-USE), as well as Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) and RuBERT, for toxic comment detection in Russian. The fine-tuned RuBERT-Toxic achieved F1 = 92.20%, the best score among the compared systems.
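For readers who want to reproduce figures of this kind, the sketch below computes precision (P), recall (R) and F1 per class from predicted labels and then macro-averages F1. The averaging scheme behind the reported scores is not stated in this text, so macro-averaging here is an assumption, and the toy labels are invented purely for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = [precision_recall_f1(y_true, y_pred, c)[2] for c in classes]
    return sum(scores) / len(scores)


# Toy labels (invented): 1 = toxic, 0 = non-toxic
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

toxic_p, toxic_r, toxic_f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(f"toxic class: P={toxic_p:.4f} R={toxic_r:.4f} F1={toxic_f1:.4f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.4f}")
```

When scores are averaged over classes, the averaged F1 generally differs from the harmonic mean of the averaged precision and recall, so the three columns of a results table need not satisfy the single-class F1 formula exactly.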

Toxic Comments Dataset

The Kaggle Russian Language Toxic Comments Dataset is a collection of annotated Russian-language comments from 2ch and Pikabu, published on Kaggle in 2019. It consists of 14412 comments, of which 4826 are labelled toxic and 9586 non-toxic. The average comment length is ~175 characters; the minimum is 21 and the maximum is 7403.
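The class counts above imply a noticeable imbalance. The short sketch below works out the class shares from the stated counts; the variable names are illustrative, and the interpretation as a majority-class baseline is an added observation rather than a figure from the publication.

```python
# Counts as stated in the dataset description
toxic = 4826
non_toxic = 9586
total = 14412

# Sanity check: the two classes should account for every comment
assert toxic + non_toxic == total

toxic_share = toxic / total
non_toxic_share = non_toxic / total

print(f"toxic:     {toxic_share:.2%}")
print(f"non-toxic: {non_toxic_share:.2%}")
```

Roughly one comment in three is toxic, so a classifier that always predicts "non-toxic" would already reach about two-thirds accuracy. This is one reason to report precision, recall and F1 instead of accuracy alone.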

Citation

If you find this repository helpful, feel free to cite our publication:

@INPROCEEDINGS{Smetanin2020Toxic,
  author={Sergey Smetanin},
  booktitle={Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2020”},
  title={Toxic Comments Detection in Russian},
  year={2020},
  doi={10.28995/2075-7182-2020-19-1149-1159}
} 