language:
  - ru
tags:
  - toxic comments classification

RuBERT-Toxic

RuBERT-Toxic is a RuBERT model fine-tuned on the Kaggle Russian Language Toxic Comments Dataset. A detailed description of the data and the fine-tuning process is available in this article and on GitHub.
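
For orientation, here is a minimal usage sketch, not an official recipe. It assumes the checkpoint is published on the Hugging Face Hub as `sismetanin/rubert-toxic-pikabu-2ch` and loads through the standard `transformers` text-classification pipeline. The helper names `load_classifier`, `summarize`, and `classify_comments` are hypothetical. The mapping from the returned label strings to toxic / non-toxic is not documented here, so check the model's `config.json` before relying on it.

```python
from typing import Dict, List

# Assumed Hub identifier; adjust if the checkpoint lives elsewhere.
MODEL_ID = "sismetanin/rubert-toxic-pikabu-2ch"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads weights on first call)."""
    # Imported lazily so the formatting helper works without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def summarize(texts: List[str], predictions: List[Dict]) -> List[str]:
    """Pair each comment with its predicted label and score."""
    return [
        f"{pred['label']} ({pred['score']:.2%}): {text}"
        for text, pred in zip(texts, predictions)
    ]


def classify_comments(texts: List[str]) -> List[str]:
    """Run the classifier on a batch of comments and format the results."""
    classifier = load_classifier()
    # For a list of strings the pipeline returns one {"label", "score"} dict per text.
    return summarize(texts, classifier(texts))
```

Calling `classify_comments(["Спасибо, очень полезный пост!"])` would return one formatted line per comment. The label strings come from the model configuration and may be generic names such as `LABEL_0` and `LABEL_1`.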

System	Precision	Recall	F₁
MNB-Toxic	87.01%	81.22%	83.21%
M-BERT_Base-Toxic	91.19%	91.10%	91.15%
RuBERT-Toxic	91.91%	92.51%	92.20%
M-USE_CNN-Toxic	89.69%	90.14%	89.91%
M-USE_Trans-Toxic	90.85%	91.92%	91.35%

We fine-tuned two versions of the Multilingual Universal Sentence Encoder (M-USE), Multilingual Bidirectional Encoder Representations from Transformers (M-BERT), and RuBERT for toxic comment detection in Russian. The fine-tuned RuBERT-Toxic achieved the best classification score of the compared models, F₁ = 92.20%.
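
This card does not say how precision, recall, and F₁ are averaged over the two classes. One hint is the MNB-Toxic row: the harmonic mean of 87.01% and 81.22% is about 84.02%, not the reported 83.21%, which would be consistent with computing F₁ per class and then averaging. The sketch below illustrates that macro-averaging scheme under this assumption. The function name `macro_prf` and the toy labels are illustrative and not taken from the dataset.

```python
from typing import List, Tuple


def macro_prf(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall and F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels (1 = toxic, 0 = non-toxic), purely illustrative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p, r, f1 = macro_prf(y_true, y_pred)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

On these toy labels the macro F₁ differs from the harmonic mean of the macro precision and macro recall, which is the same kind of gap seen in the MNB-Toxic row.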

Toxic Comments Dataset

The Kaggle Russian Language Toxic Comments Dataset is a collection of annotated Russian-language comments from 2ch and Pikabu, published on Kaggle in 2019. It consists of 14,412 comments, of which 4,826 are labelled toxic and 9,586 non-toxic. The average comment length is ~175 characters; the minimum is 21 and the maximum is 7,403.
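
The counts above imply a class imbalance of roughly one toxic comment for every two non-toxic ones. The short sketch below works out the class shares and, as one common way to compensate, inverse-frequency class weights. Whether any such weighting was used during fine-tuning is not stated here, so treat the weights as an illustration only.

```python
# Class counts as reported for the dataset.
counts = {"toxic": 4826, "non-toxic": 9586}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" inverse-frequency weights: total / (num_classes * class_count).
# Assumption: shown for illustration, not confirmed as part of the training setup.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.1%}, weight={weights[label]:.3f}")
```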

Citation

If you find this repository helpful, feel free to cite our publication:

@INPROCEEDINGS{Smetanin2020Toxic,
  author={Sergey Smetanin},
  booktitle={Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2020”},
  title={Toxic Comments Detection in Russian},
  year={2020},
  doi={10.28995/2075-7182-2020-19-1149-1159}
}