BERT-based classifier (fine-tuned from Conversational RuBERT) trained on a merge of the Russian Language Toxic Comments dataset collected from 2ch.hk and the Toxic Russian Comments dataset collected from ok.ru.

The datasets were merged, shuffled, and split into train, dev, and test sets in an 80-10-10 proportion. The metrics obtained on the test set are as follows:
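The merge, shuffle, and 80-10-10 split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the function name `split_80_10_10`, the fixed seed, and the toy records are assumptions made for the example.

```python
import random


def split_80_10_10(records, seed=0):
    """Shuffle a copy of records and cut it into train/dev/test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption, for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains goes to test
    return train, dev, test


# Toy stand-ins for the two source datasets: (text, label) pairs.
dataset_a = [(f"comment a{i}", i % 2) for i in range(60)]
dataset_b = [(f"comment b{i}", i % 2) for i in range(40)]

train, dev, test = split_80_10_10(dataset_a + dataset_b)
print(len(train), len(dev), len(test))  # 80 10 10
```

Shuffling before cutting matters here because the two sources are concatenated; without it, the dev and test parts would come almost entirely from the second dataset.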

              precision    recall  f1-score   support
           0       0.98      0.99      0.98     21384
           1       0.94      0.92      0.93      4886
    accuracy                           0.97     26270
   macro avg       0.96      0.96      0.96     26270
weighted avg       0.97      0.97      0.97     26270
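As a quick sanity check, the per-class F1 scores and the weighted average in the report can be recomputed from the precision, recall, and support columns. The snippet below is only an arithmetic sketch working from the rounded values in the table, so it can match the report to two decimals at best; the helper name `f1` is introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the report above.
report = {
    0: {"precision": 0.98, "recall": 0.99, "support": 21384},
    1: {"precision": 0.94, "recall": 0.92, "support": 4886},
}

total = sum(row["support"] for row in report.values())
f1_scores = {label: f1(row["precision"], row["recall"]) for label, row in report.items()}
weighted_f1 = sum(f1_scores[label] * report[label]["support"] for label in report) / total

print(round(f1_scores[0], 2), round(f1_scores[1], 2))  # 0.98 0.93
print(round(weighted_f1, 2))                           # 0.97
print(round(report[1]["support"] / total, 3))          # 0.186
```

The last line shows that class 1 makes up about 18.6% of the test set, which is why the weighted average sits closer to the class 0 scores than the macro average does.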

How to use

from transformers import BertTokenizer, BertForSequenceClassification

# load tokenizer and model weights
tokenizer = BertTokenizer.from_pretrained('s-nlp/russian_toxicity_classifier')
model = BertForSequenceClassification.from_pretrained('s-nlp/russian_toxicity_classifier')

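# prepare the input ('ты супер' is Russian for 'you are great')
batch = tokenizer.encode('ты супер', return_tensors='pt')

# inference: the returned object holds one raw score (logit) per class in .logits
model(batch)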
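
The model call above returns raw logits rather than probabilities. A minimal sketch of turning them into a labeled prediction is given below. It assumes the logits have been copied into a plain list (for example with `model(batch).logits[0].tolist()`), and it assumes that index 0 means neutral and index 1 means toxic. That mapping is consistent with class 1 being the minority class in the metrics table, but it is not stated here and should be checked against the model's `config.id2label`. The function name `interpret_logits` and the example logits are hypothetical.

```python
import math


def interpret_logits(logits, labels=("neutral", "toxic")):
    """Apply softmax to a list of logits and return (label, probabilities)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical logits, not actual model output.
label, probs = interpret_logits([2.0, -1.0])
print(label)  # neutral
```

Returning the full probability list alongside the label makes it possible to apply a custom decision threshold instead of always taking the highest-scoring class.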

Citation

To acknowledge our work, please use the corresponding citation:

@article{dementieva2022russe,
  title={RUSSE-2022: Findings of the First Russian Detoxification Shared Task Based on Parallel Corpora},
  author={Dementieva, Daryna and Logacheva, Varvara and Nikishina, Irina and Fenogenova, Alena and Dale, David and Krotova, Irina and Semenov, Nikita and Shavrina, Tatiana and Panchenko, Alexander}
}

Licensing Information

This model is licensed under the OpenRAIL++ License, which supports the development of various technologies, both industrial and academic, that serve the public good.
