---
language:
- ru
- ru-RU
tags:
- token-classification
license: apache-2.0
widget:
- text: Ёпта, меня зовут придурок и я живу в жопе
---
# RuBERTConv Toxic Editor

## Model description

A tagging model for Russian text detoxification based on rubert-base-cased-conversational.

It assigns each token one of 4 classes:
- Equal = keep the token
- Replace = replace the token with a mask
- Delete = remove the token
- Insert = insert a mask before the token

Use it together with a mask filler; a minimal sketch of how the tags map to edits is shown below.
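The sketch below illustrates how the four tags could be applied to a token sequence. The `apply_tags` helper, the example tokens, and the `[MASK]` placeholder are illustrative assumptions, not code from this repository:

```python
# Illustrative sketch: how per-token tags could translate into an edited sequence
# (helper name, example tokens, and "[MASK]" are assumptions, not project code).
def apply_tags(tokens, tags, mask_token="[MASK]"):
    edited = []
    for token, tag in zip(tokens, tags):
        if tag == "Equal":        # keep the token as-is
            edited.append(token)
        elif tag == "Replace":    # replace the token with a mask
            edited.append(mask_token)
        elif tag == "Delete":     # drop the token entirely
            continue
        elif tag == "Insert":     # insert a mask before the token
            edited.extend([mask_token, token])
    return edited

tokens = ["ты", "придурок", ",", "отстань"]
tags = ["Delete", "Replace", "Equal", "Equal"]
print(apply_tags(tokens, tags))  # ['[MASK]', ',', 'отстань']
```

The resulting masks are then filled by a separate mask-filler model to produce the detoxified text.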
## Intended uses & limitations

### How to use
Colab: link
```python
import torch
from transformers import pipeline

tagger_model_name = "IlyaGusev/rubertconv_toxic_editor"

# Run on GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
device_num = 0 if device == "cuda" else -1

tagger_pipe = pipeline(
    "token-classification",
    model=tagger_model_name,
    tokenizer=tagger_model_name,
    framework="pt",
    device=device_num,
    aggregation_strategy="max"
)

text = "..."
tagger_predictions = tagger_pipe([text], batch_size=1)
sample_predictions = tagger_predictions[0]
print(sample_predictions)
```
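The aggregated predictions carry the tag in `entity_group` plus character offsets (`start`, `end`), so they can be turned into an intermediate text with mask placeholders for a mask-filler model. A rough sketch under that assumption; the `[MASK]` placeholder and the reconstruction logic are illustrative, not the authors' exact post-processing:

```python
# Illustrative post-processing sketch (not the authors' exact code):
# keep "Equal" spans, mask "Replace" spans, drop "Delete" spans,
# and prepend a mask before "Insert" spans.
mask_token = "[MASK]"  # assumption: use whatever mask token your filler model expects

chunks, last_end = [], 0
for entity in sample_predictions:
    tag, start, end = entity["entity_group"], entity["start"], entity["end"]
    chunks.append(text[last_end:start])  # untouched text between tagged spans
    if tag == "Equal":
        chunks.append(text[start:end])
    elif tag == "Replace":
        chunks.append(mask_token)
    elif tag == "Insert":
        chunks.append(mask_token + " " + text[start:end])
    # "Delete": append nothing, the span is removed
    last_end = end
chunks.append(text[last_end:])

masked_text = "".join(chunks)
print(masked_text)  # feed this to a fill-mask model to produce the detoxified text
```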
## Training data
- Dataset: russe_detox_2022
## Training procedure

- Parallel corpus conversion: compute_tags.py
- Training script: train.py
- Pipeline step: dvc.yaml, train_marker
## Eval results
TBA