---
language:
- ru
tags:
- token-classification
license: apache-2.0
widget:
- text: Ёпта, меня зовут придурок и я живу в жопе
---

# RuBERTConv Toxic Editor

## Model description

Tagging model for detoxification based on [rubert-base-cased-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational).

Each token is assigned one of 4 possible classes:
- Equal = keep tokens unchanged
- Replace = replace tokens with mask
- Delete = remove tokens
- Insert = insert mask before tokens

Use together with the [mask filler](https://huggingface.co/IlyaGusev/sber_rut5_filler), which fills in the masks produced by this model.

## Intended uses & limitations

#### How to use

Colab: [link](https://colab.research.google.com/drive/1NUSO1QGlDgD-IWXa2SpeND089eVxrCJW)

```python
import torch
from transformers import pipeline

tagger_model_name = "IlyaGusev/rubertconv_toxic_editor"

# Use the first GPU if available, otherwise fall back to CPU (-1).
device = "cuda" if torch.cuda.is_available() else "cpu"
device_num = 0 if device == "cuda" else -1

tagger_pipe = pipeline(
    "token-classification",
    model=tagger_model_name,
    tokenizer=tagger_model_name,
    framework="pt",
    device=device_num,
    aggregation_strategy="max"  # merge subword pieces into word-level predictions
)

text = "..."
# The pipeline takes a batch of texts and returns one list of predictions per text.
tagger_predictions = tagger_pipe([text], batch_size=1)
sample_predictions = tagger_predictions[0]
print(sample_predictions)
```

## Training data

- Dataset: [russe_detox_2022](https://github.com/skoltech-nlp/russe_detox_2022/tree/main/data)

## Training procedure

- Parallel corpus conversion: [compute_tags.py](https://github.com/IlyaGusev/rudetox/blob/main/rudetox/marker/compute_tags.py)
- Training script: [train.py](https://github.com/IlyaGusev/rudetox/blob/main/rudetox/marker/train.py)
- Pipeline step: [dvc.yaml, train_marker](https://github.com/IlyaGusev/rudetox/blob/main/dvc.yaml#L367)

## Eval results

TBA
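
## Example: applying the predicted tags

The four classes listed in the model description can be read as edit operations on the input. The sketch below illustrates that reading only; it is not the authors' implementation. The helper `apply_edit_tags`, the `(text, tag)` input format, the exact label strings, and the `[MASK]` placeholder are all assumptions made for illustration. The real label names are defined in the model config, and the mask format expected by the mask filler may differ.

```python
from typing import List, Tuple

# Assumption: a generic placeholder; the mask filler may expect a different token.
MASK = "[MASK]"


def apply_edit_tags(spans: List[Tuple[str, str]], mask: str = MASK) -> str:
    """Turn (text, tag) spans into a masked string (hypothetical helper)."""
    out = []
    for text, tag in spans:
        if tag == "Equal":
            out.append(text)            # keep the span unchanged
        elif tag == "Replace":
            out.append(mask)            # swap the span for a mask
        elif tag == "Delete":
            continue                    # drop the span entirely
        elif tag == "Insert":
            out.extend([mask, text])    # put a mask before the span, keep the span
        else:
            raise ValueError(f"Unknown tag: {tag}")
    return " ".join(out)


example = [
    ("hello", "Equal"),
    ("bad", "Replace"),
    ("ugh", "Delete"),
    ("friend", "Insert"),
]
print(apply_edit_tags(example))  # hello [MASK] [MASK] friend
```

The resulting string, with masks in place of the flagged spans, is the kind of input a separate mask-filling model would then complete.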