IlyaGusev committed on
Commit
d917d91
1 Parent(s): e1bc7d1

Update README.md

  1. README.md +42 -1
README.md CHANGED
@@ -8,4 +8,45 @@ license: apache-2.0
 
  ---
 
- # RuBERTConv Toxic Classifier
+ # RuBERTConv Toxic Classifier
+
+ ## Model description
+
+ Based on the [rubert-base-cased-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational) model.
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ ```python
+ from transformers import pipeline
+
+ # The same repository provides both the model weights and the tokenizer.
+ model_name = "MindfulSquirrel/rubertconv_toxic_clf"
+ pipe = pipeline("text-classification", model=model_name, tokenizer=model_name, framework="pt")
+
+ # Classify a sample toxic message (roughly: "You are a moron from the internet").
+ text = "Ты придурок из интернета"
+ pipe([text])
+ ```
+
+ ## Training data
+
+ Datasets:
+ - [2ch](https://www.kaggle.com/blackmoon/russian-language-toxic-comments)
+ - [Odnoklassniki](https://www.kaggle.com/alexandersemiletov/toxic-russian-comments)
+ - [Toloka Persona Chat Rus](https://toloka.ai/ru/datasets)
+ - [Koziev's Conversations](https://github.com/Koziev/NLP_Datasets/blob/master/Conversations/Data) with a [toxic words vocabulary](https://www.dropbox.com/s/ou6lx03b10yhrfl/bad_vocab.txt.tar.gz)
+
+ Augmentations:
+ - Replace "ё" with "е"
+ - Remove or add "?" or "!"
+ - Fix CAPS
+ - Concatenate toxic and non-toxic texts
+ - Concatenate two non-toxic texts
+ - Add toxic words from the vocabulary
+ - Add typos
+ - Mask toxic words with "*", "@", "$"
+
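A few of the augmentations listed above could be sketched roughly as follows. This is only an illustration, not the code used to train the model: the function names are hypothetical, "Fix CAPS" is assumed to mean re-casing fully upper-case text, and masking is assumed to replace a single non-initial character of each vocabulary word.

```python
import random
import re


def replace_yo(text: str) -> str:
    """Replace "ё" with "е" (both cases)."""
    return text.replace("ё", "е").replace("Ё", "Е")


def drop_end_punct(text: str) -> str:
    """Remove trailing "?" and "!" characters."""
    return re.sub(r"[?!]+$", "", text.rstrip())


def fix_caps(text: str) -> str:
    """Assumption: re-case text only when it is written entirely in capitals."""
    return text.capitalize() if text.isupper() else text


def mask_toxic_words(text: str, vocab: set, rng: random.Random) -> str:
    """Assumption: replace one non-initial character of each vocabulary word
    with "*", "@" or "$"."""

    def mask(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() not in vocab or len(word) < 2:
            return word
        pos = rng.randrange(1, len(word))  # keep the first letter intact
        return word[:pos] + rng.choice("*@$") + word[pos + 1:]

    # \w matches Cyrillic letters for str patterns in Python 3.
    return re.sub(r"\w+", mask, text)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = "ТЫ ПРИДУРОК ИЗ ИНТЕРНЕТА?!"
clean = fix_caps(drop_end_punct(replace_yo(sample)))
masked = mask_toxic_words(clean, {"придурок"}, rng)
```

Here `clean` becomes "Ты придурок из интернета", and `masked` is the same string with one inner character of "придурок" replaced by a mask symbol. The vocabulary set stands in for the toxic words vocabulary linked in the dataset list.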
+
+ ## Training procedure
+
+ TBA