IlyaGusev committed on
Commit
3ed31d6
1 Parent(s): 138751a

Update README.md

Files changed (1)
  1. README.md +67 -65
README.md CHANGED
@@ -1,66 +1,68 @@
- ---
- language:
- - ru
- - ru-RU
- tags:
- - token-classification
- license: apache-2.0
- widget:
- - text: Ёпта, меня зовут придурок и я живу в жопе
-
- ---
-
- # RuBERTConv Toxic Editor
-
- ## Model description
-
- Tagging model for detoxification based on [rubert-base-cased-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational).
-
- 4 possible classes:
- - Equal = save tokens
- - Replace = replace tokens with mask
- - Delete = remove tokens
- - Insert = insert mask before tokens
-
- Use in pair with [mask filler](https://huggingface.co/IlyaGusev/sber_rut5_filler).
-
- ## Intended uses & limitations
-
- #### How to use
-
- Colab: [link](https://colab.research.google.com/drive/1NUSO1QGlDgD-IWXa2SpeND089eVxrCJW)
-
- ```python
- import torch
- from transformers import AutoTokenizer, pipeline
-
- tagger_model_name = "IlyaGusev/rubertconv_toxic_editor"
-
- device = "cuda" if torch.cuda.is_available() else "cpu"
- device_num = 0 if device == "cuda" else -1
- tagger_pipe = pipeline(
- "token-classification",
- model=tagger_model_name,
- tokenizer=tagger_model_name,
- framework="pt",
- device=device_num,
- aggregation_strategy="max"
- )
-
- text = "..."
- tagger_predictions = tagger_pipe([text], batch_size=1)
- sample_predictions = tagger_predictions[0]
- print(sample_predictions)
- ```
-
- ## Training data
-
- - Dataset: [russe_detox_2022](https://github.com/skoltech-nlp/russe_detox_2022/tree/main/data)
-
- ## Training procedure
-
- TBA
-
- ## Eval results
-
+ ---
+ language:
+ - ru
+ - ru-RU
+ tags:
+ - token-classification
+ license: apache-2.0
+ widget:
+ - text: Ёпта, меня зовут придурок и я живу в жопе
+
+ ---
+
+ # RuBERTConv Toxic Editor
+
+ ## Model description
+
+ Tagging model for detoxification based on [rubert-base-cased-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational).
+
+ 4 possible classes:
+ - Equal = save tokens
+ - Replace = replace tokens with mask
+ - Delete = remove tokens
+ - Insert = insert mask before tokens
+
+ Use in pair with [mask filler](https://huggingface.co/IlyaGusev/sber_rut5_filler).
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ Colab: [link](https://colab.research.google.com/drive/1NUSO1QGlDgD-IWXa2SpeND089eVxrCJW)
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, pipeline
+
+ tagger_model_name = "IlyaGusev/rubertconv_toxic_editor"
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ device_num = 0 if device == "cuda" else -1
+ tagger_pipe = pipeline(
+ "token-classification",
+ model=tagger_model_name,
+ tokenizer=tagger_model_name,
+ framework="pt",
+ device=device_num,
+ aggregation_strategy="max"
+ )
+
+ text = "..."
+ tagger_predictions = tagger_pipe([text], batch_size=1)
+ sample_predictions = tagger_predictions[0]
+ print(sample_predictions)
+ ```
+
+ ## Training data
+
+ - Dataset: [russe_detox_2022](https://github.com/skoltech-nlp/russe_detox_2022/tree/main/data)
+
+ ## Training procedure
+
+ - Parallel corpus convertion: [compute_tags.py](https://github.com/IlyaGusev/rudetox/blob/main/rudetox/marker/compute_tags.py)
+ - Training script: [train.py](https://github.com/IlyaGusev/rudetox/blob/main/rudetox/marker/train.py)
+ - Pipeline step: [dvc.yaml, train_marker](https://github.com/IlyaGusev/rudetox/blob/main/dvc.yaml#L367)
+
+ ## Eval results
+
  TBA
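
Note on the four tag classes listed in the README's model description: they read as edit operations over the tagged spans, and the sketch below shows one way they could be applied to build a masked sentence for the mask filler. This is an illustration only, not the author's code. The function name `apply_tags`, the `[MASK]` placeholder, the label spellings, and the choice of one mask per tagged span are all assumptions; the labels the model actually emits and the mask format the filler expects may differ.

```python
from typing import List, Tuple

# Hypothetical placeholder: the real filler model may expect a different mask token.
MASK = "[MASK]"


def apply_tags(spans: List[Tuple[str, str]], mask: str = MASK) -> List[str]:
    """Apply Equal/Replace/Delete/Insert tags to (text, tag) spans.

    Assumes label names match the README's class names (case-insensitive)
    and that each tagged span yields at most one mask.
    """
    result: List[str] = []
    for text, tag in spans:
        label = tag.lower()
        if label == "equal":
            result.append(text)      # keep the span unchanged
        elif label == "replace":
            result.append(mask)      # swap the span for a mask
        elif label == "delete":
            continue                 # drop the span entirely
        elif label == "insert":
            result.append(mask)      # put a mask before the span...
            result.append(text)      # ...and keep the span itself
        else:
            raise ValueError(f"Unknown tag: {tag!r}")
    return result


# Made-up example spans; a real run would take them from the tagger pipeline output.
example = [("Hello", "Equal"), ("stupid", "Replace"), ("damn", "Delete"), ("world", "Insert")]
masked = " ".join(apply_tags(example))
print(masked)
```

The masked string would then be handed to the mask filler, which the README links as a separate model.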