Update README.md
README.md CHANGED
@@ -1,66 +1,68 @@
----
-language:
-- ru
-- ru-RU
-tags:
-- token-classification
-license: apache-2.0
-widget:
-- text: Ёпта, меня зовут придурок и я живу в жопе
-
----
-
-# RuBERTConv Toxic Editor
-
-## Model description
-
-Tagging model for detoxification based on [rubert-base-cased-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational).
-
-4 possible classes:
-- Equal = save tokens
-- Replace = replace tokens with mask
-- Delete = remove tokens
-- Insert = insert mask before tokens
-
-Use in pair with [mask filler](https://huggingface.co/IlyaGusev/sber_rut5_filler).
-
-## Intended uses & limitations
-
-#### How to use
-
-Colab: [link](https://colab.research.google.com/drive/1NUSO1QGlDgD-IWXa2SpeND089eVxrCJW)
-
-```python
-import torch
-from transformers import AutoTokenizer, pipeline
-
-tagger_model_name = "IlyaGusev/rubertconv_toxic_editor"
-
-device = "cuda" if torch.cuda.is_available() else "cpu"
-device_num = 0 if device == "cuda" else -1
-tagger_pipe = pipeline(
-    "token-classification",
-    model=tagger_model_name,
-    tokenizer=tagger_model_name,
-    framework="pt",
-    device=device_num,
-    aggregation_strategy="max"
-)
-
-text = "..."
-tagger_predictions = tagger_pipe([text], batch_size=1)
-sample_predictions = tagger_predictions[0]
-print(sample_predictions)
-```
-
-## Training data
-
-- Dataset: [russe_detox_2022](https://github.com/skoltech-nlp/russe_detox_2022/tree/main/data)
-
-## Training procedure
-
-
-
-
-
+---
+language:
+- ru
+- ru-RU
+tags:
+- token-classification
+license: apache-2.0
+widget:
+- text: Ёпта, меня зовут придурок и я живу в жопе
+
+---
+
+# RuBERTConv Toxic Editor
+
+## Model description
+
+Tagging model for detoxification based on [rubert-base-cased-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational).
+
+4 possible classes:
+- Equal = save tokens
+- Replace = replace tokens with mask
+- Delete = remove tokens
+- Insert = insert mask before tokens
+
+Use in pair with [mask filler](https://huggingface.co/IlyaGusev/sber_rut5_filler).
+
+## Intended uses & limitations
+
+#### How to use
+
+Colab: [link](https://colab.research.google.com/drive/1NUSO1QGlDgD-IWXa2SpeND089eVxrCJW)
+
+```python
+import torch
+from transformers import AutoTokenizer, pipeline
+
+tagger_model_name = "IlyaGusev/rubertconv_toxic_editor"
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+device_num = 0 if device == "cuda" else -1
+tagger_pipe = pipeline(
+    "token-classification",
+    model=tagger_model_name,
+    tokenizer=tagger_model_name,
+    framework="pt",
+    device=device_num,
+    aggregation_strategy="max"
+)
+
+text = "..."
+tagger_predictions = tagger_pipe([text], batch_size=1)
+sample_predictions = tagger_predictions[0]
+print(sample_predictions)
+```
+
+## Training data
+
+- Dataset: [russe_detox_2022](https://github.com/skoltech-nlp/russe_detox_2022/tree/main/data)
+
+## Training procedure
+
+- Parallel corpus conversion: [compute_tags.py](https://github.com/IlyaGusev/rudetox/blob/main/rudetox/marker/compute_tags.py)
+- Training script: [train.py](https://github.com/IlyaGusev/rudetox/blob/main/rudetox/marker/train.py)
+- Pipeline step: [dvc.yaml, train_marker](https://github.com/IlyaGusev/rudetox/blob/main/dvc.yaml#L367)
+
+## Eval results
+
 TBA
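
The README lists four tag classes but does not show how they become input for the mask filler. The sketch below is one possible reading, not the author's implementation: it takes predictions in the list-of-dicts shape that an aggregated `token-classification` pipeline returns (`entity_group`, `word`) and builds a masked template. The function name `build_template`, the `<mask>` placeholder, the lower-cased label spellings, and the choice to collapse adjacent masks are all assumptions made for illustration. The actual filler model may expect a different mask token.

```python
from typing import Dict, List

MASK = "<mask>"  # hypothetical placeholder; the real filler may expect another token


def build_template(predictions: List[Dict], mask_token: str = MASK) -> str:
    """Turn aggregated tagger output into a masked template string."""
    pieces: List[str] = []

    def add_mask() -> None:
        # Collapse adjacent masks into one (a simplifying assumption).
        if not pieces or pieces[-1] != mask_token:
            pieces.append(mask_token)

    for pred in predictions:
        tag = pred["entity_group"].lower()  # label spelling is assumed
        word = pred["word"]
        if tag == "equal":
            pieces.append(word)      # keep the span unchanged
        elif tag == "replace":
            add_mask()               # the span is replaced by a mask
        elif tag == "delete":
            continue                 # the span is dropped
        elif tag == "insert":
            add_mask()               # a mask goes before the span
            pieces.append(word)
        else:
            raise ValueError(f"unexpected tag: {pred['entity_group']!r}")
    return " ".join(pieces)


# Hand-written predictions in the pipeline's output shape (not real model output).
example = [
    {"entity_group": "Delete", "word": "damn"},
    {"entity_group": "Equal", "word": "my name is"},
    {"entity_group": "Replace", "word": "idiot"},
    {"entity_group": "Equal", "word": "and"},
    {"entity_group": "Insert", "word": "I live here"},
]
print(build_template(example))
```

The resulting template string would then be passed to the mask filler, which proposes non-toxic text for each masked position.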