cointegrated
committed on
Commit
•
e3cbd20
1
Parent(s):
1795234
Create README.md
README.md
ADDED
@@ -0,0 +1,34 @@
+---
+language:
+- ru
+tags:
+- sentence-similarity
+- text-classification
+- paraphrase-detection
+datasets:
+- merionum/ru_paraphraser
+- ivkrotova/rupaws_dataset
+- "a private dataset of manual evaluation of text detoxification"
+---
+
+This is a [ruBERT-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational) model trained on the mixture of 3 paraphrase detection datasets:
+- [ru_paraphraser](https://huggingface.co/merionum/ru_paraphraser)
+- [RuPAWS](https://github.com/ivkrotova/rupaws_dataset)
+- A dataset containing crowdsourced evaluation of content preservation in Russian text detoxification by [Dementieva et al, 2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf).
+
+Training notebook: `task_oriented_TST/similarity/cross_encoders/russian/train_russian_paraphrase_detector__fixed.ipynb` (in a private repo).
+
+Training parameters:
+* optimizer: Adam
+* `lr=1e-5`
+* `batch_size=32`
+* `epochs=3`
+
+ROC AUC on the development data:
+```
+source         score
+detox          0.821665
+paraphraser    0.848287
+rupaws_qqp     0.761481
+rupaws_wiki    0.844093
+```
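
The README reports ROC AUC separately for each development source. The training notebook is private, so the following is only a minimal sketch of how such a per-source breakdown could be computed, not the author's evaluation code. The function names `roc_auc` and `auc_by_source` and the toy rows are assumptions made for illustration; the scores are invented and are not model output.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive example
    gets a higher score than a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_source(rows):
    """Group (source, label, score) rows by source and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, score in rows:
        grouped[source][0].append(label)
        grouped[source][1].append(score)
    return {src: roc_auc(labels, scores) for src, (labels, scores) in grouped.items()}


# Hypothetical rows: (source, gold label, predicted paraphrase score).
# The values are made up for illustration only.
toy_rows = [
    ("paraphraser", 1, 0.9),
    ("paraphraser", 0, 0.2),
    ("paraphraser", 1, 0.4),
    ("paraphraser", 0, 0.6),
    ("detox", 1, 0.8),
    ("detox", 0, 0.3),
]

result = auc_by_source(toy_rows)
for source, score in sorted(result.items()):
    print(f"{source:<12} {score:.6f}")
```

The pairwise loop is quadratic in group size, which is fine for a small development set; for larger data a rank-based implementation such as scikit-learn's `roc_auc_score` would be the usual choice.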