---
language:
- ru
tags:
- sentence-similarity
- text-classification
- paraphrase-detection
datasets:
- merionum/ru_paraphraser
---

This is a [ruBERT-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational) model trained on a mixture of three paraphrase detection datasets:
- [ru_paraphraser](https://huggingface.co/merionum/ru_paraphraser) (with classes -1 and 0 merged)
- [RuPAWS](https://github.com/ivkrotova/rupaws_dataset)
- A dataset of crowdsourced evaluations of content preservation in Russian text detoxification by [Dementieva et al., 2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf).
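
The label merge mentioned for ru_paraphraser can be sketched as follows. This is only an illustration: it assumes that class `1` marks paraphrases and that the merged classes `-1` and `0` form the negative class. The function name `merge_paraphraser_label` is hypothetical and is not taken from the training notebook.

```python
def merge_paraphraser_label(label):
    """Map a ru_paraphraser class (-1, 0 or 1) to a binary paraphrase label.

    Assumption: 1 is the positive (paraphrase) class, while -1 and 0
    are merged into a single negative class.
    """
    label = int(label)  # accepts both integers and strings such as "-1"
    if label not in (-1, 0, 1):
        raise ValueError(f"unexpected ru_paraphraser class: {label}")
    return 1 if label == 1 else 0


# Example: classes stored as strings are converted to binary labels.
binary_labels = [merge_paraphraser_label(c) for c in ["-1", "0", "1"]]
```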

The model can be used to assess the semantic similarity of Russian sentences.
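
A minimal usage sketch with the `transformers` library is given below. It rests on several assumptions that this card does not confirm: the checkpoint loads with `AutoModelForSequenceClassification`, it takes both sentences as one paired input (the notebook path suggests a cross-encoder), and it outputs two logits with index `1` meaning "paraphrase". `MODEL_NAME` is a placeholder to be replaced with the actual repository id, and all function names are illustrative.

```python
import math

# Placeholder (hypothetical): replace with the repository id of this model.
MODEL_NAME = "<this-model-repo-id>"


def load_cross_encoder(model_name=MODEL_NAME):
    """Load the tokenizer and the classification model from the Hub."""
    # Imported lazily so that the pure-Python helper below works without these packages.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def paraphrase_probability(text1, text2, tokenizer, model, positive_index=1):
    """Score one sentence pair; positive_index=1 is an assumption about label order."""
    import torch

    # Both sentences are encoded together as a single paired input.
    batch = tokenizer(text1, text2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits[0].tolist()
    return softmax(logits)[positive_index]
```

With a real repository id in place, `tokenizer, model = load_cross_encoder()` followed by `paraphrase_probability("Я тебя люблю", "Ты мне нравишься", tokenizer, model)` would return a score between 0 and 1. Whether a higher score means "more similar" depends on the label order assumed above.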

Training notebook: `task_oriented_TST/similarity/cross_encoders/russian/train_russian_paraphrase_detector__fixed.ipynb` (in a private repo). 

Training parameters: 
* optimizer: Adam
* `lr=1e-5`
* `batch_size=32`
* `epochs=3`

ROC AUC on the development data:
```
source         score
detox          0.821665
paraphraser    0.848287
rupaws_qqp     0.761481
rupaws_wiki    0.844093
```
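
For readers who want to reproduce this kind of per-source breakdown on their own data, the sketch below computes ROC AUC separately for each source. It is not the evaluation code from the training notebook: the pairwise-ranking formulation, the function names, and the toy rows are all illustrative, and the toy scores are made up rather than produced by the model.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. The pairwise loop is quadratic, which is
    fine for small development sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_auc_by_source(rows):
    """Group (source, label, score) rows by source and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, score in rows:
        grouped[source][0].append(label)
        grouped[source][1].append(score)
    return {source: roc_auc(ys, ss) for source, (ys, ss) in grouped.items()}


# Made-up rows: (source, gold label, predicted paraphrase probability).
toy_rows = [
    ("paraphraser", 1, 0.9),
    ("paraphraser", 0, 0.2),
    ("paraphraser", 1, 0.4),
    ("paraphraser", 0, 0.6),
    ("detox", 1, 0.8),
    ("detox", 0, 0.3),
]
per_source = roc_auc_by_source(toy_rows)
```

For larger datasets, `sklearn.metrics.roc_auc_score` computes the same quantity more efficiently.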

Please see also the documentation of [SkolkovoInstitute/ruRoberta-large-paraphrase-v1](https://huggingface.co/SkolkovoInstitute/ruRoberta-large-paraphrase-v1), which performs better on this task.