---
language:
- ru
tags:
- sentence-similarity
- text-classification
- paraphrase-detection
datasets:
- merionum/ru_paraphraser
---

This is a [ruBERT-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational) model trained on a mixture of three paraphrase detection datasets:

- [ru_paraphraser](https://huggingface.co/merionum/ru_paraphraser) (with classes -1 and 0 merged)
- [RuPAWS](https://github.com/ivkrotova/rupaws_dataset)
- A dataset of crowdsourced evaluations of content preservation in Russian text detoxification by [Dementieva et al., 2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf).

The model can be used to assess the semantic similarity of Russian sentences.

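A minimal sketch of how such a similarity score might be obtained with the `transformers` library is given below. It assumes the model is loaded as a sequence-classification cross-encoder and that class index 1 means "paraphrase"; neither is stated in this card. `MODEL_ID` is a hypothetical placeholder for this model's Hub ID, and `load_scorer` and `symmetric_similarity` are illustrative names, not part of any library.

```python
from typing import Callable

# Hypothetical placeholder: substitute the Hub ID of this model.
MODEL_ID = "<this-model-id>"


def load_scorer(model_id: str) -> Callable[[str, str], float]:
    """Return a function mapping a sentence pair to P(paraphrase)."""
    # Imported lazily so the pure helper below works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    def score(text1: str, text2: str) -> float:
        # Cross-encoder input: both sentences are encoded as one sequence.
        batch = tokenizer(text1, text2, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**batch).logits
        # Assumption: index 1 is the "paraphrase" class.
        return torch.softmax(logits, dim=-1)[0, 1].item()

    return score


def symmetric_similarity(
    score: Callable[[str, str], float], text1: str, text2: str
) -> float:
    """Average both input orders, since a cross-encoder need not be order-invariant."""
    return (score(text1, text2) + score(text2, text1)) / 2
```

With the model ID filled in, `symmetric_similarity(load_scorer(MODEL_ID), "Я тебя люблю", "Ты мне нравишься")` would return a value between 0 and 1. Averaging the two input orders is an optional design choice of this sketch, not something the card prescribes.
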
Training notebook: `task_oriented_TST/similarity/cross_encoders/russian/train_russian_paraphrase_detector__fixed.ipynb` (in a private repo).

Training parameters:

* optimizer: Adam
* `lr=1e-5`
* `batch_size=32`
* `epochs=3`

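Since the training notebook is private, the following is only a sketch of how the listed parameters could map onto a plain PyTorch fine-tuning loop. `TrainConfig`, `total_steps`, and `train` are hypothetical names. The loop assumes a Hugging Face classification model and a dataset of equally padded tensor dicts that include `labels`; the actual notebook may use a scheduler, warmup, or other details not listed here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the parameter list above.
    lr: float = 1e-5
    batch_size: int = 32
    epochs: int = 3


def total_steps(num_examples: int, cfg: TrainConfig) -> int:
    """Number of optimizer updates when the last partial batch is kept."""
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs


def train(model, dataset, cfg: TrainConfig = TrainConfig()):
    """Plain fine-tuning loop (assumed setup, not the author's notebook)."""
    import torch  # lazy import: only needed when training actually runs
    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=cfg.batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg.lr)
    model.train()
    for _ in range(cfg.epochs):
        for batch in loader:
            # HF classification models return a loss when `labels` are passed.
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```

For example, under these settings a training set of 100 pairs would give `total_steps(100, TrainConfig())` optimizer updates, that is, 4 batches per epoch over 3 epochs.
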
ROC AUC on the development data:

```
source       score
detox        0.821665
paraphraser  0.848287
rupaws_qqp   0.761481
rupaws_wiki  0.844093
```

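For readers who want to reproduce a per-source breakdown like the one above on their own data, here is a dependency-free sketch. It computes ROC AUC as the probability that a random positive pair scores higher than a random negative pair, with ties counted as one half. The function names and the `(source, label, score)` row layout are assumptions made for illustration; the original evaluation code is not public, and the quadratic pairwise loop is only suitable for small development sets.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_auc_by_source(rows):
    """rows: iterable of (source, label, score) triples; returns {source: auc}."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, score in rows:
        grouped[source][0].append(label)
        grouped[source][1].append(score)
    return {src: roc_auc(ys, ss) for src, (ys, ss) in grouped.items()}
```
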
Please see also the documentation of [SkolkovoInstitute/ruRoberta-large-paraphrase-v1](https://huggingface.co/SkolkovoInstitute/ruRoberta-large-paraphrase-v1), which performs better on this task.