---
language:
- ru
tags:
- sentence-similarity
- text-classification
- paraphrase-detection
datasets:
- merionum/ru_paraphraser
- ivkrotova/rupaws_dataset
- "a private dataset of manual evaluation of text detoxification"
---

This is a [ruBERT-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational) model trained on a mixture of 3 paraphrase detection datasets:
- [ru_paraphraser](https://huggingface.co/merionum/ru_paraphraser)
- [RuPAWS](https://github.com/ivkrotova/rupaws_dataset)
- A dataset containing crowdsourced evaluation of content preservation in Russian text detoxification by [Dementieva et al., 2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf).

Training notebook: `task_oriented_TST/similarity/cross_encoders/russian/train_russian_paraphrase_detector__fixed.ipynb` (in a private repo).

Training parameters:
* optimizer: Adam
* `lr=1e-5`
* `batch_size=32`
* `epochs=3`

ROC AUC on the development data:
```
source       score
detox        0.821665
paraphraser  0.848287
rupaws_qqp   0.761481
rupaws_wiki  0.844093
```
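
The per-source ROC AUC figures above group development examples by their source dataset and score each group separately. The sketch below shows one way such a breakdown could be computed. It is not the code from the private training notebook: the function names `roc_auc` and `roc_auc_by_source`, the `(source, label, score)` row layout, the convention that label `1` means "paraphrase", and the toy rows are all assumptions made for illustration.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    Assumes binary labels where 1 is the positive class (hypothetical
    convention: 1 = paraphrase). Tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative example")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the end of the run of examples sharing the same score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


def roc_auc_by_source(rows):
    """Group (source, label, score) rows by source and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, score in rows:
        grouped[source][0].append(label)
        grouped[source][1].append(score)
    return {
        source: roc_auc(labels, scores)
        for source, (labels, scores) in sorted(grouped.items())
    }


# Toy rows for illustration only; these are NOT the model's real predictions.
dev_rows = [
    ("detox", 1, 0.9), ("detox", 0, 0.2), ("detox", 1, 0.4), ("detox", 0, 0.6),
    ("paraphraser", 1, 0.8), ("paraphraser", 0, 0.1),
    ("paraphraser", 1, 0.7), ("paraphraser", 0, 0.3),
]

result = roc_auc_by_source(dev_rows)
for source, auc in result.items():
    print(f"{source}\t{auc:.6f}")
```

Scoring each source separately keeps a large or easy dataset from masking weaker performance on a harder one, which a single pooled ROC AUC would hide.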