s-nlp/rubert-base-cased-conversational-paraphrase-v1

---
language:
  - ru
tags:
  - sentence-similarity
  - text-classification
  - paraphrase-detection
datasets:
  - merionum/ru_paraphraser
  - ivkrotova/rupaws_dataset
  - a private dataset of manual evaluation of text detoxification
---

This is a ruBERT-conversational model trained on a mixture of three paraphrase detection datasets:

- ru_paraphraser (with classes -1 and 0 merged)
- RuPAWS
- A dataset of crowdsourced evaluations of content preservation in Russian text detoxification (Dementieva et al., 2022)
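The class merging for ru_paraphraser can be sketched as follows. This is an illustration, not the code from the training notebook: it assumes that the merged classes -1 and 0 become the negative label and class 1 the positive label, and `binarize_paraphraser_label` is a hypothetical helper name.

```python
def binarize_paraphraser_label(label):
    """Map a ru_paraphraser class (-1, 0 or 1) to a binary label.

    Assumption: classes -1 and 0 are merged into the negative class (0),
    and class 1 is kept as the positive class (1).
    """
    value = int(label)  # accepts ints as well as strings such as "-1"
    if value not in (-1, 0, 1):
        raise ValueError(f"unexpected ru_paraphraser class: {label!r}")
    return 1 if value == 1 else 0


labels = [-1, 0, 1, "1", "-1"]
binary = [binarize_paraphraser_label(x) for x in labels]
print(binary)  # [0, 0, 1, 1, 0]
```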

The model can be used to assess the semantic similarity of pairs of Russian sentences.
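A minimal usage sketch with the `transformers` library is given below. It assumes that the checkpoint loads as a sequence classification model with two output labels and that index 1 is the paraphrase class; neither is stated in this card, so check `model.config.id2label` before relying on it. The helper names (`positive_class_probability`, `paraphrase_score`, `load_model`) are illustrative.

```python
import math

MODEL_NAME = "s-nlp/rubert-base-cased-conversational-paraphrase-v1"


def positive_class_probability(logits, positive_index=1):
    """Apply a softmax to raw logits and return one class probability."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    return exps[positive_index] / sum(exps)


def load_model(model_name=MODEL_NAME):
    """Download the tokenizer and the model from the Hugging Face Hub."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    return tokenizer, model


def paraphrase_score(text1, text2, tokenizer, model):
    """Score one sentence pair; both texts are encoded together as a pair."""
    import torch

    inputs = tokenizer(text1, text2, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # Assumption: index 1 is the "paraphrase" class.
    return positive_class_probability(logits)
```

To try it, call `tokenizer, model = load_model()` and then `paraphrase_score("Сегодня на улице хорошая погода", "Сегодня на улице отличная погода", tokenizer, model)`. Because the two sentences are encoded jointly, the score may differ when their order is swapped.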

Training notebook: task_oriented_TST/similarity/cross_encoders/russian/train_russian_paraphrase_detector__fixed.ipynb (in a private repo).

Training parameters:

- optimizer: Adam
- lr=1e-5
- batch_size=32
- epochs=3
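The parameters above could be collected in a small configuration object, as in the sketch below. The training notebook is private, so this is not the original code: `TrainConfig`, `total_optimizer_steps` and `build_optimizer` are hypothetical names, and only the four listed values come from this card.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in this card."""
    optimizer: str = "Adam"
    lr: float = 1e-5
    batch_size: int = 32
    epochs: int = 3


def total_optimizer_steps(n_examples, cfg):
    """Number of parameter updates, assuming one update per batch."""
    return cfg.epochs * math.ceil(n_examples / cfg.batch_size)


def build_optimizer(model, cfg):
    """Create a plain Adam optimizer over all parameters of the model."""
    import torch

    return torch.optim.Adam(model.parameters(), lr=cfg.lr)


cfg = TrainConfig()
# With a made-up training set of 10,000 pairs:
# 3 epochs * ceil(10000 / 32) = 3 * 313 = 939 updates.
print(total_optimizer_steps(10_000, cfg))  # 939
```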

ROC AUC on the development data:

| source      | score    |
|-------------|----------|
| detox       | 0.821665 |
| paraphraser | 0.848287 |
| rupaws_qqp  | 0.761481 |
| rupaws_wiki | 0.844093 |
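A per-source ROC AUC like the one reported here can be computed as sketched below. This is not the evaluation code behind the reported numbers, and the rows are made-up toy values, not the development data. ROC AUC is computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs both positive and negative examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def roc_auc_by_source(rows):
    """Group (source, label, score) rows by source and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, score in rows:
        grouped[source][0].append(label)
        grouped[source][1].append(score)
    return {src: roc_auc(ys, ss) for src, (ys, ss) in grouped.items()}


# Toy rows: (source, gold label, predicted paraphrase probability).
toy_rows = [
    ("detox", 1, 0.9), ("detox", 0, 0.2), ("detox", 1, 0.4), ("detox", 0, 0.6),
    ("paraphraser", 1, 0.8), ("paraphraser", 0, 0.1),
]
print(roc_auc_by_source(toy_rows))  # {'detox': 0.75, 'paraphraser': 1.0}
```

The pairwise loop is quadratic in the group size, so for real evaluation sets `sklearn.metrics.roc_auc_score` is the more practical choice.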