s-nlp/rubert-base-cased-conversational-paraphrase-v1


Update README.md

4645cf5 almost 2 years ago


	---
	language:
	- ru
	tags:
	- sentence-similarity
	- text-classification
	- paraphrase-detection
	datasets:
	- merionum/ru_paraphraser
	---

	This is a [ruBERT-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational) model trained on a mixture of three paraphrase detection datasets:
	- [ru_paraphraser](https://huggingface.co/merionum/ru_paraphraser) (with classes -1 and 0 merged)
	- [RuPAWS](https://github.com/ivkrotova/rupaws_dataset)
	- A dataset of crowdsourced evaluations of content preservation in Russian text detoxification, from [Dementieva et al., 2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf).
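
The class merging mentioned for ru_paraphraser can be sketched as a simple label mapping. This is only an illustration: the card does not say which binary id the merged class receives, so mapping `-1` and `0` to `0` (not a paraphrase) and `1` to `1` (paraphrase) is an assumption, and `binarize_label` is a hypothetical helper name.

```python
# Hypothetical mapping: the card only states that classes -1 and 0 are merged.
# Treating the merged class as the negative label (0) is an assumption.
MERGED_LABELS = {-1: 0, 0: 0, 1: 1}


def binarize_label(raw_label) -> int:
    """Map a ru_paraphraser class (-1, 0 or 1, as int or str) to a binary label."""
    return MERGED_LABELS[int(raw_label)]


# Works for integer labels and for labels stored as strings.
print([binarize_label(c) for c in (-1, 0, 1, "-1", "1")])
```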

	The model can be used to assess semantic similarity of Russian sentences.
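
A minimal usage sketch with the `transformers` library follows. It assumes the checkpoint loads as a sequence-classification cross-encoder (both sentences are fed in one input) and that output index 1 corresponds to the "paraphrase" class; neither detail is stated in this card, so check `model.config` before relying on it. The helper names `paraphrase_probability` and `load_scorer` are hypothetical.

```python
import math

MODEL_NAME = "s-nlp/rubert-base-cased-conversational-paraphrase-v1"


def paraphrase_probability(logits):
    """Turn raw classifier logits into a paraphrase probability.

    Assumption: with two or more logits, index 1 is the "paraphrase" class;
    a single logit is treated as a sigmoid score.
    """
    if len(logits) == 1:
        return 1.0 / (1.0 + math.exp(-logits[0]))
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def load_scorer(model_name=MODEL_NAME):
    """Download the model and return a function scoring one sentence pair."""
    # Imported here so the helper above works without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def score(text1, text2):
        # Encode both sentences as a single pair input.
        batch = tokenizer(text1, text2, truncation=True, max_length=512,
                          return_tensors="pt")
        with torch.inference_mode():
            logits = model(**batch).logits[0].tolist()
        return paraphrase_probability(logits)

    return score


# Example call (downloads the weights on first use):
# score = load_scorer()
# print(score("Я тебя люблю", "Ты мне нравишься"))
```

Since the input is an ordered pair, `score(a, b)` and `score(b, a)` may differ; averaging the two is one way to obtain a symmetric similarity.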

	Training notebook: `task_oriented_TST/similarity/cross_encoders/russian/train_russian_paraphrase_detector__fixed.ipynb` (in a private repo).

	Training parameters:
	* optimizer: Adam
	* `lr=1e-5`
	* `batch_size=32`
	* `epochs=3`
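
The parameters above can be collected in a small configuration object, sketched here. `TrainConfig` and `total_steps` are hypothetical names, and the step count assumes no gradient accumulation and that the last partial batch of each epoch is kept; the training notebook itself is private, so its actual setup may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in this card (the class itself is illustrative)."""
    optimizer: str = "Adam"
    lr: float = 1e-5
    batch_size: int = 32
    epochs: int = 3

    def total_steps(self, n_examples: int) -> int:
        """Optimizer steps for n_examples, assuming one step per batch."""
        return math.ceil(n_examples / self.batch_size) * self.epochs


config = TrainConfig()
print(config.total_steps(10_000))  # 10_000 is a made-up dataset size
```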

	ROC AUC on the development data:
	```
	source score
	detox 0.821665
	paraphraser 0.848287
	rupaws_qqp 0.761481
	rupaws_wiki 0.844093
	```
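
For readers who want to reproduce a per-source breakdown like the one above, here is a dependency-free sketch of ROC AUC computed separately for each source. It uses the pairwise definition (the probability that a random positive example is scored above a random negative one, with ties counted as one half). The rows are made-up toy values, not the development data, and `roc_auc_by_source` is a hypothetical helper.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_auc_by_source(rows):
    """rows: iterable of (source, binary_label, model_score) triples."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, score in rows:
        grouped[source][0].append(label)
        grouped[source][1].append(score)
    return {src: roc_auc(ys, ss) for src, (ys, ss) in grouped.items()}


# Made-up toy rows, for illustration only.
toy_rows = [
    ("toy_a", 1, 0.9), ("toy_a", 0, 0.2), ("toy_a", 1, 0.4), ("toy_a", 0, 0.6),
    ("toy_b", 1, 0.8), ("toy_b", 0, 0.1),
]
print(roc_auc_by_source(toy_rows))
```

The quadratic pair loop is fine for small development sets; for larger data a rank-based implementation such as `sklearn.metrics.roc_auc_score` is the usual choice.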

	Please see also the documentation of [SkolkovoInstitute/ruRoberta-large-paraphrase-v1](https://huggingface.co/SkolkovoInstitute/ruRoberta-large-paraphrase-v1), which performs better on this task.