cointegrated committed on
Commit
e3cbd20
1 Parent(s): 1795234

Create README.md

  1. README.md +34 -0
README.md ADDED
@@ -0,0 +1,34 @@
+ ---
+ language:
+ - ru
+ tags:
+ - sentence-similarity
+ - text-classification
+ - paraphrase-detection
+ datasets:
+ - merionum/ru_paraphraser
+ - ivkrotova/rupaws_dataset
+ - "a private dataset of manual evaluation of text detoxification"
+ ---
+
+ This is a [ruBERT-conversational](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational) model trained on a mixture of three paraphrase detection datasets:
+ - [ru_paraphraser](https://huggingface.co/merionum/ru_paraphraser)
+ - [RuPAWS](https://github.com/ivkrotova/rupaws_dataset)
+ - A dataset containing crowdsourced evaluations of content preservation in Russian text detoxification by [Dementieva et al., 2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf).
+
+ Training notebook: `task_oriented_TST/similarity/cross_encoders/russian/train_russian_paraphrase_detector__fixed.ipynb` (in a private repo).
+
+ Training parameters:
+ * optimizer: Adam
+ * `lr=1e-5`
+ * `batch_size=32`
+ * `epochs=3`
+
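As a rough illustration of how the training parameters above translate into a training budget, the sketch below stores them in a small config object and counts optimizer steps. The `TrainConfig` class, the `total_steps` helper, and the training-set size of 10,000 pairs are hypothetical: the README does not state the dataset size, and the sketch assumes the last partial batch of each epoch is kept rather than dropped.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in the README (the class itself is illustrative)."""
    optimizer: str = "Adam"
    lr: float = 1e-5
    batch_size: int = 32
    epochs: int = 3

    def total_steps(self, num_examples: int) -> int:
        # Assumption: the final, smaller batch of each epoch is still used.
        steps_per_epoch = math.ceil(num_examples / self.batch_size)
        return steps_per_epoch * self.epochs


config = TrainConfig()

# Hypothetical training-set size; the real size is not given in the README.
HYPOTHETICAL_NUM_EXAMPLES = 10_000
print(config.total_steps(HYPOTHETICAL_NUM_EXAMPLES))
```

With a different dataset size, only the argument to `total_steps` changes; the hyperparameters themselves stay as listed.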
+ ROC AUC on the development data:
+ ```
+ source       score
+ detox        0.821665
+ paraphraser  0.848287
+ rupaws_qqp   0.761481
+ rupaws_wiki  0.844093
+ ```
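
A per-source ROC AUC table like the one above could be produced along the following lines. This is a minimal sketch, not the code from the private training notebook: the helper names `roc_auc` and `roc_auc_by_source` are made up for illustration, and the records are toy values, not the actual development data. ROC AUC is computed here from its pairwise definition, the probability that a randomly chosen positive pair is scored higher than a randomly chosen negative one, with ties counted as one half.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def roc_auc_by_source(records):
    """Group (source, label, score) triples by source and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, score in records:
        grouped[source][0].append(label)
        grouped[source][1].append(score)
    return {src: roc_auc(ys, ss) for src, (ys, ss) in grouped.items()}


# Toy (source, gold label, predicted score) triples, invented for illustration.
toy_records = [
    ("detox", 1, 0.9), ("detox", 0, 0.2), ("detox", 1, 0.4), ("detox", 0, 0.6),
    ("paraphraser", 1, 0.8), ("paraphraser", 0, 0.3),
    ("paraphraser", 1, 0.7), ("paraphraser", 0, 0.1),
]

for source, auc in roc_auc_by_source(toy_records).items():
    print(f"{source}\t{auc:.6f}")
```

The pairwise loop is quadratic in the group size, which is fine for a sketch; for real development sets a library routine such as scikit-learn's `roc_auc_score` would be the more practical choice.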