---
language: ["ru"]
tags:
- sentence-similarity
- text-classification
datasets:
- merionum/ru_paraphraser
---

This is a version of the paraphrase detector by DeepPavlov ([details in the documentation](http://docs.deeppavlov.ai/en/master/features/overview.html#ranking-model-docs)) ported to the `Transformers` format.

All credit goes to the authors of DeepPavlov.

The model has been trained on the dataset from http://paraphraser.ru/.

It classifies texts as paraphrases (class 1) or non-paraphrases (class 0).

```python
import torch
from transformers import AutoModelForSequenceClassification, BertTokenizer

model_name = 'cointegrated/rubert-base-cased-dp-paraphrase-detection'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
if torch.cuda.is_available():
    model = model.cuda()  # move to GPU only if one is present
tokenizer = BertTokenizer.from_pretrained(model_name)

def compare_texts(text1, text2):
    # Encode the pair of texts and run them through the classifier
    batch = tokenizer(text1, text2, return_tensors='pt').to(model.device)
    with torch.inference_mode():
        proba = torch.softmax(model(**batch).logits, -1).cpu().numpy()
    return proba[0]  # [p(non-paraphrase), p(paraphrase)]

print(compare_texts('Сегодня на улице хорошая погода', 'Сегодня на улице отвратительная погода'))
# [0.7056226 0.2943774]
print(compare_texts('Сегодня на улице хорошая погода', 'Отличная погодка сегодня выдалась'))
# [0.16524374 0.8347562 ]
```
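If you need a hard class label rather than probabilities, you can take the argmax of the returned array. This is a minimal sketch; the `classify` helper and the default 0.5 threshold are illustrative choices, not part of the original model:

```python
import numpy as np

def classify(proba, threshold=0.5):
    # proba = [p(non-paraphrase), p(paraphrase)], as returned by compare_texts.
    # Returns 1 (paraphrase) when p(paraphrase) reaches the threshold, else 0.
    return int(proba[1] >= threshold)

print(classify(np.array([0.7056226, 0.2943774])))   # 0: non-paraphrase
print(classify(np.array([0.16524374, 0.8347562])))  # 1: paraphrase
```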

P.S. In the DeepPavlov repository, the tokenizer uses `max_seq_length=64`, whereas this model uses `model_max_length=512`. Since the model never saw longer inputs during training, its predictions on texts longer than 64 tokens may be unreliable.