---
license: mit
language:
- ru
tags:
- PyTorch
- Transformers
---

# ruELECTRA large multitask model (cased) for sentence embeddings in Russian

For details on the ELECTRA model family, see https://arxiv.org/abs/2003.10555

## Usage (Hugging Face Models Repository)

You can use the model directly from the model repository to compute sentence embeddings. For better quality, use mean pooling over the token embeddings, as in the example below:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

# Sentences we want sentence embeddings for
# ("Hi! How are you?" / "Is it true that 42 is your favorite number?")
sentences = ['Привет! Как твои дела?',
             'А правда, что 42 твое любимое число?']

# Load the tokenizer and model from the Hugging Face model repository
tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruElectra-large")
model = AutoModel.from_pretrained("ai-forever/ruElectra-large")

# Tokenize the sentences (padding/truncation so they fit in one batch)
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=24, return_tensors='pt')

# Compute token embeddings (no gradients needed at inference time)
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling; in this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
```
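
The resulting embeddings can be compared with cosine similarity. Below is a minimal sketch of how one might score the two example sentences against each other, continuing from the snippet above; the normalization step and variable names are illustrative choices, not part of the original example:

```python
import torch.nn.functional as F

# L2-normalize the embeddings so that dot products equal cosine similarity
# (illustrative post-processing, not prescribed by the model card)
normalized = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity between the two example sentences
similarity = (normalized[0] @ normalized[1]).item()
print(f"Cosine similarity: {similarity:.4f}")
```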

# Authors
+ [SaluteDevices](https://sberdevices.ru/) R&D Team;
+ Aleksandr Abramov: [HF profile](https://huggingface.co/Andrilko), [GitHub](https://github.com/Ab1992ao), [Kaggle Competitions Master](https://www.kaggle.com/andrilko);
+ Mark Baushenko: [HF profile](https://huggingface.co/e0xexrazy);
+ Artem Snegirev: [HF profile](https://huggingface.co/artemsnegirev).