sberbank-ai committed on
Commit
6dedaff
1 Parent(s): fc14f6f

Update README.md

Files changed (1)
  1. README.md +36 -0
README.md CHANGED
@@ -1,3 +1,39 @@
  ---
  license: apache-2.0
+ language:
+ - ru
+ tags:
+ - PyTorch
+ - Tensorflow
+ - Transformers
  ---
+
+ # RU-ELECTRA small model (cased) for Sentence Embeddings in Russian
+
+ For better quality, use mean pooling over the token embeddings.
+ ## Usage (HuggingFace Models Repository)
+ You can use the model directly from the model repository to compute sentence embeddings:
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+ # Mean pooling: take the attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
+     sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+     return sum_embeddings / sum_mask
+ # Sentences we want sentence embeddings for
+ sentences = ['Привет! Как твои дела?',
+              'А правда, что 42 твое любимое число?']
+ # Load the tokenizer and model from the Hugging Face model repository
+ tokenizer = AutoTokenizer.from_pretrained("Andrilko/ru_s_electra_small")
+ model = AutoModel.from_pretrained("Andrilko/ru_s_electra_small")
+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=24, return_tensors='pt')
+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+ # Perform pooling (mean pooling in this case)
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+ ```
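
Not part of the commit above, but as a quick follow-up: the pooled embeddings can be compared with cosine similarity. This is a minimal sketch that assumes the `sentence_embeddings` tensor produced by the README snippet (one row per input sentence):

```python
import torch.nn.functional as F

# Assumes `sentence_embeddings` from the README snippet above:
# a [num_sentences, hidden_size] tensor, one row per input sentence.
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f"Cosine similarity between the two example sentences: {similarity.item():.4f}")
```

Because mean pooling already averages only over non-padding tokens, the embeddings can be compared directly; normalising them first with `torch.nn.functional.normalize` makes the plain dot product equal to cosine similarity.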