cointegrated committed on
Commit
960d7f6
1 Parent(s): 0cb871a

Create README.md

Files changed (1)
  1. README.md +44 -0
README.md ADDED
@@ -0,0 +1,44 @@
+ ---
+ language: ["ru"]
+ tags:
+ - russian
+ - fill-mask
+ - pretraining
+ - embeddings
+ - masked-lm
+ - tiny
+ license: mit
+ widget:
+ - text: "Миниатюрная модель для [MASK] разных задач."
+ ---
+ This is an updated version of [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny): a small Russian BERT-based encoder with high-quality sentence embeddings.
+
+ The differences from the previous version include:
+ - a larger vocabulary: 83828 tokens instead of 29564;
+ - longer supported sequences: 2048 tokens instead of 512;
+ - sentence embeddings approximate LaBSE more closely than before;
+ - the model is focused only on Russian.
+
+ The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.
+
+ Sentence embeddings can be produced as follows:
+
+ ```python
+ # pip install transformers sentencepiece
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+ tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
+ model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
+ # model.cuda()  # uncomment this line if you have a GPU
+
+ def embed_bert_cls(text, model, tokenizer):
+     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
+     with torch.no_grad():  # inference only, no gradients needed
+         model_output = model(**{k: v.to(model.device) for k, v in t.items()})
+     embeddings = model_output.last_hidden_state[:, 0, :]  # hidden state of the first ([CLS]) token
+     embeddings = torch.nn.functional.normalize(embeddings)  # L2-normalize each embedding
+     return embeddings[0].cpu().numpy()  # embedding of the first (or only) input text
+
+ print(embed_bert_cls('привет мир', model, tokenizer).shape)
+ # (312,)
+ ```
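
The README mentions KNN classification of short texts as one use of the embeddings. A minimal sketch of that idea follows, assuming the vectors are L2-normalized (as `embed_bert_cls` returns them), so that a dot product equals cosine similarity. The helper names `l2_normalize` and `knn_predict`, the labels, and the 3-dimensional toy vectors are hypothetical stand-ins, not part of the model or the original README; with the real model each vector would have 312 dimensions.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper for the toy data)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_predict(query, train_vecs, train_labels, k=3):
    """Return the majority label among the k training vectors closest to query.

    All vectors are assumed to be unit-length, so the dot product
    is the cosine similarity.
    """
    sims = [sum(q * t for q, t in zip(query, vec)) for vec in train_vecs]
    # Indices of the k most similar training vectors, best first
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return Counter(train_labels[i] for i in top).most_common(1)[0][0]


# Toy stand-ins for sentence embeddings (assumption: real ones come from the model)
train_vecs = [
    l2_normalize(v)
    for v in [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]
]
train_labels = ["greeting", "greeting", "weather", "weather"]
query = l2_normalize([0.8, 0.3, 0.0])

print(knn_predict(query, train_vecs, train_labels, k=3))
```

With real data, the toy vectors could be replaced by `embed_bert_cls(text, model, tokenizer).tolist()` for each labeled text. For larger datasets, a library implementation such as scikit-learn's `KNeighborsClassifier` would likely be a better fit than this pure-Python loop.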