mlsa-iai-msu-lab committed on
Commit
06220cf
1 Parent(s): 4ad6b55

Update README.md

Files changed (1)
  1. README.md +62 -1
README.md CHANGED
- transformers
widget:
- text: Метод опорных векторов
---
SciRus-tiny is a model for obtaining embeddings of scientific texts in Russian and English. The model was trained on [eLibrary](https://www.elibrary.ru/) data using the contrastive techniques described in the [Habr post]() and achieves good metric values on the [ruSciBench](https://github.com/mlsa-iai-msu-lab/ru_sci_bench/tree/main) benchmark.

### How to get embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda()  # uncomment if you want to use a GPU


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average the token embeddings, ignoring padding positions
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Join the title and abstract with the separator token, then tokenize
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings to unit length
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]


print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)
```
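
Since the returned embeddings are L2-normalized, the dot product of two vectors is their cosine similarity. A minimal sketch building on `get_sentence_embedding` above (the titles and abstracts are made-up placeholders):

```python
import numpy as np

# Made-up title/abstract pairs, for illustration only
emb_a = get_sentence_embedding('Метод опорных векторов', 'Обзор применения SVM', model, tokenizer)
emb_b = get_sentence_embedding('Support vector machines', 'A survey of SVM applications', model, tokenizer)

# Dot product of unit-length vectors equals cosine similarity
print(float(np.dot(emb_a, emb_b)))
```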

Alternatively, you can use the `sentence_transformers` library:
```python
from sentence_transformers import SentenceTransformer


model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['привет мир'])  # 'hello world'
print(embeddings[0].shape)
# (312,)
```
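
The same idea extends to ranking: `sentence_transformers.util.cos_sim` scores a query against several documents at once. A small sketch with illustrative strings (not from the original card):

```python
from sentence_transformers import SentenceTransformer, util


model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')

# Illustrative query and documents; replace with your own texts
query_emb = model.encode('машинное обучение для научных текстов')  # 'machine learning for scientific texts'
doc_embs = model.encode([
    'метод опорных векторов',  # 'support vector machine'
    'рецепт борща',            # 'borscht recipe'
])

# Cosine similarity of the query to each document, shape (1, 2)
print(util.cos_sim(query_emb, doc_embs))
```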

### Authors
The benchmark was developed by the MLSA Lab of the Institute for AI, MSU.

### Acknowledgement
We would like to thank [eLibrary](https://elibrary.ru/) for the provided datasets.

### Contact
Nikolai Gerasimenko (nikgerasimenko@gmail.com).