cointegrated committed on
Commit
a237fbd
1 Parent(s): 9b2af4a

Create README.md

Files changed (1)
  1. README.md +50 -0
README.md ADDED
@@ -0,0 +1,50 @@
+ ---
+ language: ["ru"]
+ tags:
+ - russian
+ license: mit
+ ---
+
+ This is the [rut5-base](https://huggingface.co/cointegrated/rut5-base) model, with the decoder fine-tuned to recover (approximately) Russian sentences from their [LaBSE](https://huggingface.co/setu4993/LaBSE) embeddings.
+
+ Usage:
+ ```python
+ import torch
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoModel
+ from transformers.modeling_outputs import BaseModelOutput
+
+ enc_tokenizer = AutoTokenizer.from_pretrained('cointegrated/LaBSE-en-ru')
+ encoder = AutoModel.from_pretrained('cointegrated/LaBSE-en-ru')
+
+ dec_tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-base-labse-decoder')
+ decoder = AutoModelForSeq2SeqLM.from_pretrained('cointegrated/rut5-base-labse-decoder')
+
+ def encode(texts):
+     encoded_input = enc_tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
+     with torch.no_grad():
+         model_output = encoder(**encoded_input.to(encoder.device))
+         embeddings = model_output.pooler_output
+         embeddings = torch.nn.functional.normalize(embeddings)
+     return embeddings
+
+ # encode some texts into vectors
+ embeddings = encode([
+     "4 декабря 2000 года",
+     "Давно такого не читала, очень хорошо пишешь!",
+     "Я тогда не понимала, что происходит, не понимаю и сейчас.",
+ ])
+ print(embeddings.shape)
+ # torch.Size([3, 768])
+
+ # now try to recover the texts from the vectors
+ out = decoder.generate(
+     encoder_outputs=BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1)),
+     max_length=256,
+     repetition_penalty=3.0,
+ )
+ for tokens in out:
+     print(dec_tokenizer.decode(tokens, skip_special_tokens=True))
+ # После 2 декабря 2000 года
+ # Не так давно ты это читала, нехорошо!
+ # Я не понимала, что происходит сейчас и тогда.
+ ```
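
The README's sample outputs show that reconstruction is only approximate. One rough way to quantify how close a reconstruction is would be to pass the decoded text back through `encode` and compare the result with the original vector by cosine similarity. This is an assumption for illustration, not something the README describes. Because `encode` L2-normalizes its output, the comparison reduces to a dot product. The sketch below shows the arithmetic with plain Python lists and made-up three-dimensional vectors standing in for the 768-dimensional embeddings; `l2_normalize` and `cosine_similarity` are hypothetical helper names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Made-up toy vectors, NOT real LaBSE embeddings.
original = [0.2, -0.5, 0.8]       # stands in for the embedding of the source text
round_trip = [0.25, -0.4, 0.75]   # stands in for the embedding of the decoded text

score = cosine_similarity(original, round_trip)
print(round(score, 3))
```

A score near 1 would suggest the decoded sentence lands close to the original in embedding space, though that does not guarantee the wording or meaning is preserved.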