---
language: ["ru"]
tags:
- russian
license: mit
---

This is the [rut5-base](https://huggingface.co/cointegrated/rut5-base) model, with the decoder fine-tuned to recover (approximately) Russian sentences from their [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) embeddings. Details are described [here](https://habr.com/ru/post/677618/) (in Russian).

It can be used, for example, for:

- Paraphrasing Russian sentences;
- Translating from any of the 109 LaBSE languages into Russian;
- Summarizing a collection of sentences with a single sentence;
- Interpolating between sentences;
- Few-shot text style transfer (including cross-lingual).

Minimal sketches of these use cases follow the example code below.

Example code:
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoModel
from transformers.modeling_outputs import BaseModelOutput  # lets us pass precomputed embeddings to generate()

# LaBSE-en-ru is a truncated bilingual version of LaBSE;
# to embed the other LaBSE languages, use sentence-transformers/LaBSE instead
enc_tokenizer = AutoTokenizer.from_pretrained('cointegrated/LaBSE-en-ru')
encoder = AutoModel.from_pretrained('cointegrated/LaBSE-en-ru')

# the decoder fine-tuned to reconstruct (approximate) Russian text from LaBSE embeddings
dec_tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-base-labse-decoder')
decoder = AutoModelForSeq2SeqLM.from_pretrained('cointegrated/rut5-base-labse-decoder')


def encode(texts):
    encoded_input = enc_tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        model_output = encoder(**encoded_input.to(encoder.device))
    # LaBSE uses the L2-normalized pooler output (the [CLS] token) as the sentence embedding
    embeddings = model_output.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings


# encode some texts into vectors
embeddings = encode([
    "4 декабря 2000 года",  # "December 4, 2000"
    "Давно такого не читала, очень хорошо пишешь!",  # "Haven't read anything like this in a long time, you write very well!"
    "Я тогда не понимала, что происходит, не понимаю и сейчас.",  # "I didn't understand what was happening then, and I still don't."
    "London is the capital of Great Britain.",
])
print(embeddings.shape)
# torch.Size([4, 768])

# now try to recover the texts from the vectors
out = decoder.generate(
    # each embedding is passed to the decoder as a single-token "encoder output" of shape [batch, 1, 768]
    encoder_outputs=BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1)),
    max_length=256,
    repetition_penalty=3.0,
)
for tokens in out:
    print(dec_tokenizer.decode(tokens, skip_special_tokens=True))
# После 4 декабря 2000 года  ("After December 4, 2000")
# Не так давно, это многое читала!  (a loose reconstruction of the second input)
# Я не понимала того, что происходит сейчас тогда, дальше.  (a loose reconstruction of the third input)
# Британская столица Англии.  (an imperfect Russian rendering of the last input)
```
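
The sketches below reuse `encode`, `decoder`, `dec_tokenizer`, and `BaseModelOutput` from the example above; all input sentences and generation settings are illustrative choices, not taken from the original post. First, paraphrasing: decoding the same embedding several times with sampling enabled yields different surface forms of (roughly) the same meaning.

```python
# paraphrase by decoding one embedding several times with sampling
emb = encode(["Каждый охотник желает знать, где сидит фазан."])  # "Every hunter wants to know where the pheasant sits."
for _ in range(3):
    out = decoder.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=emb.unsqueeze(1)),
        do_sample=True,  # sample instead of greedy decoding to get varied outputs
        top_p=0.9,
        max_length=128,
    )
    print(dec_tokenizer.decode(out[0], skip_special_tokens=True))
```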
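
For the summarization use case, one simple approach is to decode the mean of several sentence embeddings, re-normalizing the average the same way `encode` normalizes individual vectors. This is a sketch, not a tuned recipe.

```python
# "summarize" a group of related sentences by decoding their mean embedding
emb = encode([
    "Утром шёл дождь.",              # "It rained in the morning."
    "Днём дождь не прекращался.",    # "The rain did not stop in the afternoon."
    "Вечером снова обещают дождь.",  # "More rain is forecast for the evening."
])
mean_emb = torch.nn.functional.normalize(emb.mean(dim=0, keepdim=True))  # re-normalize the average
out = decoder.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=mean_emb.unsqueeze(1)),
    max_length=256,
    repetition_penalty=3.0,
)
print(dec_tokenizer.decode(out[0], skip_special_tokens=True))
```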
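
Interpolation between sentences works the same way: decode points on the line between two embeddings, re-normalizing each mixture. The interpolation weights below are arbitrary.

```python
# interpolate between two sentences in embedding space
emb = encode([
    "Мне очень нравится твой стиль.",     # "I really like your style."
    "Я совершенно с тобой не согласен.",  # "I completely disagree with you."
])
for w in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mix = torch.nn.functional.normalize((1 - w) * emb[:1] + w * emb[1:])  # keep a [1, 768] shape
    out = decoder.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=mix.unsqueeze(1)),
        max_length=256,
        repetition_penalty=3.0,
    )
    print(w, dec_tokenizer.decode(out[0], skip_special_tokens=True))
```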
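
Finally, few-shot style transfer can be sketched as vector arithmetic: estimate a style direction as the mean difference between a few pairs of same-meaning sentences in two styles, add it to a new sentence's embedding, and decode. The pairs below are a toy illustration, and the output is not guaranteed to be fluent.

```python
# few-shot style transfer: shift an embedding along an estimated style direction
formal = encode(["Здравствуйте!", "До свидания!"])  # "Hello!", "Goodbye!" (formal)
informal = encode(["Привет!", "Пока!"])             # "Hi!", "Bye!" (informal)
style_direction = (informal - formal).mean(dim=0, keepdim=True)

emb = encode(["Здравствуйте, как ваши дела?"])  # "Hello, how are you?" (formal)
shifted = torch.nn.functional.normalize(emb + style_direction)
out = decoder.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=shifted.unsqueeze(1)),
    max_length=256,
    repetition_penalty=3.0,
)
print(dec_tokenizer.decode(out[0], skip_special_tokens=True))
```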