---
language: ["ru"]
tags:
- russian
license: mit
---

This is the [rut5-base](https://huggingface.co/cointegrated/rut5-base) model, with the decoder fine-tuned to recover (approximately) Russian sentences from their [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) embeddings. Details are [here](https://habr.com/ru/post/677618/) (in Russian).

It can be used, for example, for:
- Paraphrasing Russian sentences;
- Translating from the 109 LaBSE languages to Russian;
- Summarizing a collection of sentences with a single sentence;
- Interpolating between sentences;
- Few-shot text style transfer (including cross-lingual).

Example code:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoModel
from transformers.modeling_outputs import BaseModelOutput

enc_tokenizer = AutoTokenizer.from_pretrained('cointegrated/LaBSE-en-ru')
encoder = AutoModel.from_pretrained('cointegrated/LaBSE-en-ru')

dec_tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-base-labse-decoder')
decoder = AutoModelForSeq2SeqLM.from_pretrained('cointegrated/rut5-base-labse-decoder')


def encode(texts):
    encoded_input = enc_tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        model_output = encoder(**encoded_input.to(encoder.device))
        embeddings = model_output.pooler_output
        embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings


# encode some texts into vectors
embeddings = encode([
    "4 декабря 2000 года",
    "Давно такого не читала, очень хорошо пишешь!",
    "Я тогда не понимала, что происходит, не понимаю и сейчас.",
    "London is the capital of Great Britain.",
])
print(embeddings.shape)
# torch.Size([4, 768])

# now try to recover the texts from the vectors
out = decoder.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1)),
    max_length=256,
    repetition_penalty=3.0,
)
for tokens in out:
    print(dec_tokenizer.decode(tokens, skip_special_tokens=True))
# После 4 декабря 2000 года
# Не так давно, это многое читала!
# Я не понимала того, что происходит сейчас тогда, дальше.
# Британская столица Англии.
```
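
The summarization and interpolation uses listed above both come down to building a new vector in the embedding space and decoding it. The sketch below shows one plausible way to build such vectors. It is an assumption rather than the method from the linked article: it averages embeddings for the "summary" vector, mixes two embeddings linearly for interpolation, and renormalizes the result so it has unit length like the output of `encode`. The helper names `mean_embedding` and `interpolate` are hypothetical, and random vectors stand in for real LaBSE embeddings so the snippet runs without downloading any model.

```python
import torch
import torch.nn.functional as F


def mean_embedding(embeddings: torch.Tensor) -> torch.Tensor:
    """Average a [n, dim] batch of embeddings into one unit-length vector of shape [1, dim]."""
    return F.normalize(embeddings.mean(dim=0, keepdim=True), dim=-1)


def interpolate(a: torch.Tensor, b: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Return `steps` unit-length vectors on the straight line from a to b, shape [steps, dim]."""
    weights = torch.linspace(0.0, 1.0, steps).unsqueeze(1)  # [steps, 1]
    mixed = (1.0 - weights) * a.unsqueeze(0) + weights * b.unsqueeze(0)
    # Renormalize, because a linear mix of unit vectors is generally shorter than 1.
    return F.normalize(mixed, dim=-1)


# Random unit vectors used as stand-ins for the output of `encode` (assumption for the demo).
torch.manual_seed(0)
fake = F.normalize(torch.randn(4, 768), dim=-1)

center = mean_embedding(fake)                  # one vector for the whole collection
path = interpolate(fake[0], fake[1], steps=5)  # 5 points from the first vector to the second
print(center.shape, path.shape)
```

With the real models loaded as in the example code, either result can presumably be decoded the same way as `embeddings` there, for instance by passing `BaseModelOutput(last_hidden_state=path.unsqueeze(1))` as `encoder_outputs` to `decoder.generate`. How fluent the decoded sentences are for such synthetic vectors is not something this sketch can guarantee.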