
This is the rut5-base model with its decoder fine-tuned to reconstruct Russian sentences (approximately) from their LaBSE embeddings. Details are here (in Russian).

It can be used, for example, for:

  • Paraphrasing Russian sentences;
  • Translating from the 109 LaBSE languages to Russian;
  • Summarizing a collection of sentences with a single sentence;
  • Interpolating between sentences;
  • Few-shot text style transfer (including cross-lingual).
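
One way the few-shot style transfer use case might work is by shifting a sentence embedding along the direction between the mean embeddings of a few source-style and target-style examples. This is an assumption for illustration, not a method confirmed by this card. The helper name `shift_style` is hypothetical, and random unit vectors stand in for real LaBSE embeddings.

```python
import torch
import torch.nn.functional as F


def shift_style(content, source_examples, target_examples):
    """Move content embeddings along the source-to-target style direction.

    Hypothetical helper: the vector arithmetic is an assumed approach,
    not one documented for this model.
    """
    # Style direction: difference between the mean embeddings of the two example sets
    direction = target_examples.mean(dim=0) - source_examples.mean(dim=0)
    # Re-normalize, because the decoder was trained on unit-length LaBSE embeddings
    return F.normalize(content + direction, dim=-1)


# Random unit vectors stand in for real 768-dimensional LaBSE embeddings
torch.manual_seed(0)
content = F.normalize(torch.randn(2, 768), dim=-1)
source_examples = F.normalize(torch.randn(5, 768), dim=-1)
target_examples = F.normalize(torch.randn(5, 768), dim=-1)

shifted = shift_style(content, source_examples, target_examples)
print(shifted.shape)
# torch.Size([2, 768])
```

With real embeddings, the shifted vectors would be decoded in the same way as in the example code. Because LaBSE embeddings are shared across languages, the style examples would not have to be in Russian.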

Example code:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoModel
from transformers.modeling_outputs import BaseModelOutput

enc_tokenizer = AutoTokenizer.from_pretrained('cointegrated/LaBSE-en-ru')
encoder = AutoModel.from_pretrained('cointegrated/LaBSE-en-ru')

dec_tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-base-labse-decoder')
decoder = AutoModelForSeq2SeqLM.from_pretrained('cointegrated/rut5-base-labse-decoder')

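def encode(texts):
    # tokenize a batch of texts, padding to the longest one and truncating at 512 tokens
    encoded_input = enc_tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        model_output = encoder(**encoded_input.to(encoder.device))
        # the sentence embedding is the pooler output, L2-normalized along the feature dimension
        embeddings = model_output.pooler_output
        embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings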
 
# encode some texts into vectors
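embeddings = encode([
    "4 декабря 2000 года",  # "December 4, 2000"
    "Давно такого не читала, очень хорошо пишешь!",  # "I haven't read anything like this in a long time, you write very well!"
    "Я тогда не понимала, что происходит, не понимаю и сейчас.",  # "I didn't understand what was happening then, and I don't understand it now."
    "London is the capital of Great Britain.",
])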
print(embeddings.shape)
# torch.Size([4, 768])

# now try to recover the texts from the vectors
# each embedding is passed to the decoder as an encoder output of sequence length 1
out = decoder.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1)),
    max_length=256,
    repetition_penalty=3.0,
)
for tokens in out:
    print(dec_tokenizer.decode(tokens, skip_special_tokens=True))
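# После 4 декабря 2000 года  (roughly: "After December 4, 2000")
# Не так давно, это многое читала!  (roughly: "Not so long ago, I read a lot of this!")
# Я не понимала того, что происходит сейчас тогда, дальше.  (roughly: "I didn't understand what is happening now then, further.")
# Британская столица Англии.  (roughly: "The British capital of England.")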
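
The summarizing and interpolating use cases both come down to simple arithmetic on embeddings before decoding. A minimal sketch follows, assuming that averaging and linear interpolation followed by re-normalization are reasonable ways to combine LaBSE vectors. The helper names `summarize_embeddings` and `interpolate_embeddings` are hypothetical, and random unit vectors stand in for the output of `encode`.

```python
import torch
import torch.nn.functional as F


def summarize_embeddings(embeddings):
    """Collapse a (n, dim) batch of embeddings into one unit vector of shape (1, dim)."""
    mean = embeddings.mean(dim=0, keepdim=True)
    return F.normalize(mean, dim=-1)


def interpolate_embeddings(start, end, steps=5):
    """Return `steps` unit vectors on the straight line from `start` to `end`."""
    weights = torch.linspace(0.0, 1.0, steps).unsqueeze(1)  # shape (steps, 1)
    mixed = (1.0 - weights) * start + weights * end  # broadcasts to (steps, dim)
    return F.normalize(mixed, dim=-1)


# Random unit vectors stand in for real LaBSE embeddings returned by encode()
torch.manual_seed(0)
demo = F.normalize(torch.randn(4, 768), dim=-1)

summary = summarize_embeddings(demo)
path = interpolate_embeddings(demo[0], demo[1], steps=5)
print(summary.shape, path.shape)
# torch.Size([1, 768]) torch.Size([5, 768])
```

Either result could then be decoded as in the example above, for instance by passing `BaseModelOutput(last_hidden_state=path.unsqueeze(1))` as `encoder_outputs` to `decoder.generate`.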