---
language: ["ru"]
tags:
- russian
license: mit
---

This is the [rut5-base](https://huggingface.co/cointegrated/rut5-base) model, with the decoder fine-tuned to recover (approximately) Russian sentences from their [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) embeddings. Details are described [here](https://habr.com/ru/post/677618/) (in Russian).

It can be used, for example, for:

- Paraphrasing Russian sentences;
- Translating from any of the 109 LaBSE languages into Russian;
- Summarizing a collection of sentences with a single sentence;
- Interpolating between sentences;
- Few-shot text style transfer (including cross-lingual).

Minimal sketches of these use cases follow the example code below.

Example code:
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoModel
from transformers.modeling_outputs import BaseModelOutput  # lets us pass precomputed embeddings to generate()

# LaBSE-en-ru is a truncated bilingual version of LaBSE;
# to embed the other LaBSE languages, use sentence-transformers/LaBSE instead
enc_tokenizer = AutoTokenizer.from_pretrained('cointegrated/LaBSE-en-ru')
encoder = AutoModel.from_pretrained('cointegrated/LaBSE-en-ru')

# the decoder fine-tuned to reconstruct (approximate) Russian text from LaBSE embeddings
dec_tokenizer = AutoTokenizer.from_pretrained('cointegrated/rut5-base-labse-decoder')
decoder = AutoModelForSeq2SeqLM.from_pretrained('cointegrated/rut5-base-labse-decoder')


def encode(texts):
    encoded_input = enc_tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        model_output = encoder(**encoded_input.to(encoder.device))
    # LaBSE uses the L2-normalized pooler output (the [CLS] token) as the sentence embedding
    embeddings = model_output.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings


# encode some texts into vectors
embeddings = encode([
    "4 декабря 2000 года",  # "December 4, 2000"
    "Давно такого не читала, очень хорошо пишешь!",  # "Haven't read anything like this in a long time, you write very well!"
    "Я тогда не понимала, что происходит, не понимаю и сейчас.",  # "I didn't understand what was happening then, and I still don't."
    "London is the capital of Great Britain.",
])
print(embeddings.shape)
# torch.Size([4, 768])

# now try to recover the texts from the vectors
out = decoder.generate(
    # each embedding is passed to the decoder as a single-token "encoder output" of shape [batch, 1, 768]
    encoder_outputs=BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1)),
    max_length=256,
    repetition_penalty=3.0,
)
for tokens in out:
    print(dec_tokenizer.decode(tokens, skip_special_tokens=True))
# После 4 декабря 2000 года  ("After December 4, 2000")
# Не так давно, это многое читала!  (a loose reconstruction of the second input)
# Я не понимала того, что происходит сейчас тогда, дальше.  (a loose reconstruction of the third input)
# Британская столица Англии.  (an imperfect Russian rendering of the last input)
```
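
The sketches below reuse `encode`, `decoder`, `dec_tokenizer`, and `BaseModelOutput` from the example above; all input sentences and generation settings are illustrative choices, not taken from the original post. First, paraphrasing: decoding the same embedding several times with sampling enabled yields different surface forms of (roughly) the same meaning.

```python
# paraphrase by decoding one embedding several times with sampling
emb = encode(["Каждый охотник желает знать, где сидит фазан."])  # "Every hunter wants to know where the pheasant sits."
for _ in range(3):
    out = decoder.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=emb.unsqueeze(1)),
        do_sample=True,  # sample instead of greedy decoding to get varied outputs
        top_p=0.9,
        max_length=128,
    )
    print(dec_tokenizer.decode(out[0], skip_special_tokens=True))
```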
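
For the summarization use case, one simple approach is to decode the mean of several sentence embeddings, re-normalizing the average the same way `encode` normalizes individual vectors. This is a sketch, not a tuned recipe.

```python
# "summarize" a group of related sentences by decoding their mean embedding
emb = encode([
    "Утром шёл дождь.",              # "It rained in the morning."
    "Днём дождь не прекращался.",    # "The rain did not stop in the afternoon."
    "Вечером снова обещают дождь.",  # "More rain is forecast for the evening."
])
mean_emb = torch.nn.functional.normalize(emb.mean(dim=0, keepdim=True))  # re-normalize the average
out = decoder.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=mean_emb.unsqueeze(1)),
    max_length=256,
    repetition_penalty=3.0,
)
print(dec_tokenizer.decode(out[0], skip_special_tokens=True))
```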
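
Interpolation between sentences works the same way: decode points on the line between two embeddings, re-normalizing each mixture. The interpolation weights below are arbitrary.

```python
# interpolate between two sentences in embedding space
emb = encode([
    "Мне очень нравится твой стиль.",     # "I really like your style."
    "Я совершенно с тобой не согласен.",  # "I completely disagree with you."
])
for w in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mix = torch.nn.functional.normalize((1 - w) * emb[:1] + w * emb[1:])  # keep a [1, 768] shape
    out = decoder.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=mix.unsqueeze(1)),
        max_length=256,
        repetition_penalty=3.0,
    )
    print(w, dec_tokenizer.decode(out[0], skip_special_tokens=True))
```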
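
Finally, few-shot style transfer can be sketched as vector arithmetic: estimate a style direction as the mean difference between a few pairs of same-meaning sentences in two styles, add it to a new sentence's embedding, and decode. The pairs below are a toy illustration, and the output is not guaranteed to be fluent.

```python
# few-shot style transfer: shift an embedding along an estimated style direction
formal = encode(["Здравствуйте!", "До свидания!"])  # "Hello!", "Goodbye!" (formal)
informal = encode(["Привет!", "Пока!"])             # "Hi!", "Bye!" (informal)
style_direction = (informal - formal).mean(dim=0, keepdim=True)

emb = encode(["Здравствуйте, как ваши дела?"])  # "Hello, how are you?" (formal)
shifted = torch.nn.functional.normalize(emb + style_direction)
out = decoder.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=shifted.unsqueeze(1)),
    max_length=256,
    repetition_penalty=3.0,
)
print(dec_tokenizer.decode(out[0], skip_special_tokens=True))
```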