---
language:
  - en
  - ru
  - zh
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - text2text-generation
  - t5
base_model:
  - utrobinmv/t5_translate_en_ru_zh_base_200
license: apache-2.0
pipeline_tag: sentence-similarity
widget:
  - example_title: translate zh-ru
    text: |
      translate to ru: 开发的目的是为用户提供个人同步翻译。
  - example_title: translate ru-en
    text: >
      translate to en: Цель разработки — предоставить пользователям личного
      синхронного переводчика.
  - example_title: translate en-ru
    text: >
      translate to ru: The purpose of the development is to provide users with a
      personal synchronized interpreter.
  - example_title: translate en-zh
    text: >
      translate to zh: The purpose of the development is to provide users with a
      personal synchronized interpreter.
  - example_title: translate zh-en
    text: |
      translate to en: 开发的目的是为用户提供个人同步解释器。
  - example_title: translate ru-zh
    text: >
      translate to zh: Цель разработки — предоставить пользователям личного
      синхронного переводчика.
---

# T5 English, Russian and Chinese sentence similarity model

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space. The model works well for sentence similarity tasks, but does not perform as well on semantic search tasks.

The model can be used to search for parallel texts in Russian, English and Chinese.

To compute sentence similarity, only the encoder of the underlying T5 model is used.
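The sentence-transformers usage below handles pooling automatically. For illustration, a rough equivalent with plain transformers might look like the following sketch; the mean-pooling step is an assumption (the actual pooling configuration ships with the sentence-transformers model files):

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent'
tokenizer = T5Tokenizer.from_pretrained(model_name)
encoder = T5EncoderModel.from_pretrained(model_name)  # loads only the encoder weights

inputs = tokenizer(["Have some more of these soft French rolls."],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (batch, seq_len, 768)

# mean pooling over non-padding tokens (assumed pooling strategy)
mask = inputs.attention_mask.unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```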

## Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('utrobinmv/t5_translate_en_ru_zh_base_200_sent')

sentences_1 = ["The purpose of the development is to provide users with a personal simultaneous interpreter.",
               "Съешь ещё этих мягких французских булок.",
               "再吃这些法国的甜蜜的面包。"]

sentences_2 = ["Цель разработки — предоставить пользователям личного синхронного переводчика.",
               "Have some more of these soft French rolls.",
               "开发的目的就是向用户提供个性化的同步翻译。"]

# encode both batches in a single pass, then split the result
embeddings = model.encode(sentences_1 + sentences_2)
embeddings_1 = embeddings[:len(sentences_1)]
embeddings_2 = embeddings[len(sentences_1):]

# dot-product similarity between every cross-lingual pair
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[ 0.8956245  -0.0390042   0.8493222 ]
#  [ 0.00778637  0.85185283 -0.010229  ]
#  [ 0.01991986  0.72560245  0.02547248]]
```
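Since the matrix above scores every cross-lingual pair, the parallel-text search mentioned earlier reduces to taking, for each sentence in the first batch, its best-scoring counterpart in the second. A minimal sketch continuing from the snippet above (the 0.7 acceptance threshold is an arbitrary assumption):

```python
# greedy matching: best counterpart in sentences_2 for each sentence in sentences_1
best = similarity.argmax(axis=1)
for i, j in enumerate(best):
    if similarity[i, j] > 0.7:  # arbitrary acceptance threshold (assumption)
        print(f"{sentences_1[i]} <-> {sentences_2[j]} ({similarity[i, j]:.3f})")
```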

## Example: translate Russian to Chinese

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = 'cuda'  # or 'cpu' to translate on CPU

model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent'
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# the task prefix selects the target language
prefix = 'translate to zh: '
src_text = prefix + "Съешь ещё этих мягких французских булок."

# translate Russian to Chinese
inputs = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(**inputs.to(device))

result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# 再吃这些法国的甜蜜的面包。
```
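`generate` also accepts batched input. A minimal sketch reusing the model and tokenizer loaded above; `padding=True` is required so sentences of different lengths can share one tensor:

```python
batch = ['translate to zh: ' + s for s in [
    "Съешь ещё этих мягких французских булок.",
    "Цель разработки — предоставить пользователям личного синхронного переводчика.",
]]

# pad to a common length so the batch fits in one tensor
inputs = tokenizer(batch, return_tensors="pt", padding=True)
generated = model.generate(**inputs.to(device))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```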

## Example: translate Chinese to Russian

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = 'cuda'  # or 'cpu' to translate on CPU

model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent'
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# the task prefix selects the target language
prefix = 'translate to ru: '
src_text = prefix + "再吃这些法国的甜蜜的面包。"

# translate Chinese to Russian
inputs = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(**inputs.to(device))

result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# Съешьте этот сладкий хлеб из Франции.
```
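Note that one loaded model covers all translation directions; only the task prefix changes, as the widget examples above suggest. A small sketch reusing the model and tokenizer from the previous example, translating the same English sentence into both other languages:

```python
src = "The purpose of the development is to provide users with a personal synchronized interpreter."
for prefix in ('translate to ru: ', 'translate to zh: '):
    inputs = tokenizer(prefix + src, return_tensors="pt")
    generated = model.generate(**inputs.to(device))
    print(prefix, tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```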

## Languages covered

Russian (ru_RU), Chinese (zh_CN), English (en_US)