|
--- |
|
language: |
|
- en |
|
- ru |
|
- zh |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- text2text-generation |
|
- t5 |
|
base_model: |
|
- utrobinmv/t5_translate_en_ru_zh_base_200 |
|
license: apache-2.0 |
|
pipeline_tag: sentence-similarity |
|
widget: |
|
- example_title: translate zh-ru |
|
text: > |
|
translate to ru: 开发的目的是为用户提供个人同步翻译。 |
|
- example_title: translate ru-en |
|
text: > |
|
translate to en: Цель разработки — предоставить пользователям личного синхронного переводчика. |
|
- example_title: translate en-ru |
|
text: > |
|
translate to ru: The purpose of the development is to provide users with a personal synchronized interpreter. |
|
- example_title: translate en-zh |
|
text: > |
|
translate to zh: The purpose of the development is to provide users with a personal synchronized interpreter. |
|
- example_title: translate zh-en |
|
text: > |
|
translate to en: 开发的目的是为用户提供个人同步解释器。 |
|
- example_title: translate ru-zh |
|
text: > |
|
translate to zh: Цель разработки — предоставить пользователям личного синхронного переводчика. |
|
--- |
|
|
|
# T5 English, Russian and Chinese sentence similarity model |
|
|
|
This is a [sentence-transformers](https://www.sbert.net/) model: it maps sentences and paragraphs to a 768-dimensional dense vector space. The model works well for sentence similarity tasks, but is less suited to semantic search.
|
|
|
The model can be used to search for parallel texts in Russian, English and Chinese. |
|
|
|
To compute sentence similarity, only the encoder of the underlying T5 model is used.
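
Below is a minimal sketch of what encoder-only inference looks like with plain `transformers`, bypassing the sentence-transformers wrapper. Mean pooling over the encoder outputs is assumed here, as is typical for sentence-transformers T5 checkpoints; the authoritative pooling is defined by this checkpoint's sentence-transformers configuration.

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent'
tokenizer = T5Tokenizer.from_pretrained(model_name)
encoder = T5EncoderModel.from_pretrained(model_name)  # loads the encoder only, no decoder

inputs = tokenizer(["Have some more of these soft French rolls."],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, 768)

# mean pooling that ignores padding tokens (assumed pooling strategy)
mask = inputs.attention_mask.unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```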
|
|
|
|
|
## Usage (Sentence-Transformers) |
|
|
|
Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:
|
|
|
```bash
|
pip install -U sentence-transformers |
|
``` |
|
|
|
|
|
|
|
Then you can use the model like this: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
|
model = SentenceTransformer('utrobinmv/t5_translate_en_ru_zh_base_200_sent') |
|
|
|
sentences_1 = ["The purpose of the development is to provide users with a personal simultaneous interpreter.", |
|
"Съешь ещё этих мягких французских булок.", |
|
"再吃这些法国的甜蜜的面包。"] |
|
|
|
sentences_2 = ["Цель разработки — предоставить пользователям личного синхронного переводчика.", |
|
"Have some more of these soft French rolls.", |
|
"开发的目的就是向用户提供个性化的同步翻译。"] |
|
|
|
# encode both lists in one batch, then split the embedding matrix
embeddings = model.encode(sentences_1 + sentences_2)
|
embeddings_1 = embeddings[:len(sentences_1)] |
|
embeddings_2 = embeddings[len(sentences_1):] |
|
|
|
# pairwise scores: rows follow sentences_1, columns follow sentences_2
similarity = embeddings_1 @ embeddings_2.T
|
print(similarity) |
|
#[[ 0.8956245 -0.0390042 0.8493222 ] |
|
# [ 0.00778637 0.85185283 -0.010229 ] |
|
# [ 0.01991986 0.72560245 0.02547248]] |
|
|
|
|
|
``` |
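
Building on the embeddings above, here is a sketch of the parallel-text search mentioned earlier: for each sentence in `sentences_1`, pick its nearest neighbour in `sentences_2` by cosine similarity (`util.cos_sim` is part of the sentence-transformers API):

```python
from sentence_transformers import util

# cosine similarity is scale-invariant, unlike the raw dot product above
cos_scores = util.cos_sim(embeddings_1, embeddings_2)  # shape (3, 3)

# greedy matching: best column for each row
for i, j in enumerate(cos_scores.argmax(dim=1)):
    print(f'{sentences_1[i]}  <->  {sentences_2[int(j)]}')
```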
|
|
|
|
|
|
|
Example: translating Russian to Chinese with the `transformers` API:
|
|
|
```python |
|
from transformers import T5ForConditionalGeneration, T5Tokenizer |
|
|
|
device = 'cuda'  # or 'cpu' to run on CPU
|
|
|
model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent' |
|
model = T5ForConditionalGeneration.from_pretrained(model_name) |
|
model.to(device) |
|
tokenizer = T5Tokenizer.from_pretrained(model_name) |
|
|
|
prefix = 'translate to zh: ' |
|
src_text = prefix + "Съешь ещё этих мягких французских булок." |
|
|
|
# translate Russian to Chinese |
|
inputs = tokenizer(src_text, return_tensors="pt")
|
|
|
generated_tokens = model.generate(**inputs.to(device))
|
|
|
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) |
|
print(result) |
|
# 再吃这些法国的甜蜜的面包。 |
|
``` |
|
|
|
|
|
|
|
And an example translating Chinese to Russian:
|
|
|
```python |
|
from transformers import T5ForConditionalGeneration, T5Tokenizer |
|
|
|
device = 'cuda'  # or 'cpu' to run on CPU
|
|
|
model_name = 'utrobinmv/t5_translate_en_ru_zh_base_200_sent' |
|
model = T5ForConditionalGeneration.from_pretrained(model_name) |
|
model.to(device) |
|
tokenizer = T5Tokenizer.from_pretrained(model_name) |
|
|
|
prefix = 'translate to ru: ' |
|
src_text = prefix + "再吃这些法国的甜蜜的面包。" |
|
|
|
# translate Chinese to Russian
|
inputs = tokenizer(src_text, return_tensors="pt")
|
|
|
generated_tokens = model.generate(**inputs.to(device))
|
|
|
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) |
|
print(result) |
|
# Съешьте этот сладкий хлеб из Франции. |
|
``` |
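
For several inputs at once, here is a short sketch of batched generation, reusing the `model`, `tokenizer`, and `device` set up above; padding lets `generate()` process mixed-language prefixed inputs in one call (`max_new_tokens=128` is an illustrative limit, not a tuned value):

```python
batch = ['translate to en: Съешь ещё этих мягких французских булок.',
         'translate to ru: 再吃这些法国的甜蜜的面包。']

# pad to a common length so the batch forms one tensor
inputs = tokenizer(batch, return_tensors='pt', padding=True).to(device)
generated_tokens = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```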
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Languages covered |
|
|
|
Russian (ru_RU), Chinese (zh_CN), English (en_US) |
|
|