---
license: cc-by-2.5
language:
- lt
- en
datasets:
- scoris/en-lt-merged-data
metrics:
- sacrebleu
---
# Overview

This is an English-Lithuanian translation model based on Helsinki-NLP/opus-mt-tc-big-en-lt.
For Lithuanian-English translation, see the companion model scoris/opus-mt-tc-big-lt-en-scoris-finetuned.

The model was fine-tuned for 6 epochs on a large merged dataset, scoris/en-lt-merged-data (5.4 million sentence pairs); a sketch of a comparable training setup is shown below.

Made by the Scoris team.
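The exact training script is not part of this card. Purely as an illustration, here is a minimal sketch of how a comparable fine-tuning run could be set up with the transformers Seq2SeqTrainer API. The dataset column names (`en`/`lt`) and the hyperparameters (batch size, learning rate, sequence length) are assumptions, not the settings used for this model; only the base model, the dataset, and the 6-epoch count come from this card.

```python
# Hypothetical fine-tuning sketch; hyperparameters are illustrative, not the ones used here.
from datasets import load_dataset
from transformers import (
    MarianMTModel, MarianTokenizer,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

base_model = "Helsinki-NLP/opus-mt-tc-big-en-lt"  # base model being fine-tuned
tokenizer = MarianTokenizer.from_pretrained(base_model)
model = MarianMTModel.from_pretrained(base_model)

dataset = load_dataset("scoris/en-lt-merged-data")

def preprocess(batch):
    # Assumes plain "en"/"lt" text columns; the actual column layout may differ.
    inputs = tokenizer(batch["en"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["lt"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-tc-big-en-lt-finetuned",
    num_train_epochs=6,               # matches the 6 epochs stated above
    per_device_train_batch_size=32,   # assumption
    learning_rate=2e-5,               # assumption
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```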
# Evaluation

Tested on the scoris/en-lt-merged-data validation set. Metric: sacrebleu.

| model | testset | BLEU | Gen Len |
|---|---|---|---|
| scoris/opus-mt-tc-big-en-lt-scoris-finetuned | scoris/en-lt-merged-data (validation) | 41.8841 | 17.4785 |
| Helsinki-NLP/opus-mt-tc-big-en-lt | scoris/en-lt-merged-data (validation) | 34.2768 | 17.6664 |
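The scores above were computed with sacrebleu. The exact evaluation script is not published here, but as a minimal sketch, the same metric can be computed with the Hugging Face evaluate library (the strings below are placeholder examples, not the actual validation data):

```python
# Minimal sketch: scoring generated translations with sacrebleu via evaluate.
import evaluate

sacrebleu = evaluate.load("sacrebleu")

predictions = ["Jiems einant, į namus atėjo maža mergaitė."]   # model outputs
references = [["Jiems einant, į namus atėjo maža mergaitė."]]  # one list of references per prediction
result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 4))  # corpus-level BLEU score
```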
According to Google, BLEU scores can be interpreted as follows:

| BLEU Score | Interpretation |
|---|---|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| 40 - 50 | High quality translations |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |
# Usage

You can use the model as follows:
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on the Hugging Face Model Hub
model_name = "scoris/opus-mt-tc-big-en-lt-scoris-finetuned"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Once upon a time there were three bears, who lived together in a house of their own in a wood.",
    "One of them was a little, small wee bear; one was a middle-sized bear, and the other was a great, huge bear.",
    "One day, after they had made porridge for their breakfast, they walked out into the wood while the porridge was cooling.",
    "And while they were walking, a little girl came into the house."
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Result:
# Kažkada buvo trys lokiai, kurie gyveno kartu savame name miške.
# Vienas iš jų buvo mažas, mažas lokys; vienas buvo vidutinio dydžio lokys, o kitas buvo didelis, didžiulis lokys.
# Vieną dieną, pagaminę košės pusryčiams, jie išėjo į mišką, kol košė vėso.
# Jiems einant, į namus atėjo maža mergaitė.
```
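The same checkpoint should also work through the high-level translation pipeline, for example:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint through the translation pipeline.
translator = pipeline("translation", model="scoris/opus-mt-tc-big-en-lt-scoris-finetuned")
print(translator("Once upon a time there were three bears.")[0]["translation_text"])
```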