---
license: cc-by-2.5
language:
- lt
- en
datasets:
- scoris/en-lt-merged-data
metrics:
- sacrebleu
---
# Overview
![Scoris logo](https://scoris.lt/logo_smaller.png)
This is an English-Lithuanian translation model (Seq2Seq). For Lithuanian-English translation, see the companion model [scoris-mt-lt-en](https://huggingface.co/scoris/scoris-mt-lt-en).

Original model: [Helsinki-NLP/opus-mt-tc-big-en-lt](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-lt)

Fine-tuned on a large merged dataset: [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs)

Trained for 6 epochs.

Made by the [Scoris](https://scoris.lt) team

# Evaluation
| EN-LT                             | BLEU |
|-----------------------------------|------|
| scoris/scoris-mt-en-lt            | 41.9 |
| Helsinki-NLP/opus-mt-tc-big-en-lt | 34.3 |
| Google Translate                  | 30.8 |
| DeepL                             | 32.3 |

_Evaluated on the scoris/en-lt-merged-data validation set. Google Translate and DeepL were evaluated on a random sample of 1,000 sentence pairs._

**According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:**

| BLEU Score | Interpretation |
|----------|---------|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| **40 - 50** | **High quality translations** |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |

# Usage
You can use the model in the following way:
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on Hugging Face Model Hub
model_name = "scoris/scoris-mt-en-lt"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Once upon a time there were three bears, who lived together in a house of their own in a wood.",
    "One of them was a little, small wee bear; one was a middle-sized bear, and the other was a great, huge bear.",
    "One day, after they had made porridge for their breakfast, they walked out into the wood while the porridge was cooling.",
    "And while they were walking, a little girl came into the house. "
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Result:
# Kažkada buvo trys lokiai, kurie gyveno kartu savame name miške.
# Vienas iš jų buvo mažas, mažas lokys; vienas buvo vidutinio dydžio lokys, o kitas buvo didelis, didžiulis lokys.
# Vieną dieną, pagaminę košės pusryčiams, jie išėjo į mišką, kol košė vėso.
# Jiems einant, į namus atėjo maža mergaitė.
```
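
The example above translates all sentences in a single `generate` call, which may use a lot of memory for longer inputs. A minimal sketch of translating in fixed-size batches is shown below. The helper names (`batched`, `translate_in_batches`, `make_marian_translator`) and the default `batch_size` of 8 are illustrative assumptions and are not part of the model or of `transformers`.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed default; tune to the available memory
) -> List[str]:
    """Translate `sentences` chunk by chunk, preserving the input order."""
    results: List[str] = []
    for chunk in batched(sentences, batch_size):
        results.extend(translate_batch(chunk))
    return results


def make_marian_translator(tokenizer, model) -> Callable[[List[str]], List[str]]:
    """Hypothetical helper: wrap an already loaded tokenizer/model pair."""
    def translate_batch(batch: List[str]) -> List[str]:
        # Pad to the longest sentence in the batch and truncate overlong inputs
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)
    return translate_batch
```

With the `tokenizer`, `model`, and `src_text` objects from the usage example, this could be called as `translate_in_batches(src_text, make_marian_translator(tokenizer, model), batch_size=2)`.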