---
license: cc-by-2.5
language:
- lt
- en
datasets:
- scoris/en-lt-merged-data
metrics:
- sacrebleu
---
# Overview
![Scoris logo](https://scoris.lt/logo_smaller.png)

This is an English-Lithuanian translation model (Seq2Seq). For Lithuanian-English translation, see the companion model [scoris-mt-lt-en](https://huggingface.co/scoris/scoris-mt-lt-en).

Original model: [Helsinki-NLP/opus-mt-tc-big-en-lt](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-lt)

Fine-tuned on a large merged dataset: [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs)

Trained for 6 epochs.

Made by the [Scoris](https://scoris.lt) team

# Evaluation:
| EN-LT                             | BLEU |
|-----------------------------------|------|
| scoris/scoris-mt-en-lt            | 41.9 |
| Helsinki-NLP/opus-mt-tc-big-en-lt | 34.3 |
| Google Translate                  | 30.8 |
| DeepL                             | 32.3 |

_Evaluated on the scoris/en-lt-merged-data validation set. Google Translate and DeepL were evaluated on a random sample of 1000 sentence pairs._

**According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:**

| BLEU Score | Interpretation |
|----------|---------|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| **40 - 50** | **High quality translations** |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |

# Usage
You can use the model in the following way:
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on Hugging Face Model Hub
model_name = "scoris/scoris-mt-en-lt"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Once upon a time there were three bears, who lived together in a house of their own in a wood.",
    "One of them was a little, small wee bear; one was a middle-sized bear, and the other was a great, huge bear.",
    "One day, after they had made porridge for their breakfast, they walked out into the wood while the porridge was cooling.",
    "And while they were walking, a little girl came into the house. "
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Result:
# Kažkada buvo trys lokiai, kurie gyveno kartu savame name miške.
# Vienas iš jų buvo mažas, mažas lokys; vienas buvo vidutinio dydžio lokys, o kitas buvo didelis, didžiulis lokys.
# Vieną dieną, pagaminę košės pusryčiams, jie išėjo į mišką, kol košė vėso.
# Jiems einant, į namus atėjo maža mergaitė.
```
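
The snippet above translates a handful of sentences in one call. For longer inputs it may help to split the text into fixed-size batches so that a single `generate` call does not have to hold everything in memory. The following is a minimal sketch of that idea, not part of the model's official API: the helper names `chunked`, `translate_in_batches` and `make_marian_translate_fn` are hypothetical, the default batch size of 16 is an arbitrary choice, and `truncation=True` is an added assumption to guard against over-long inputs.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 16,  # arbitrary default, tune for your hardware
) -> List[str]:
    """Translate `sentences` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in chunked(sentences, batch_size):
        results.extend(translate_fn(batch))
    return results


def make_marian_translate_fn(model_name: str = "scoris/scoris-mt-en-lt"):
    """Build a translate_fn backed by the model (hypothetical helper).

    transformers is imported lazily so the batching helpers above can be
    used and tested without loading the model.
    """
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_fn(batch: List[str]) -> List[str]:
        # Assumption: truncate inputs that exceed the model's maximum length.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_fn
```

With these helpers, `translate_in_batches(src_text, make_marian_translate_fn(), batch_size=8)` would return the translations as a list in the same order as the input sentences.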