Update README.md
README.md
---
license: cc-by-2.5
language:
- lt
- en
datasets:
- scoris/en-lt-merged-data
---
# Overview
![Scoris logo](https://scoris.lt/logo_smaller.png)

This is an English-Lithuanian translation model based on [Helsinki-NLP/opus-mt-tc-big-en-lt](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-lt).

Fine-tuned on a large merged dataset: [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs).

Trained for 3 epochs.

Made by the [Scoris](https://scoris.lt) team.

# Evaluation
Tested on the scoris/en-lt-merged-data validation set. Metric: sacrebleu

| model | testset | BLEU | Gen Len |
|----------|---------|-------|-------|
| scoris/opus-mt-tc-big-en-lt-scoris-finetuned | scoris/en-lt-merged-data (validation) | TBD | TBD |
| Helsinki-NLP/opus-mt-tc-big-en-lt | scoris/en-lt-merged-data (validation) | TBD | TBD |

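The two metric columns could be computed roughly as follows. This is a minimal sketch, not the evaluation script used for this model: the helper names `corpus_bleu_score` and `gen_len` are hypothetical, and it assumes that "Gen Len" means the average number of non-padding tokens per generated sequence, as in the Hugging Face translation example scripts.

```python
from statistics import mean


def corpus_bleu_score(hypotheses, references):
    """Corpus-level BLEU (0-100 scale) using the third-party sacrebleu package."""
    # Imported inside the function so gen_len() below works without sacrebleu.
    import sacrebleu

    # sacrebleu takes a list of reference streams; a single stream is used here.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score


def gen_len(token_id_rows, pad_token_id):
    """Mean number of non-padding tokens per generated sequence (assumed definition)."""
    return mean(
        sum(1 for tok in row if tok != pad_token_id) for row in token_id_rows
    )
```

Here `hypotheses` and `references` are parallel lists of plain strings, and `token_id_rows` is the output of `model.generate(...)` converted to lists of token ids.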
According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:

| BLEU Score | Interpretation |
|----------|---------|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| **40 - 50** | **High quality translations** |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |

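To attach one of these labels to a measured score programmatically, a small lookup is enough. The sketch below is illustrative only: `interpret_bleu` is a hypothetical helper, and because the table's bands share edges (40 and 50) and leave gaps between integers, it assumes a score on a shared edge or in a gap belongs to the band whose lower bound it has reached.

```python
# Lower bound of each band from the table, highest first.
# Assumption: a score equal to a shared edge (40, 50) falls in the higher band.
BLEU_BANDS = [
    (50, "Very high quality, adequate, and fluent translations"),
    (40, "High quality translations"),
    (30, "Understandable to good translations"),
    (20, "The gist is clear, but has significant grammatical errors"),
    (10, "Hard to get the gist"),
]


def interpret_bleu(score):
    """Map a BLEU score on the 0-100 scale to the interpretation label."""
    if score > 60:  # the top band is strictly greater than 60 in the table
        return "Quality often better than human"
    for lower, label in BLEU_BANDS:
        if score >= lower:
            return label
    return "Almost useless"
```

For example, `interpret_bleu(45.3)` returns `"High quality translations"`.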
# Usage
You can use the model in the following way:
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on Hugging Face Model Hub
model_name = "scoris/opus-mt-tc-big-en-lt-scoris-finetuned"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Once upon a time there was a dear little girl who was loved by everyone who looked at her, but most of all by her grandmother, and there was nothing that she would not have given to the child.",
    "Once she gave her a little cap of red velvet, which suited her so well that she would never wear anything else; so she was always called 'Little Red-Cap.'",
    "One day her mother said to her: 'Come, Little Red-Cap, here is a piece of cake and a bottle of wine; take them to your grandmother, she is ill and weak, and they will do her good.",
    "Set out before it gets hot, and when you are going, walk nicely and quietly and do not run off the path, or you may fall and break the bottle, and then your grandmother will get nothing."
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# TBD
```
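For longer inputs, passing every sentence to `generate` in a single call may use more memory than is available. One possible approach, sketched below, is to translate in fixed-size batches. The helpers `translate_in_batches` and `make_translate_fn` and the default `batch_size=8` are illustrative assumptions, not part of the model or of the `transformers` API.

```python
def translate_in_batches(texts, translate_fn, batch_size=8):
    """Apply translate_fn to successive slices of texts and join the results."""
    results = []
    for start in range(0, len(texts), batch_size):
        results.extend(translate_fn(texts[start:start + batch_size]))
    return results


def make_translate_fn(tokenizer, model):
    """Wrap a loaded tokenizer/model pair as a function from list[str] to list[str]."""
    def translate(batch):
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return translate
```

With the `tokenizer`, `model`, and `src_text` from the example above, this could be called as `translate_in_batches(src_text, make_translate_fn(tokenizer, model), batch_size=2)`.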