scoris committed on
Commit
96cac29
1 Parent(s): 3be77ed

Update README.md

Files changed (1)
  1. README.md +64 -0
README.md CHANGED
@@ -1,3 +1,67 @@
  ---
  license: cc-by-2.5
+ language:
+ - lt
+ - en
+ datasets:
+ - scoris/en-lt-merged-data
  ---
+ # Overview
+ ![Scoris logo](https://scoris.lt/logo_smaller.png)
+ This is an English-Lithuanian translation model based on [Helsinki-NLP/opus-mt-tc-big-en-lt](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-lt).
+
+
+ Fine-tuned on a large merged dataset: [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs).
+
+ Trained for 3 epochs.
+
+ Made by the [Scoris](https://scoris.lt) team.
+
+ # Evaluation
+ Tested on the scoris/en-lt-merged-data validation set. Metric: sacrebleu
+
+ | Model | Test set | BLEU | Gen Len |
+ |----------|---------|-------|-------|
+ | scoris/opus-mt-tc-big-en-lt-scoris-finetuned | scoris/en-lt-merged-data (validation) | TBD | TBD |
+ | Helsinki-NLP/opus-mt-tc-big-en-lt | scoris/en-lt-merged-data (validation) | TBD | TBD |
+
+ According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:
+
+ | BLEU Score | Interpretation |
+ |----------|---------|
+ | < 10 | Almost useless |
+ | 10 - 19 | Hard to get the gist |
+ | 20 - 29 | The gist is clear, but has significant grammatical errors |
+ | 30 - 40 | Understandable to good translations |
+ | **40 - 50** | **High quality translations** |
+ | 50 - 60 | Very high quality, adequate, and fluent translations |
+ | > 60 | Quality often better than human |
+
+ # Usage
+ The model can be used as follows:
+ ```python
+ from transformers import MarianMTModel, MarianTokenizer
+
+ # Specify the model identifier on the Hugging Face Model Hub
+ model_name = "scoris/opus-mt-tc-big-en-lt-scoris-finetuned"
+
+ # Load the model and tokenizer from Hugging Face
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
+ model = MarianMTModel.from_pretrained(model_name)
+
+ src_text = [
+     "Once upon a time there was a dear little girl who was loved by everyone who looked at her, but most of all by her grandmother, and there was nothing that she would not have given to the child.",
+     "Once she gave her a little cap of red velvet, which suited her so well that she would never wear anything else; so she was always called 'Little Red-Cap.'",
+     "One day her mother said to her: ‘Come, Little Red-Cap, here is a piece of cake and a bottle of wine; take them to your grandmother, she is ill and weak, and they will do her good.",
+     "Set out before it gets hot, and when you are going, walk nicely and quietly and do not run off the path, or you may fall and break the bottle, and then your grandmother will get nothing."
+ ]
+
+ # Tokenize the text and generate translations
+ translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
+
+ # Print out the translations
+ for t in translated:
+     print(tokenizer.decode(t, skip_special_tokens=True))
+
+ # TBD
+ ```
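
The usage example in the README translates all sentences in a single `generate` call, which pads every input to the longest sentence. For longer inputs it may help to translate in smaller batches. The following is a minimal sketch of that idea, not part of the committed README: the helper names `batched`, `translate_in_batches`, and `make_marian_translator` are hypothetical, and the default batch size of 8 is an arbitrary illustrative value.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # arbitrary illustrative default, not a tuned value
) -> List[str]:
    """Translate `sentences` batch by batch, preserving the input order."""
    translations: List[str] = []
    for batch in batched(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return translations


def make_marian_translator(model_name: str) -> Callable[[List[str]], List[str]]:
    """Hypothetical helper wrapping the tokenizer/model calls from the usage example."""
    # Imported lazily so the batching helpers above work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate_batch(batch: List[str]) -> List[str]:
        # Pad within the batch only, then decode every generated sequence.
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        generated = model.generate(**inputs)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch
```

Assuming the model name from the README, this could be called as `translate_in_batches(src_text, make_marian_translator("scoris/opus-mt-tc-big-en-lt-scoris-finetuned"), batch_size=2)`. Passing the translator in as a callable keeps the batching logic independent of the model, so it can be exercised with a stand-in function.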