CmdCody committed on
Commit
4761e1d
1 Parent(s): 4a6f9f7

Update Readme for version 2.0 of the model

Files changed (1)
  1. README.md +17 -8
README.md CHANGED
@@ -13,15 +13,14 @@ This is an [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-60
13
  the Northern Frisian dialect Mooring following [this great blogpost](https://cointegrated.medium.com/a37fc706b865).
14
 
15
  ## Data
16
- The dataset for finetuning consisted of 5597 sentence pairs of the Ååstermooring dialect of North Frisian with German translation,
17
- with 500 random pairs being retained for validation.
18
- Most examples (roughly 4200) were taken directly from
19
  ["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf)
20
  published by the Nordfriisk Instituut. For sentence splitting the python
21
- [sentence-splitting library](https://pypi.org/project/sentence-splitter/) was used, however the splitting wasn't perfect,
22
- especially in cases of direct speech, so that after manual re-alignment still many of these pairs consisted in fact of multiple sentences.
23
- A further roughly 1200 examples were taken from the Frasch Uurdebök, Friesisches Wörterbuch, Neumünster 1988.
24
- Finally, a little over 100 very simple self-written examples were added.
25
 
26
  ## Usage
27
  How to use the model:
@@ -76,4 +75,14 @@ tokenizer = create_tokenizer_with_new_lang(path, 'frr_Latn')
76
  model = AutoModelForSeq2SeqLM.from_pretrained(path)
77
 
78
  translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
79
- ```
 
 
 
 
 
 
 
 
 
 
 
13
  the Northern Frisian dialect Mooring following [this great blogpost](https://cointegrated.medium.com/a37fc706b865).
14
 
15
  ## Data
16
+ The dataset for finetuning consisted of 7194 sentence pairs of the Ååstermooring dialect of North Frisian with German translation.
17
+ Most examples (roughly 5100) were taken directly from
 
18
  ["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf)
19
  published by the Nordfriisk Instituut. For sentence splitting the python
20
+ [sentence-splitting library](https://pypi.org/project/sentence-splitter/) was used. The splitting wasn't perfect,
21
+ especially in cases of direct speech, so that manual re-alignment and further splitting were necessary.
22
+ Roughly 2000 further examples were taken from the Frasch Uurdebök, Friesisches Wörterbuch, Neumünster 1988.
23
+ Finally, a little under 180 very simple self-written examples were used as the evaluation data set.
24
 
25
  ## Usage
26
  How to use the model:
 
75
  model = AutoModelForSeq2SeqLM.from_pretrained(path)
76
 
77
  translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
78
+ ```
79
+
80
+ ## Training
81
+ The model was trained in a Google Colab notebook for 5000 steps with a batch size of 16, following the blog post mentioned above.
82
+
83
+ Metrics on the evaluation data set:
84
+
85
+ | Direction | BLEU | chrF++ |
86
+ |-----------|-------|--------|
87
+ | Frr -> De | 48.79 | 65.12 |
88
+ | De -> Frr | 47.56 | 65.03 |
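
To put the training figures in the new README text into perspective, the following is a rough back-of-the-envelope sketch, not something taken from the commit. It assumes that all 7194 sentence pairs are in the training pool and that no gradient accumulation was used; the commit states neither, and the variable names are purely illustrative.

```python
# Rough training-budget arithmetic for the figures quoted in the README.
# Assumptions (not stated in the commit): all 7194 pairs are used for
# training, and each step processes exactly one batch of 16 pairs.
steps = 5000
batch_size = 16
num_pairs = 7194

# Total number of sentence pairs drawn over the whole run.
examples_seen = steps * batch_size

# Approximate number of passes over the fine-tuning data.
passes_over_data = examples_seen / num_pairs

print(f"examples seen: {examples_seen}")
print(f"approx. passes over the data: {passes_over_data:.1f}")
```

Under these assumptions the run draws 80000 sentence pairs, which is about 11 passes over the data. If batches are sampled at random and the translation direction varies per batch, as in some fine-tuning recipes, this is only an average and not an exact epoch count.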