CmdCody/nllb-deu-moo

@@ -13,15 +13,14 @@ This is an [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-60
 the Northern Frisian dialect Mooring following [this great blogpost](https://cointegrated.medium.com/a37fc706b865).
 ## Data
-The dataset for finetuning consisted of 5597 sentence pairs of the Ååstermooring dialect of North Frisian with German translation,
-with 500 random pairs being retained for validation.
-Most examples (roughly 4200) were taken directly from
 ["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf)
 published by the Nordfriisk Instituut. For sentence splitting the python
-[sentence-splitting library](https://pypi.org/project/sentence-splitter/) was used, however the splitting wasn't perfect,
-especially in cases of direct speech, so that after manual re-alignment still many of these pairs consisted in fact of multiple sentences.
-A further roughly 1200 examples were taken from the Frasch Uurdebök, Friesisches Wörterbuch, Neumünster 1988.
-Finally, a little over 100 very simple self-written examples were added.
 ## Usage
 How to use the model:
@@ -76,4 +75,14 @@ tokenizer = create_tokenizer_with_new_lang(path, 'frr_Latn')
 model = AutoModelForSeq2SeqLM.from_pretrained(path)
 translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
-```

 the Northern Frisian dialect Mooring following [this great blogpost](https://cointegrated.medium.com/a37fc706b865).
 ## Data
+The dataset for finetuning consisted of 7194 sentence pairs in the Ååstermooring dialect of North Frisian with German translations.
+Most examples (roughly 5100) were taken directly from
 ["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf)
 published by the Nordfriisk Instituut. For sentence splitting the python
+[sentence-splitting library](https://pypi.org/project/sentence-splitter/) was used. The splitting wasn't perfect,
+especially in cases of direct speech, so manual re-alignment and further splitting were necessary.
+Roughly 2000 further examples were taken from the Frasch Uurdebök, Friesisches Wörterbuch, Neumünster 1988.
+Finally, a little under 180 very simple self-written examples were used as the evaluation data set.
 ## Usage
 How to use the model:
 model = AutoModelForSeq2SeqLM.from_pretrained(path)
 translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
+```
+## Training
+The model was trained in a Google Colab notebook for 5000 steps with a batch size of 16, following the above-mentioned blog post.
+Metrics on the evaluation data set:
+|           | BLEU  | chrF++ |
+|-----------|-------|--------|
+| Frr -> De | 48.79 | 65.12  |
+| De -> Frr | 47.56 | 65.03  |
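
To give a feel for what the chrF family of metrics in the table measures, here is a minimal sketch of a sentence-level character n-gram F-score in plain Python. It is an illustration only, not the evaluation code behind the numbers above: the names `char_ngrams` and `simple_chrf` are hypothetical, and the sketch leaves out the word n-grams that chrF++ adds on top of character n-grams, so its scores will not match the table. For reported results an established tool such as sacreBLEU is the safer choice.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, with all whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score on a 0-100 scale.

    Assumption: precision and recall are averaged over the n-gram
    orders that exist in both strings; word n-grams are not used.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        # Clipped overlap: each n-gram counts at most as often as it
        # occurs in the other string.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta score; beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Calling `simple_chrf(model_output, reference_translation)` for each pair in an evaluation set and averaging the results gives a rough sentence-level score. Corpus-level tools aggregate the n-gram counts over all sentences first, which is another reason the numbers differ.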