Update Readme for version 2.0 of the model
README.md
CHANGED
@@ -13,15 +13,14 @@ This is an [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-60
|
|
13 |
the Northern Frisian dialect Mooring following [this great blogpost](https://cointegrated.medium.com/a37fc706b865).
|
14 |
|
15 |
## Data
|
16 |
-
The dataset for finetuning consisted of
|
17 |
-
|
18 |
-
Most examples (roughly 4200) were taken directly from
|
19 |
["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf)
|
20 |
published by the Nordfriisk Instituut. For sentence splitting the python
|
21 |
-
[sentence-splitting library](https://pypi.org/project/sentence-splitter/) was used
|
22 |
-
especially in cases of direct speech, so that
|
23 |
-
A further roughly
|
24 |
-
Finally, a little
|
25 |
|
26 |
## Usage
|
27 |
How to use the model:
|
@@ -76,4 +75,14 @@ tokenizer = create_tokenizer_with_new_lang(path, 'frr_Latn')
|
|
76 |
model = AutoModelForSeq2SeqLM.from_pretrained(path)
|
77 |
|
78 |
translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
|
79 |
-
```
|
13 |
the Northern Frisian dialect Mooring following [this great blogpost](https://cointegrated.medium.com/a37fc706b865).
|
14 |
|
15 |
## Data
|
16 |
+
The fine-tuning dataset consisted of 7194 sentence pairs in the Ååstermooring dialect of North Frisian, each with a German translation.
|
17 |
+
Most examples (roughly 5100) were taken directly from
|
|
|
18 |
["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf)
|
19 |
published by the Nordfriisk Instituut. For sentence splitting the python
|
20 |
+
[sentence-splitting library](https://pypi.org/project/sentence-splitter/) was used. The splitting wasn't perfect,
|
21 |
+
especially in cases of direct speech, so manual re-alignment and further splitting were necessary.
|
22 |
+
Roughly 2000 further examples were taken from the Frasch Uurdebök (Friesisches Wörterbuch, Neumünster 1988).
|
23 |
+
Finally, a little under 180 very simple self-written examples were used as the evaluation data set.
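
The splitting and re-alignment step described above could look roughly like the following. This is only a sketch, not the script actually used: a naive regex split stands in for the sentence-splitter library, and the helper names `naive_split` and `align_or_flag` as well as the example strings are hypothetical. It pairs sentences one-to-one when both sides split into the same number of sentences and otherwise flags the paragraph for manual re-alignment.

```python
import re
from typing import List, Tuple


def naive_split(text: str) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace.

    Stand-in for the sentence-splitter library (assumption for illustration).
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def align_or_flag(src_paragraph: str, tgt_paragraph: str) -> Tuple[List[Tuple[str, str]], bool]:
    """Return (pairs, needs_manual_review) for one paragraph pair."""
    src = naive_split(src_paragraph)
    tgt = naive_split(tgt_paragraph)
    if len(src) == len(tgt):
        # Same sentence count on both sides: pair sentences by position.
        return list(zip(src, tgt)), False
    # Counts differ (common with direct speech): keep the paragraph as a
    # single pair and flag it so it can be re-aligned by hand.
    return [(src_paragraph.strip(), tgt_paragraph.strip())], True


if __name__ == "__main__":
    # Placeholder strings, not taken from the actual dataset.
    print(align_or_flag("Erster Satz. Zweiter Satz.", "First one. Second one."))
    print(align_or_flag("Komm! sagte er.", "Er sagte komm."))
```

Equal sentence counts do not guarantee a correct alignment, so spot checks of the unflagged pairs would still be sensible.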
|
24 |
|
25 |
## Usage
|
26 |
How to use the model:
|
|
|
75 |
model = AutoModelForSeq2SeqLM.from_pretrained(path)
|
76 |
|
77 |
translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
|
78 |
+
```
|
79 |
+
|
80 |
+
## Training
|
81 |
+
The model was trained in a Google Colab notebook for 5000 steps with a batch size of 16, following the blog post mentioned above.
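
To put the step count in perspective, the short calculation below estimates how often the data was seen during training. It is a back-of-the-envelope sketch under two assumptions that the text does not confirm: each step draws one batch of 16 sentence pairs, and all 7194 pairs count as training data (it is not stated whether the evaluation examples are part of that number).

```python
# Figures quoted in the README text.
STEPS = 5000
BATCH_SIZE = 16
SENTENCE_PAIRS = 7194  # assumption: all pairs are used for training


def pairs_seen(steps: int, batch_size: int) -> int:
    """Total number of sentence pairs drawn over the whole run."""
    return steps * batch_size


def approx_passes(steps: int, batch_size: int, dataset_size: int) -> float:
    """Approximate number of passes over the dataset."""
    return pairs_seen(steps, batch_size) / dataset_size


if __name__ == "__main__":
    print(pairs_seen(STEPS, BATCH_SIZE))  # 80000
    print(round(approx_passes(STEPS, BATCH_SIZE, SENTENCE_PAIRS), 1))  # 11.1
```

Under these assumptions the run corresponds to roughly 11 passes over the data.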
|
82 |
+
|
83 |
+
Metrics on the evaluation data set:
|
84 |
+
|
85 |
+
|           | BLEU  | chrF++ |
|
86 |
+
|-----------|-------|--------|
|
87 |
+
| Frr -> De | 48.79 | 65.12 |
|
88 |
+
| De -> Frr | 47.56 | 65.03 |
|