language: 'no'
license: CC-BY 4.0
tags:
- translation
datasets:
- oscar
widget:
- text: Skriv inn en tekst som du ønsker å oversette til en annen målform.
BLEU score: 88.16
🇳🇴 Bokmål ⇔ Nynorsk 🇳🇴
Norwegian has two relatively similar written languages: Bokmål and Nynorsk. Historically, Nynorsk is a written norm based on dialects, curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegianization' of written Danish. The two written languages are considered equal, and citizens have a right to receive public service information in their primary and preferred language. Even though this right has existed for a long time, only 5-10% of Norwegian texts are written in Nynorsk. Nynorsk is therefore a low-resource language within a low-resource language.
There are no working off-the-shelf machine-learning-based models for translating between the two languages.
| Widget | Try the widget in the top right corner |
| Hugging Face Spaces | Go to mt5 |
| Google Docs Add-on (awaiting approval) | Watch GIF demo |
Pretraining a T5-base
An mT5 model that includes Norwegian already exists. Unfortunately, only a very small part of its training data is Nynorsk: mC4 contains only around 1GB of Nynorsk text. Despite this, mT5 also gives a BLEU score above 80. During the project, we extracted all available Nynorsk text from the Norwegian Colossal Corpus at the National Library of Norway and matched it by material type (books, newspapers, and so on) with an equal amount of Bokmål. The corpus collection is described here, and the total size is 19GB.
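The matching step can be illustrated with a minimal sketch. It is not the project's actual corpus pipeline: the document layout (a dict with `type` and `text` keys), the function name `match_by_material_type`, and the choice to balance by document count rather than by bytes are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def match_by_material_type(nynorsk_docs, bokmaal_docs, seed=0):
    """Sample, per material type, as many Bokmål docs as there are Nynorsk docs.

    Hypothetical helper: balances by document count, which is only one
    possible reading of "an equal amount".
    """
    rng = random.Random(seed)

    # Group the Bokmål pool by material type (book, newspaper, ...).
    pool = defaultdict(list)
    for doc in bokmaal_docs:
        pool[doc["type"]].append(doc)

    # Count how many Nynorsk documents exist for each material type.
    wanted = Counter(doc["type"] for doc in nynorsk_docs)

    matched = []
    for material_type, n in wanted.items():
        candidates = pool.get(material_type, [])
        # Never ask for more documents than the pool holds.
        k = min(n, len(candidates))
        matched.extend(rng.sample(candidates, k))
    return matched


# Toy data, invented for this example.
nynorsk = [
    {"type": "book", "text": "Eg les ei bok."},
    {"type": "newspaper", "text": "Avisa kom i dag."},
    {"type": "newspaper", "text": "Det regnar i Bergen."},
]
bokmaal = [
    {"type": "book", "text": "Jeg leser en bok."},
    {"type": "book", "text": "Boken er god."},
    {"type": "newspaper", "text": "Avisen kom i dag."},
    {"type": "newspaper", "text": "Det regner i Bergen."},
    {"type": "newspaper", "text": "Været er fint."},
]

matched = match_by_material_type(nynorsk, bokmaal)
print(Counter(doc["type"] for doc in matched))
```

Sampling per material type keeps the genre mix of the two halves comparable, so the model does not learn to associate one written form with one kind of text.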
Finetuning
Fine-tuning ran for 30 epochs with a learning rate of 7e-4, a batch size of 32, and a maximum source and target length of 512. It reached a BLEU score of 87.94 during training and a test score of 88.16 after training. Given the similarity of the two languages, a high score is expected; for comparison, a score above 60 is usually considered high.
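To give a sense of what the BLEU numbers measure, here is a simplified sentence-level sketch: clipped n-gram precision for n = 1 to 4, combined by geometric mean and scaled by a brevity penalty. It is not the evaluation script used for this model, and it skips the tokenization and corpus-level aggregation that standard tools apply, so its values are not comparable to the scores reported here. The function name `simple_bleu` and the example sentences are assumptions for illustration, not model output.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU on whitespace tokens, scaled to 0-100."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Without smoothing, any zero precision makes the whole score zero.
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


# Illustrative strings only.
reference = "Ho vil ikkje gje bort dei personlege dataa sine."
exact = simple_bleu(reference, reference)
near = simple_bleu("Ho vil ikkje gi bort dei personlege dataa sine.", reference)
print(exact, near)
```

A single differing word lowers every n-gram precision that overlaps it, which is why closely related languages with mostly shared vocabulary tend to produce high BLEU scores.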
# Set up the pipeline
from transformers import pipeline
translator = pipeline("translation", model='pere/nb-nn-translation')
# Do the translation
text = "Hun vil ikke gi bort sine personlige data."
print(translator(text, max_length=255))