pere commited on
Commit
bf8422d
1 Parent(s): 4d22a43

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +4 -2
README.md CHANGED
@@ -14,12 +14,14 @@ widget:
14
  Norwegian has two relatively similar written languages; Bokmål and Nynorsk. Historically Nynorsk is a written norm based on dialects curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegization' of written Danish.
15
  The two written languages are considered equal and citizens have a right to receive public service information in their primary and prefered language. Even though this right has been around for a long time only between 5-10% of Norwegian texts are written in Nynorsk. Nynorsk is therefore a low-resource language within a low-resource language.
16
 
17
- For translating between the two languages, there are not any working off-the-shelf machine learning-based translation models.
 
18
  | | |
19
  |---|---|
20
  | Widget | Try the widget in the top right corner |
21
- | Huggingface Spaces | Go to [mt5](https://huggingface.co/google/mt5-base) |
22
  | | |
 
23
  ## Pretraining a T5-base
24
  There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
25
 
 
14
  Norwegian has two relatively similar written languages; Bokmål and Nynorsk. Historically Nynorsk is a written norm based on dialects curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegization' of written Danish.
15
  The two written languages are considered equal and citizens have a right to receive public service information in their primary and prefered language. Even though this right has been around for a long time only between 5-10% of Norwegian texts are written in Nynorsk. Nynorsk is therefore a low-resource language within a low-resource language.
16
 
17
+ Apart from some word-list based engines, there are not any working off-the-shelf machine learning-based translation models. Translation between Bokmål and Nynorsk is not available in Google Translate.
18
+
19
  | | |
20
  |---|---|
21
  | Widget | Try the widget in the top right corner |
22
+ | Huggingface Spaces | Go to [mt5](https://huggingface.co/google/mt5-base) |
23
  | | |
24
+
25
  ## Pretraining a T5-base
26
  There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
27