---
language:
- is
- da
- sv
- 'no'
- fo
widget:
- text: Fina lilla<mask>, jag vill inte bliva stur.
- text: Nu ved jeg, at du frygter<mask> og end ikke vil nægte mig din eneste søn..
- text: Það er vorhret á<mask>, napur vindur sem hvín.
- text: Ja, Gud signi<mask>, mítt land.
- text: Alle dyrene i<mask>  være venner.
tags:
- roberta
- icelandic
- norwegian
- faroese
- danish
- swedish
- masked-lm
- pytorch
license: agpl-3.0
datasets:
- vesteinn/FC3
- vesteinn/IC3
- mideind/icelandic-common-crawl-corpus-IC3
- NbAiLab/NCC
- DDSC/partial-danish-gigaword-no-twitter
---

# ScandiBERT

Note: The model was updated on 2022-09-27.

The model was trained on the data shown in the table below. Training used a batch size of 8.8k and ran for 72 epochs on 24 V100 cards for about two weeks.

| Language  | Data                                  | Size   |
|-----------|---------------------------------------|--------|
| Icelandic | See IceBERT paper                     | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl Twitter) | 4.7 GB |
| Norwegian | NCC corpus                            | 42 GB  |
| Swedish   | Swedish Gigaword Corpus               | 3.4 GB |
| Faroese   | FC3 + Sosialurinn + Bible             | 69 MB  |


Note: An earlier, half-trained version of the model was briefly available here; it has since been removed and replaced with the updated model.

This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian, and Swedish text. It is currently the highest-ranking model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/
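A minimal fill-mask sketch with the `transformers` pipeline is shown below; the repository id `vesteinn/ScandiBERT` is an assumption and should be replaced with this model's actual path on the Hub.

```python
from transformers import pipeline

# Assumed repository id (hypothetical); replace with the actual Hub path of this model.
fill_mask = pipeline("fill-mask", model="vesteinn/ScandiBERT")

# One of the widget examples from this card (Icelandic); <mask> is the RoBERTa mask token.
for prediction in fill_mask("Það er vorhret á<mask>, napur vindur sem hvín."):
    print(prediction["token_str"], prediction["score"])
```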

If you find this model useful, please cite:

```bibtex
@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn  and
      Simonsen, Annika  and
      Glavaš, Goran  and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}
```