---
language:
- is
- da
- sv
- 'no'
- fo
widget:
- text: Fina lilla, jag vill inte bliva stur.
- text: Nu ved jeg, at du frygter og end ikke vil nægte mig din eneste søn.
- text: Það er vorhret á, napur vindur sem hvín.
- text: Ja, Gud signi, mítt land.
- text: Alle dyrene i må være venner.
tags:
- roberta
- icelandic
- norwegian
- faroese
- danish
- swedish
- masked-lm
- pytorch
license: agpl-3.0
datasets:
- vesteinn/FC3
- vesteinn/IC3
- mideind/icelandic-common-crawl-corpus-IC3
- NbAiLab/NCC
- DDSC/partial-danish-gigaword-no-twitter
---

# ScandiBERT

Note: The model was updated on 2022-09-27.

The model was trained on the data shown in the table below, with a batch size of 8.8k, for 72 epochs on 24 V100 cards over roughly two weeks.

| Language  | Data                                   | Size   |
|-----------|----------------------------------------|--------|
| Icelandic | See IceBERT paper                      | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus                             | 42 GB  |
| Swedish   | Swedish Gigaword Corpus                | 3.4 GB |
| Faroese   | FC3 + Sosialurinn + Bible              | 69 MB  |

Note: An earlier, half-trained model was briefly published here; it has since been removed and replaced with the updated model.

This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text. It is currently the highest-ranking model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/

If you find this model useful, please cite:

```
@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn and Simonsen, Annika and Glavaš, Goran and Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}
```
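
## Example usage

The card does not include a usage snippet, so here is a minimal sketch of loading the model with the Transformers fill-mask pipeline. The Hub id `vesteinn/ScandiBERT` is an assumption based on this card's location; adjust it to the actual checkpoint name if it differs.

```python
from transformers import pipeline

# Minimal sketch, not part of the original card. The Hub id below is an
# assumption; replace it with the actual checkpoint name if it differs.
fill_mask = pipeline("fill-mask", model="vesteinn/ScandiBERT")

# Use whatever mask token the tokenizer defines (RoBERTa-style models use "<mask>").
mask = fill_mask.tokenizer.mask_token
sentence = f"Ja, Gud signi, mítt {mask}."  # Faroese widget example with one word masked

# Print the top predicted fillers for the masked position with their scores.
for prediction in fill_mask(sentence):
    print(f'{prediction["token_str"]:>15}  {prediction["score"]:.3f}')
```

The sentence reuses one of the widget examples above with `land` masked out; sentences in any of the five listed languages can be substituted.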