# ScandiBERT

Note: The model was updated on 2022-09-27.

The model was trained on the data shown in the table below. The batch size was 8.8k, and the model was trained for 72 epochs on 24 V100 cards for about 2 weeks.

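| Language  | Data                                   | Size   |
|-----------|----------------------------------------|--------|
| Icelandic | See IceBERT paper                      | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus                             | 42 GB  |
| Swedish   | Swedish Gigaword Corpus                | 3.4 GB |
| Faroese   | FC3 + Sosialurinn + Bible              | 69 MB  |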
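
To get a feel for how unevenly the languages are represented, the sizes listed above can be summed and turned into shares. The snippet below is only a rough sketch: it assumes 69 MB equals 0.069 GB (decimal units), and it assumes "about 2 weeks" means 14 full days when estimating GPU-hours. The variable names are illustrative and not part of any released code.

```python
# Corpus sizes in GB, copied from the table above.
# Assumption: 69 MB is treated as 0.069 GB (decimal units).
corpus_gb = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
    "Faroese": 0.069,
}

# Total corpus size and each language's fraction of it.
total_gb = sum(corpus_gb.values())
shares = {lang: size / total_gb for lang, size in corpus_gb.items()}

print(f"Total: {total_gb:.1f} GB")
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {share:6.1%}")

# Rough compute estimate.
# Assumption: "about 2 weeks" is taken as 14 days of 24 hours on all 24 cards.
gpu_hours = 24 * 14 * 24
print(f"Approx. GPU-hours: {gpu_hours}")
```

Under these assumptions, Norwegian makes up roughly 63% of the training data, while Faroese accounts for about 0.1%.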

Note: A half-trained model was previously uploaded here. It has since been removed, and the model has been updated.

This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text. It is currently the highest-ranking model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/

If you find this model useful, please cite

<mask>