---
language: is
widget:
- text: Má bjóða þér <mask> í kvöld?
- text: Forseti <mask> er ágæt.
- text: Súpan var <mask> á bragðið.
tags:
- roberta
- icelandic
- masked-lm
- pytorch
license: agpl-3.0
datasets:
- mideind/icelandic-common-crawl-corpus-IC3
---

# IceBERT

IceBERT was trained with fairseq using the RoBERTa-base architecture. The training data used is shown in the table below.

| Dataset                                               | Size    | Tokens |
|-------------------------------------------------------|---------|--------|
| Icelandic Gigaword Corpus v20.05 (IGC)                | 8.2 GB  | 1,388M |
| Icelandic Common Crawl Corpus (IC3)                   | 4.9 GB  | 824M   |
| Greynir News articles                                 | 456 MB  | 76M    |
| Icelandic Sagas                                       | 9 MB    | 1.7M   |
| Open Icelandic e-books (Rafbókavefurinn)              | 14 MB   | 2.6M   |
| Data from the medical library of Landspitali          | 33 MB   | 5.2M   |
| Student theses from Icelandic universities (Skemman)  | 2.2 GB  | 367M   |
| Total                                                 | 15.8 GB | 2,664M |

If you find this model useful, please cite:

```
@inproceedings{snaebjarnarson-etal-2022-warm,
    title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
    author = "Sn{\ae}bjarnarson, V{\'e}steinn and
      S{\'\i}monarson, Haukur Barri and
      Ragnarsson, P{\'e}tur Orri and
      Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja and
      J{\'o}nsson, Haukur and
      Thorsteinsson, Vilhjalmur and
      Einarsson, Hafsteinn",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.464",
    pages = "4356--4366",
}
```
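
# Usage

The model can be used directly for masked-token prediction with the Hugging Face `transformers` library. The sketch below is a minimal example, assuming the checkpoint is published on the Hub under the `mideind` organization as `mideind/IceBERT`; replace the identifier if the actual repository name differs.

```python
# Minimal fill-mask sketch with the transformers pipeline API.
# NOTE: the Hub id "mideind/IceBERT" is an assumption, not confirmed by this card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mideind/IceBERT")

# RoBERTa-style models use "<mask>" as the mask token.
for prediction in fill_mask("Má bjóða þér <mask> í kvöld?"):
    print(prediction["token_str"], prediction["score"])
```

Each prediction is a dictionary containing the proposed token, its score, and the completed sequence, so the loop above prints the model's top candidates for the masked position.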