metadata
language: is
widget:
  - text: bjóða þér <mask> í kvöld?
  - text: Forseti <mask> er ágæt.
  - text: Súpan var <mask> á bragðið.
tags:
  - roberta
  - icelandic
  - masked-lm
  - pytorch
license: cc-by-4.0

We do not recommend using this model for anything other than comparison with the other IceBERT models.

IceBERT-mC4-is

This model was trained with fairseq using the RoBERTa-base architecture, on the Icelandic part of the mC4 dataset. It is one of many models we have trained for Icelandic; see the paper referenced below for further details.
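For comparison runs, the model can probably be queried through the `fill-mask` pipeline in the Hugging Face `transformers` library. The snippet below is a minimal sketch, not an official recipe: the repository id in `MODEL_ID` is an assumption (this README does not state it), the helper names `top_predictions` and `run_demo` are made up for illustration, and it assumes the checkpoint on the Hub is loadable by `transformers`. The example sentences are taken from the widget metadata above.

```python
from typing import Any

# Assumption: the Hub repository id is not given in this README.
# Replace it with the actual id of the repository you are viewing.
MODEL_ID = "mideind/IceBERT-mC4-is"

# Example inputs copied from the widget metadata of this model card.
EXAMPLES = [
    "Forseti <mask> er ágæt.",
    "Súpan var <mask> á bragðið.",
]


def top_predictions(results: list[dict[str, Any]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), r["score"]) for r in ranked[:k]]


def run_demo() -> None:
    """Download the model and print the top predictions for each example."""
    # Imported here so the helper above can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    for text in EXAMPLES:
        # For a single string with one <mask>, the pipeline returns a list of dicts.
        predictions = top_predictions(fill_mask(text))
        print(text, "->", predictions)
```

Calling `run_demo()` downloads the weights on first use. Running the same sentences through the other IceBERT models gives a side-by-side comparison, which is the use case this model is meant for.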

Citation

The model is described in the paper "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models" (https://arxiv.org/abs/2201.05601). Please cite the paper if you make use of the model.

@article{DBLP:journals/corr/abs-2201-05601,
  author    = {V{\'{e}}steinn Sn{\ae}bjarnarson and
               Haukur Barri S{\'{\i}}monarson and
               P{\'{e}}tur Orri Ragnarsson and
               Svanhv{\'{\i}}t Lilja Ing{\'{o}}lfsd{\'{o}}ttir and
               Haukur P{\'{a}}ll J{\'{o}}nsson and
               Vilhj{\'{a}}lmur {\TH}orsteinsson and
               Hafsteinn Einarsson},
  title     = {A Warm Start and a Clean Crawled Corpus - {A} Recipe for Good Language
               Models},
  journal   = {CoRR},
  volume    = {abs/2201.05601},
  year      = {2022},
  url       = {https://arxiv.org/abs/2201.05601},
  eprinttype = {arXiv},
  eprint    = {2201.05601},
  timestamp = {Thu, 20 Jan 2022 14:21:35 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2201-05601.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}