---
language:
- is
license: cc-by-4.0
datasets:
- igc
---

# Icelandic GPT-2 model

This Icelandic GPT-2 language model was pretrained on the [Icelandic Gigaword Corpus](http://igc.arnastofnun.is/) (IGC, 2020 version), which contains approximately 1,532 million running words (about 1.5 billion). The model was trained for 20 epochs on a TPU v3-8, with a total training time of 3 days and 21 hours. The hyperparameters used for training can be found in the [JAX/Flax documentation](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#train-model-1) for the Transformers library. The model uses a byte-level BPE tokenizer with a vocabulary size of 51,000.

**Note**: This model was pretrained on a tokenized and sentence-segmented version of the IGC, which is reflected in the generated text. A new version of this model, trained on a pre-tokenized version of IGC (2022 version), is forthcoming.

# Acknowledgments

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
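
# Post-processing generated text

Because the model was pretrained on tokenized text, its output tends to have punctuation separated from words by spaces. The snippet below is a minimal sketch of how such output could be cleaned up. The `detokenize` helper is hypothetical and not part of this model or the Transformers library, and the rules assume a simple convention (spaces around common punctuation and brackets); the exact tokenization used in the IGC may differ, and quotation marks are not handled.

```python
import re


def detokenize(text: str) -> str:
    """Reattach space-separated punctuation (hypothetical helper, simplified rules)."""
    # Drop whitespace before closing punctuation: "orð ." -> "orð."
    text = re.sub(r"\s+([.,;:!?)\]])", r"\1", text)
    # Drop whitespace after opening brackets: "( orð" -> "(orð"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    return text.strip()


# Example of tokenized text, similar in shape to what the model may generate.
sample = "Halló , heimur ! Þetta er ( lítið ) dæmi ."
print(detokenize(sample))  # prints: Halló, heimur! Þetta er (lítið) dæmi.
```

For anything beyond quick inspection, a dedicated Icelandic tokenizer/detokenizer is likely a better fit than these regular expressions.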