---
language:
- is
license: cc-by-4.0
datasets:
- igc
---

# Icelandic GPT-2 model

This Icelandic GPT-2 language model was pretrained on the [Icelandic Gigaword Corpus](http://igc.arnastofnun.is/) (IGC, 2020 version), which contains approximately 1.532 billion running words. The model was trained for 20 epochs on a TPU v3-8, with a total training time of 3 days and 21 hours. The hyperparameters used for training are listed in the [JAX/Flax examples documentation](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#train-model-1) of the Transformers library. The model uses a byte-level BPE tokenizer with a vocabulary size of 51,000.

**Note**: This model was pretrained on a tokenized and sentence-segmented version of the IGC, which is reflected in the generated text. A new version of this model, trained on an untokenized version of the IGC (2022 version), is forthcoming.

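The model can be used for text generation with the Transformers library. The snippet below is a minimal sketch: the model identifier `your-username/icelandic-gpt2` is a placeholder, not this model's actual repository name — substitute the correct id from the Hugging Face Hub.

```python
from transformers import pipeline

# NOTE: the model id below is a placeholder -- replace it with the
# actual Hub repository name for this model.
generator = pipeline("text-generation", model="your-username/icelandic-gpt2")

# Generate a continuation of an Icelandic prompt.
output = generator("Ísland er", max_new_tokens=20, do_sample=True)
print(output[0]["generated_text"])
```

Because the model was pretrained on tokenized, sentence-segmented text, generated output may show spacing around punctuation that reflects that preprocessing.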
# Acknowledgments

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).