Jón Daðason
---
language:
- is
license: cc-by-4.0
datasets:
- igc
---
# Icelandic GPT-2 model
This Icelandic GPT-2 language model was pretrained on the [Icelandic Gigaword Corpus](http://igc.arnastofnun.is/) (IGC, 2020 version), which contains approximately 1.532 billion running words. The model was trained for 20 epochs on a TPU v3-8, with a total training time of 3 days and 21 hours. The training hyperparameters are those listed in the [JAX/Flax documentation](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#train-model-1) for the Transformers library. The model uses a byte-level BPE tokenizer with a vocabulary size of 51,000.
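A minimal sketch of how the model might be used for text generation with the Transformers `pipeline` API. The repository id `jonfd/gpt2-igc-is` is an assumption, not something stated in this card, and the sampling settings are illustrative rather than recommended values. The sketch also assumes PyTorch weights are available; if only Flax weights are published, loading may need `from_flax=True` or the Flax model classes instead.

```python
# Assumed Hub repository id (not confirmed by this card); adjust if it differs.
MODEL_ID = "jonfd/gpt2-igc-is"


def generate(prompt: str, max_new_tokens: int = 40, seed: int = 0) -> str:
    """Return one sampled continuation of `prompt`, including the prompt itself."""
    # Imported lazily so that loading this module does not require the library
    # or trigger a model download.
    from transformers import pipeline, set_seed

    set_seed(seed)  # make sampling reproducible
    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # illustrative sampling settings
        top_p=0.95,
    )
    return outputs[0]["generated_text"]


if __name__ == "__main__":
    # "Einu sinni var" means "Once upon a time there was".
    print(generate("Einu sinni var"))
```

The first call downloads the model weights and tokenizer, so it needs network access.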
**Note**: This model was pretrained on a tokenized and sentence-segmented version of the IGC, which is reflected in the generated text. A new version of this model, trained on a pre-tokenized version of IGC (2022 version), is forthcoming.
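Because the generated text follows the tokenized training data, a small post-processing step can make the output easier to read. The sketch below assumes that tokenization shows up as punctuation separated from words by spaces (for example `orð .` instead of `orð.`). That convention and the helper name `detokenize` are assumptions for illustration, not part of the model or the corpus tooling.

```python
import re


def detokenize(text: str) -> str:
    """Rejoin punctuation that a tokenizer split off with spaces (hypothetical helper)."""
    # Drop whitespace before closing punctuation: "orð ." -> "orð."
    text = re.sub(r"\s+([.,;:!?)\]])", r"\1", text)
    # Drop whitespace after opening brackets: "( orð" -> "(orð"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    return text.strip()


if __name__ == "__main__":
    # "This is a test, she said."
    print(detokenize("Þetta er próf , sagði hún ."))
```

This handles only the most common cases; quotation marks and abbreviations would need additional rules.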
# Acknowledgments
This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).