---
language:
- is
library_name: transformers
---

# Icebreaker tokenizer

This is a BPE tokenizer trained on the Icelandic Gigaword Corpus (IGC), News 1. It can be used for training Icelandic language models.

## Model Details

BPE tokenizer, trained on the first 242,553 files of the News 1 part of the IGC 2022 unannotated dataset by Arnastofnun.

### Model Description

It has a vocabulary size of 3200.

- **Developed by:** Sigurdur Haukur Birgisson
- **Model type:** GPT2Tokenizer
- **Language(s) (NLP):** Icelandic

### Model Sources

- **Repository:** https://github.com/sigurdurhaukur/tokenicer

## How to Get Started with the Model

Use the code below to get started with the model.

```py
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")

# Encode a sentence; the result holds the token IDs and attention mask
tokens = tokenizer("Halló heimur!")
print(tokens["input_ids"])
```

## Model Card Contact

Sigurdur Haukur Birgisson: haukurbirgisson5@gmail.com
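
## Illustrative Sketch of BPE Merges

For readers unfamiliar with BPE, the sketch below shows the general idea behind learning merge rules: repeatedly find the most frequent pair of adjacent symbols and fuse it into a new symbol. It is a toy, character-level simplification and not the code used to train this tokenizer (a GPT-2 style tokenizer works on bytes and handles whitespace differently). The function names `merge_pair`, `learn_bpe_merges`, and `apply_merges`, as well as the tiny word list, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with one fused symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # Nothing left to merge.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_merges(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus, for illustration only.
merges = learn_bpe_merges(["halló", "heimur", "heimar", "halló"], num_merges=5)
print(apply_merges("heimur", merges))
```

Each learned merge adds one symbol to the vocabulary, so the target vocabulary size effectively sets how many merges are learned on top of the base symbols.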