Icebreaker
This is a BPE tokenizer trained on the Icelandic Gigaword Corpus (IGC 2022, News 1). It can be used for training Icelandic language models.
The tokenizer was trained on the first 242,553 files of the unannotated News 1 subset of the IGC 2022 dataset by Arnastofnun.
It has a vocabulary size of 3200.
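As background on how a BPE vocabulary like this one is built, the toy sketch below (not the actual training code for this tokenizer) implements the classic byte-pair merging loop: start from single characters and repeatedly merge the most frequent adjacent symbol pair until the desired number of merges is reached. The corpus string and merge count here are illustrative only.

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair (as whole symbols) with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    new_vocab = Counter()
    for word, freq in vocab.items():
        new_vocab[pattern.sub(merged, word)] += freq
    return new_vocab

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a whitespace-tokenized corpus string."""
    # Start with each word spelled out as space-separated single characters.
    vocab = Counter(" ".join(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

merges = train_bpe("low low low lower lowest", 10)
print(merges)
```

A real training run (as for this tokenizer) would use an optimized implementation such as the Hugging Face `tokenizers` library with a target vocabulary size of 3200, but the merge-selection logic is the same.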
Use the code below to get started with the model.
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")
tokens = tokenizer("Halló heimur!")  # "Hello world!"
Sigurdur Haukur Birgisson: haukurbirgisson5@gmail.com