Icebreaker tokenizer

This is a BPE tokenizer trained on the Icelandic Gigaword Corpus (IGC), News 1. The tokenizer can be used for training Icelandic language models.

Model Details

BPE tokenizer, trained on the first 242553 files in News 1 of the IGC 2022 unannotated dataset by Arnastofnun.

Model Description

The tokenizer has a vocabulary size of 3200.

  • Developed by: Sigurdur Haukur Birgisson
  • Model type: GPT2Tokenizer
  • Language(s) (NLP): Icelandic
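
To make the idea of a fixed-size BPE vocabulary concrete, here is a minimal sketch of BPE training in plain Python. It is not the procedure used to train this tokenizer: the function names `train_bpe` and `merge_pair` and the toy word list are hypothetical, and the sketch merges characters, whereas a GPT2Tokenizer applies BPE to bytes. It only shows how repeated merges of the most frequent adjacent pair grow the vocabulary until a target size is reached.

```python
from collections import Counter


def merge_pair(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with one merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe(words, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` or no pairs remain."""
    # Each word starts as a tuple of single characters (toy simplification).
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for word in corpus for ch in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite the corpus with the new merged symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            new_corpus[merge_pair(word, best)] += freq
        corpus = new_corpus
    return vocab, merges


# Toy corpus (made up for illustration): 9 distinct characters, so a target
# size of 12 leaves room for three merges.
vocab, merges = train_bpe(["heimur", "heimur", "heima", "halló"], vocab_size=12)
print(sorted(vocab))
print(merges)
```

With a real corpus and a target of 3200, the same loop would simply run for many more merges; production trainers add byte-level pre-tokenization and special tokens on top of this.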

How to Get Started with the Model

Use the code below to get started with the model.

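from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Sigurdur/icebreaker")

# Encode a sample sentence; the token ids are stored under "input_ids"
tokens = tokenizer("Halló heimur!")
print(tokens["input_ids"])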

Model Card Contact

Sigurdur Haukur Birgisson: haukurbirgisson5@gmail.com
