Edit model card

Icelandic Tokenizer README

Overview

This BPE (Byte Pair Encoding) tokenizer is designed for the Icelandic GPT model, available at Sigurdur/ice-gpt. Trained on the Icelandic Gigaword Corpus ({IGC}-2022) - annotated version, it excels in accurately segmenting Icelandic text into meaningful tokens.

Usage

Integrate this tokenizer into your NLP pipeline for preprocessing Icelandic text. The following example demonstrates basic usage:

from transformers import GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("Sigurdur/ice-tokenizer")
tokenizer.pad_token_id = tokenizer.eos_token_id

tokenizer("Halló heimur!")["input_ids"]

Citation

If you use this tokenizer in your work, please cite the original source of the training data:

@misc{20.500.12537/254,
  title = {Icelandic Gigaword Corpus ({IGC}-2022) - annotated version},
  author = {Barkarson, Starkaður and Steingrímsson, Steinþór and Andrésdóttir, Þórdís Dröfn and Hafsteinsdóttir, Hildur and Ingimundarson, Finnur Ágúst and Magnússon, Árni Davíð},
  url = {http://hdl.handle.net/20.500.12537/254},
  note = {{CLARIN}-{IS}},
  year = {2022}
}

Feedback

We welcome user feedback to enhance the tokenizer's functionality. Feel free to reach out with your insights and suggestions.

Happy tokenizing!

Sigurdur Haukur Birgisson

(readme created with chatgpt)

Downloads last month
0
Unable to determine this model’s pipeline type. Check the docs .

Collection including Sigurdur/ice-tokenizer