Micka Tokenizer
The Micka Tokenizer is a subword tokenizer with a vocabulary size of 32,768 tokens, designed to support a wide array of natural language processing (NLP) tasks in both Slovenian and English.
It includes a series of special tokens tailored for handling various structured language tasks, from standard natural language processing needs to more specific chatbot and conversational models. Special Unicode characters are used as delimiters (⸢ and ⸥) at the start and end of each special token, reducing the likelihood that these token sequences will appear in regular text data (a small composition sketch follows the list below).
- Start delimiter: ⸢ (Unicode: U+2E22)
- End delimiter: ⸥ (Unicode: U+2E25)
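As a minimal illustration of how the delimiters compose special tokens, consider the following sketch. The helper function is purely illustrative and not part of the tokenizer's public API; only the delimiter code points and token names come from this card.

```python
# Compose Micka special tokens from the two delimiter code points.
# The token names come from this card; the helper itself is
# illustrative and not part of the tokenizer's public API.
START = "\u2E22"  # ⸢ TOP LEFT HALF BRACKET
END = "\u2E25"    # ⸥ BOTTOM RIGHT HALF BRACKET

def special(name: str) -> str:
    """Wrap a token name in the Micka delimiters."""
    return f"{START}{name}{END}"

print(special("PAD"))  # -> ⸢PAD⸥
print(special("UNK"))  # -> ⸢UNK⸥
```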
Special Tokens
The tokenizer supports a range of special tokens that enable structured processing, padding, and segmentation in different tasks:
Padding, Masking, and Separation Tokens:
- Padding (⸢PAD⸥): Used to pad sequences to a uniform length (see the sketch after this list).
- Masking (⸢MSK⸥): Marks parts of a sequence that should be masked for certain tasks, such as masked language modeling.
- Separation (⸢|⸥): A general-purpose separator token.
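A rough sketch of padding in practice is shown below. Both the pad_batch helper and the pad_id value are hypothetical; the actual integer id of ⸢PAD⸥ depends on the trained vocabulary.

```python
# Hypothetical sketch: right-padding token-id sequences to a uniform
# length. pad_id = 0 is an assumption, not the tokenizer's actual id.
def pad_batch(batch: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

print(pad_batch([[5, 7, 9], [3]]))  # -> [[5, 7, 9], [3, 0, 0]]
```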
Sentence and Paragraph Structure Tokens:
- Start of Sentence (⸢s⸥) and End of Sentence (⸢/s⸥): Define sentence boundaries for better structuring and parsing within long text sequences.
- Start of Paragraph (⸢p⸥) and End of Paragraph (⸢/p⸥): Define paragraph breaks, which can be helpful for document-based or summarization tasks where paragraph structure is essential (an illustrative sketch follows this list).
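For illustration, a two-sentence Slovenian paragraph wrapped in these structure tokens might look as follows. Whether the model expects exactly this nesting is an assumption of the sketch, not something this card prescribes.

```python
# Illustrative nesting of paragraph and sentence tokens; the exact
# layout the model expects is an assumption on the part of this sketch.
S, S_END = "\u2E22s\u2E25", "\u2E22/s\u2E25"
P, P_END = "\u2E22p\u2E25", "\u2E22/p\u2E25"

doc = P + S + "Prva poved." + S_END + " " + S + "Druga poved." + S_END + P_END
print(doc)  # ⸢p⸥⸢s⸥Prva poved.⸢/s⸥ ⸢s⸥Druga poved.⸢/s⸥⸢/p⸥
```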
Out-of-Vocabulary / Unknown Token:
- Unknown Token (⸢UNK⸥): Represents out-of-vocabulary words or unknown tokens that the model encounters.
Chatbot and Conversational Tokens:
- System Message Start (⸢SYS⸥): Marks the beginning of a system message, useful in multi-turn dialogue systems.
- User Message Start (⸢USR⸥): Designates the start of a user’s input, enabling clear differentiation in conversation logs.
- Agent Message Start (⸢AGT⸥): Indicates the start of a response from the chatbot or conversational agent (see the layout sketch after this list).
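One possible way to lay out a multi-turn exchange with these markers is sketched below. This is an illustrative convention only; the tokenizer defines the markers, while the layout itself is an assumption and not an official chat template.

```python
# Illustrative dialogue layout using the conversational markers; the
# tokenizer defines the markers, while this layout is an assumption.
SYS, USR, AGT = "\u2E22SYS\u2E25", "\u2E22USR\u2E25", "\u2E22AGT\u2E25"

def format_dialogue(system: str, turns: list[tuple[str, str]]) -> str:
    """Concatenate a system message and (role, text) turns into one string."""
    parts = [SYS + system]
    for role, text in turns:
        parts.append((USR if role == "user" else AGT) + text)
    return "".join(parts)

print(format_dialogue("Bodi vljuden.",
                      [("user", "Zdravo!"),
                       ("agent", "Zdravo, kako lahko pomagam?")]))
```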
Model Details
- Developed by: Marko Kokol
- Version: 1.0
- License: CC BY-SA 4.0
- Languages Supported: Slovenian (sl), English (en)
Training Data
The tokenizer was trained on the following datasets:
- Gigafida Corpus: A comprehensive reference corpus of written standard Slovene, containing over 1.1 billion words from texts composed between 1990 and 2018 (cjvt.si).
- Wikipedia Subset:
  - English Wikipedia: A subset of 10,000 pages.
  - Slovenian Wikipedia: All pages not identified as stubs.
Intended Use
The Micka Tokenizer is suitable for various NLP tasks, including but not limited to:
- Text classification
- Named entity recognition
- Machine translation
- Text generation
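Assuming the tokenizer is published in a Hugging Face transformers-compatible format at the repository cited below, a minimal usage sketch could look like this:

```python
# Minimal usage sketch. The repo id comes from the citation below;
# that it loads via AutoTokenizer is an assumption of this sketch.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("klokedm/micka-32768")
ids = tok.encode("Ljubljana je glavno mesto Slovenije.")
print(tok.convert_ids_to_tokens(ids))
```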
Limitations
- Language Coverage: While the tokenizer is trained on Slovenian and English texts, performance may degrade on texts from other languages.
- Domain Specificity: The tokenizer may not perform optimally on specialized domains not represented in the training data.
Funding
The development of the Micka Tokenizer was partially funded by the PoVeJMo project, which aims to develop large language models for the Slovenian language. The PoVeJMo project is co-financed by:
License
This tokenizer is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). This license allows for sharing and adaptation, provided appropriate credit is given and any derivatives are distributed under the same license.
Citation
If you use the Micka Tokenizer in your work, please cite it as follows:

Marko Kokol. (2024). Micka Tokenizer: A Subword Tokenizer for Slovenian and English. Version 1.0. Available at https://huggingface.co/klokedm/micka-32768.