Micka Tokenizer
The Micka Tokenizer is a subword tokenizer with a vocabulary size of 32,768 tokens, designed to support a wide array of natural language processing (NLP) tasks in both Slovenian and English.
It includes a series of special tokens tailored for handling various structured language tasks, from standard natural language processing needs to more specific chatbot and conversational models. Special Unicode characters are used as delimiters (⸢ and ⸥) at the start and end of each special token, reducing the likelihood that these token sequences will appear in regular text data (a small composition sketch follows the list below).
- Start delimiter: ⸢ (Unicode: U+2E22)
- End delimiter: ⸥ (Unicode: U+2E25)
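As a minimal illustration of how the delimiters compose special tokens, consider the following sketch. The helper function is purely illustrative and not part of the tokenizer's public API; only the delimiter code points and token names come from this card.

```python
# Compose Micka special tokens from the two delimiter code points.
# The token names come from this card; the helper itself is
# illustrative and not part of the tokenizer's public API.
START = "\u2E22"  # ⸢ TOP LEFT HALF BRACKET
END = "\u2E25"    # ⸥ BOTTOM RIGHT HALF BRACKET

def special(name: str) -> str:
    """Wrap a token name in the Micka delimiters."""
    return f"{START}{name}{END}"

print(special("PAD"))  # -> ⸢PAD⸥
print(special("UNK"))  # -> ⸢UNK⸥
```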
Special Tokens
The tokenizer supports a range of special tokens that enable structured processing, padding, and segmentation in different tasks:
Padding, Masking, and Separation Tokens:
- Padding (⸢PAD⸥): Used to pad sequences to a uniform length (see the sketch after this list).
- Masking (⸢MSK⸥): Marks parts of a sequence that should be masked for certain tasks, such as masked language modeling.
- Separation (⸢|⸥): A general-purpose separator token.
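A rough sketch of padding in practice is shown below. Both the pad_batch helper and the pad_id value are hypothetical; the actual integer id of ⸢PAD⸥ depends on the trained vocabulary.

```python
# Hypothetical sketch: right-padding token-id sequences to a uniform
# length. pad_id = 0 is an assumption, not the tokenizer's actual id.
def pad_batch(batch: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

print(pad_batch([[5, 7, 9], [3]]))  # -> [[5, 7, 9], [3, 0, 0]]
```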
Sentence and Paragraph Structure Tokens:
- Start of Sentence (⸢s⸥) and End of Sentence (⸢/s⸥): Define sentence boundaries for better structuring and parsing within long text sequences.
- Start of Paragraph (⸢p⸥) and End of Paragraph (⸢/p⸥): Define paragraph breaks, which can be helpful for document-based or summarization tasks where paragraph structure is essential (an illustrative sketch follows this list).
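For illustration, a two-sentence Slovenian paragraph wrapped in these structure tokens might look as follows. Whether the model expects exactly this nesting is an assumption of the sketch, not something this card prescribes.

```python
# Illustrative nesting of paragraph and sentence tokens; the exact
# layout the model expects is an assumption on the part of this sketch.
S, S_END = "\u2E22s\u2E25", "\u2E22/s\u2E25"
P, P_END = "\u2E22p\u2E25", "\u2E22/p\u2E25"

doc = P + S + "Prva poved." + S_END + " " + S + "Druga poved." + S_END + P_END
print(doc)  # ⸢p⸥⸢s⸥Prva poved.⸢/s⸥ ⸢s⸥Druga poved.⸢/s⸥⸢/p⸥
```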
Out-of-Vocabulary / Unknown Token:
- Unknown Token (⸢UNK⸥): Represents out-of-vocabulary words or unknown tokens that the model encounters.
Chatbot and Conversational Tokens:
- System Message Start (⸢SYS⸥): Marks the beginning of a system message, useful in multi-turn dialogue systems.
- User Message Start (⸢USR⸥): Designates the start of a user’s input, enabling clear differentiation in conversation logs.
- Agent Message Start (⸢AGT⸥): Indicates the start of a response from the chatbot or conversational agent (see the layout sketch after this list).
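One possible way to lay out a multi-turn exchange with these markers is sketched below. This is an illustrative convention only; the tokenizer defines the markers, while the layout itself is an assumption and not an official chat template.

```python
# Illustrative dialogue layout using the conversational markers; the
# tokenizer defines the markers, while this layout is an assumption.
SYS, USR, AGT = "\u2E22SYS\u2E25", "\u2E22USR\u2E25", "\u2E22AGT\u2E25"

def format_dialogue(system: str, turns: list[tuple[str, str]]) -> str:
    """Concatenate a system message and (role, text) turns into one string."""
    parts = [SYS + system]
    for role, text in turns:
        parts.append((USR if role == "user" else AGT) + text)
    return "".join(parts)

print(format_dialogue("Bodi vljuden.",
                      [("user", "Zdravo!"),
                       ("agent", "Zdravo, kako lahko pomagam?")]))
```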
Model Details
- Developed by: Marko Kokol
- Version: 1.0
- License: CC BY-SA 4.0
- Languages Supported: Slovenian (sl), English (en)
Training Data
The tokenizer was trained on the following datasets:
- Gigafida Corpus: A comprehensive reference corpus of written standard Slovene, containing over 1.1 billion words from texts composed between 1990 and 2018 (cjvt.si).
- Wikipedia Subset:
  - English Wikipedia: A subset of 10,000 pages.
  - Slovenian Wikipedia: All pages not identified as stubs.
Intended Use
The Micka Tokenizer is suitable for various NLP tasks, including but not limited to:
- Text classification
- Named entity recognition
- Machine translation
- Text generation
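Assuming the tokenizer is published in a Hugging Face transformers-compatible format at the repository cited below, a minimal usage sketch could look like this:

```python
# Minimal usage sketch. The repo id comes from the citation below;
# that it loads via AutoTokenizer is an assumption of this sketch.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("klokedm/micka-32768")
ids = tok.encode("Ljubljana je glavno mesto Slovenije.")
print(tok.convert_ids_to_tokens(ids))
```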
Limitations
- Language Coverage: While the tokenizer is trained on Slovenian and English texts, performance may degrade on texts from other languages.
- Domain Specificity: The tokenizer may not perform optimally on specialized domains not represented in the training data.
Funding
The development of the Micka Tokenizer was partially funded by the PoVeJMo project, which aims to develop large language models for the Slovenian language. The PoVeJMo project is co-financed by:
License
This tokenizer is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). This license allows for sharing and adaptation, provided appropriate credit is given and any derivatives are distributed under the same license.
Citation
If you use the Micka Tokenizer in your work, please cite it as follows:

Marko Kokol. (2024). Micka Tokenizer: A Subword Tokenizer for Slovenian and English. Version 1.0. Available at https://huggingface.co/klokedm/micka-32768.