
Purpose and Design of the Tokenizer

The tokenizer has been crafted with a specific focus on complementing the capabilities of Llama2 and Llama2-based models. Here's a detailed overview of its design philosophy and linguistic proficiency:

Vocabulary Composition

  • Size: The tokenizer has a vocabulary of 32,000 tokens, matching the vocabulary size of the original Llama2 tokenizer and providing wide coverage of linguistic elements.
  • Design for Llama2 Models: It is explicitly designed to integrate seamlessly with Llama2 and Llama2-based models, providing a rich, well-structured vocabulary that improves their handling of Bulgarian text.

Linguistic Proficiency

  • English Understanding: The tokenizer can process English, offering basic support for tokenizing English texts.
  • Bulgarian Proficiency: Its proficiency is significantly higher for Bulgarian. It has been specifically designed and optimized for the Bulgarian language, ensuring superior performance in recognizing and tokenizing Bulgarian texts.

Example with Hugging Face's Transformers API

from transformers import LlamaTokenizerFast

# Initialize the tokenizer
tokenizer = LlamaTokenizerFast.from_pretrained("ldilov/llama2-bg-tokenizer")

# Example text in Bulgarian
bg_text = "Това е примерен текст на български език."

# Example text in English
en_text = "This is a sample text in English."

# Tokenize the Bulgarian text
tokens_bg = tokenizer(bg_text)
print("Tokenized Bulgarian Text:", tokens_bg)

# Tokenize the English text
tokens_en = tokenizer(en_text)
print("Tokenized English Text:", tokens_en)

Output

Running the above code will produce output similar to the following, which shows the tokenized representation of both Bulgarian and English texts:

Tokenized Bulgarian Text: {'input_ids': [...], 'attention_mask': [...]}
Tokenized English Text: {'input_ids': [...], 'attention_mask': [...]}

Training Approach Overview

The tokenizer training approach showcases a sophisticated and advanced methodology tailored for optimizing tokenizer performance specifically for Llama2 compatibility. This approach stands out due to its comprehensive customization capabilities, dynamic adjustments based on dataset analysis, and the integration of advanced tokenization techniques. Here are the key components and advantages:

Customization and Special Tokens

  • Dynamic Special Tokens: Incorporates a configurable set of special tokens (<unk>, <s>, </s>), enhancing the tokenizer's ability to handle unknown tokens and start/end of sequence markers effectively.
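
For illustration, a minimal sketch of how these special tokens could be registered with the Hugging Face tokenizers library (this is not the project's actual training script):

from tokenizers import Tokenizer, models

# Minimal sketch: build a BPE tokenizer and register the special tokens named above
# so they are treated as atomic units and never split.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.add_special_tokens(["<unk>", "<s>", "</s>"])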

Advanced Configuration

  • Utilizes a detailed configuration to fine-tune tokenizer behavior, including dropout rates, minimum token frequency, maximum sequence lengths, and padding strategies, ensuring optimal tokenization for varied text inputs.
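
A hedged sketch of these knobs using the tokenizers library; the concrete values (dropout, min_frequency, max_length, padding token) are assumptions, not the published settings:

from tokenizers import Tokenizer, models, trainers

# Illustrative configuration only; the actual values used for this tokenizer are not shown here.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>", dropout=0.1))  # BPE-dropout rate (assumed)
trainer = trainers.BpeTrainer(vocab_size=32000, min_frequency=2)   # minimum token frequency (assumed)
tokenizer.enable_truncation(max_length=4096)                       # maximum sequence length (assumed)
tokenizer.enable_padding(pad_id=0, pad_token="<unk>")              # padding strategy (assumed)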

Decoders

  • Components:
    • decoders.Replace: Replaces specified characters (e.g., the whitespace marker "▁") with another character (e.g., a space), aiding in the reconstruction of the original text from tokenized sequences.
    • decoders.ByteFallback: Provides a fallback mechanism for handling bytes directly, useful for dealing with unknown or out-of-vocabulary tokens.
    • decoders.Fuse(): Fuses consecutive tokens when possible to reduce tokenization granularity, potentially improving model performance by reducing sparsity.
    • decoders.Strip: Removes leading or trailing characters (e.g., spaces), cleaning up the tokenized output for further processing.
  • Impact: Decoders play a crucial role in translating tokenized sequences back into human-readable text, ensuring the tokenizer's output remains faithful to the original input while accommodating the model's needs.
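
Taken together, these components correspond to a decoder pipeline along the following lines (a sketch with the tokenizers library; the Strip arguments are assumptions):

from tokenizers import decoders

# Decoder pipeline mirroring the components listed above.
# Assign it with: tokenizer.decoder = decoder
decoder = decoders.Sequence([
    decoders.Replace("▁", " "),  # turn the whitespace marker back into a space
    decoders.ByteFallback(),     # recover raw bytes for out-of-vocabulary pieces
    decoders.Fuse(),             # fuse consecutive tokens into a single string
    decoders.Strip(" ", 1, 0),   # strip the leading space added during normalization
])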

Normalizers

  • Components:
    • Prepend("▁"): Adds the whitespace marker "▁" to the beginning of the text, so the first word is treated consistently with words preceded by a space.
    • Replace(r" ", "▁"): Replaces spaces with a specified character, aiding in distinguishing between spaces as part of the text and as token separators.
    • NFKC(): Applies Unicode normalization (NFKC), standardizing characters and reducing the complexity of text encoding.
  • Impact: Normalizers standardize and prepare the input text for tokenization, improving the model's robustness and consistency in handling diverse text inputs.
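
These steps map onto a normalizer sequence roughly like this (assuming a tokenizers release that provides normalizers.Prepend):

from tokenizers import normalizers

# Normalizer pipeline matching the components listed above.
normalizer = normalizers.Sequence([
    normalizers.Prepend("▁"),       # prepend the whitespace marker to the input
    normalizers.Replace(" ", "▁"),  # encode interior spaces as the marker
    normalizers.NFKC(),             # Unicode compatibility normalization
])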

Pre-tokenizers

  • Components:
    • pre_tokenizers.Sequence([Punctuation()]): Applies a sequence of pre-tokenizers, such as identifying and separating punctuation, which helps in parsing the text more accurately before the main tokenization step.
  • Impact: Pre-tokenizers refine the input text by identifying and isolating components like punctuation, which enhances the tokenizer's ability to accurately segment text into tokens.
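
In code, this amounts to something like the following; pre_tokenize_str is handy for inspecting the resulting splits:

from tokenizers import pre_tokenizers

# Split punctuation off before the main BPE step.
pre_tokenizer = pre_tokenizers.Sequence([pre_tokenizers.Punctuation()])
print(pre_tokenizer.pre_tokenize_str("Hello, world!"))
# e.g. [('Hello', (0, 5)), (',', (5, 6)), (' world', (6, 12)), ('!', (12, 13))]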

Post-processing Template

  • Components:
    • TemplateProcessing(single, pair, special_tokens): Defines templates for processing single inputs and pairs of inputs, incorporating special tokens at specified positions.
  • Impact: Post-processing templates dictate how tokenized sequences are structured, ensuring that special tokens are correctly placed. This is crucial for tasks that require understanding the relationship between sequences (e.g., question-answering), as it impacts how the model interprets sequence boundaries and relationships.
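
A hypothetical Llama2-style template (the exact templates and special-token IDs used by this tokenizer are assumptions here) would look like:

from tokenizers import processors

# Prepend the BOS token to single sequences and to each member of a pair.
post_processor = processors.TemplateProcessing(
    single="<s> $A",
    pair="<s> $A <s> $B",
    special_tokens=[("<s>", 1)],  # (token, id); id 1 is an assumption
)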

Dynamic Token Adjustments

  • Dynamic Token Selection: Employs statistical analysis to dynamically adjust the minimum frequency of tokens and identify rare but significant tokens (dynamic_tokens) for inclusion, improving model performance on specific domains or datasets.
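
The selection criteria are not published; as a purely illustrative sketch, rare-but-recurring words could be picked from corpus statistics like this (the function name and thresholds are hypothetical):

from collections import Counter

def select_dynamic_tokens(sentences, min_frequency=5, budget=100):
    """Hypothetical sketch: pick rare but recurring words to force into the vocabulary."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Words that fall below the trainer's min_frequency but still occur more than once.
    rare = [(word, count) for word, count in counts.items() if 1 < count < min_frequency]
    rare.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in rare[:budget]]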

Training and Evaluation Mechanism

  • Efficient Training: Leverages a custom training loop that merges datasets, applies dynamic token adjustments, and trains the tokenizer on merged datasets, prioritizing efficiency and effectiveness.
  • Evaluation: Includes a sophisticated evaluation mechanism to assess tokenizer performance using a holdout dataset, focusing on round-trip errors and tokenization loss, ensuring the tokenizer's reliability and accuracy.

Advanced Tokenization Techniques

  • Byte-Pair Encoding (BPE) with Custom Extensions: Enhances the standard BPE algorithm with byte fallback, dropout, and unknown token fusion, addressing common tokenization challenges and improving token representation.
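
In terms of the tokenizers library, these extensions correspond to BPE options along these lines (the dropout value is an assumption):

from tokenizers import Tokenizer, models

# BPE model with the extensions described above.
model = models.BPE(
    unk_token="<unk>",
    dropout=0.1,         # BPE-dropout (assumed rate)
    byte_fallback=True,  # fall back to byte-level tokens for characters outside the vocabulary
    fuse_unk=True,       # fuse consecutive unknown tokens into a single <unk>
)
tokenizer = Tokenizer(model)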

Sophisticated Normalization and Pre-tokenization

  • Implements a sequence of normalization and pre-tokenization steps that prepare text data for tokenization, improving the model's ability to understand and process varied textual inputs.

Comprehensive Post-processing

  • Template Processing: Utilizes template processing for single and pair tokenization tasks, incorporating special tokens effectively and ensuring consistent tokenization patterns.

Advantages Over Regular Approaches

  • Dynamic Dropout: The training process does not use a predefined dropout rate; instead, it calculates a tailored value on the fly, based on the current training dataset. This helps the tokenizer model generalize better by putting more weight on context rather than specifics, which is beneficial at a later stage when fine-tuning an LLM with this tokenizer.
  • Dynamic Adaptation: The ability to dynamically adjust tokenization parameters (like min_frequency) based on dataset analysis ensures that the tokenizer remains effective across different text domains.
  • Sophisticated Evaluation: The inclusion of a detailed evaluation mechanism enables continuous assessment and improvement of the tokenizer's performance, ensuring high accuracy and reliability.
  • Number Bucketing: Numbers in the text are categorized into predefined "buckets" based on their value. The bucketing process divides the number space into several ranges (or buckets) and assigns each number to a specific bucket, with each bucket represented by its own token following a specific convention. Common years (e.g., 1900-2025) and ages (e.g., 1-100) are exceptions to this rule and are represented the way they are written. This reduces sparsity and improves generalization without overfitting to specific values.
  • URL Replacement: URLs in the text are identified using a regular expression for common URL patterns and replaced with a special token <url>. Replacing varied URLs with a single token prevents the model from overfitting to specific web addresses, which are usually not relevant to understanding the text's general context. URLs can also introduce a vast number of unique tokens into the vocabulary, so collapsing them into one token significantly simplifies it and lets the model focus on the actual textual content. A sketch of both preprocessing steps follows this list.
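
The exact bucket boundaries, bucket token names, and URL pattern are not published; the following is a hedged sketch with assumed values:

import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # simplified URL pattern (assumption)

def bucket_number(n: int) -> str:
    """Map a number to a bucket token; common years and ages stay as written (bucket names assumed)."""
    if 1900 <= n <= 2025 or 1 <= n <= 100:
        return str(n)
    if n < 1_000:
        return "<num_3d>"
    if n < 1_000_000:
        return "<num_6d>"
    return "<num_big>"

def preprocess(text: str) -> str:
    text = URL_RE.sub("<url>", text)  # collapse every URL into the <url> token
    return re.sub(r"\d+", lambda m: bucket_number(int(m.group())), text)

print(preprocess("През 1999 г. сайтът https://example.bg имаше 25000 посетители."))
# През 1999 г. сайтът <url> имаше <num_6d> посетители.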

Tokenizer Evaluation Methodology

The evaluation of the tokenizer is crucial to ensure its effectiveness and accuracy. The approach used for evaluation relies on assessing the tokenizer's ability to accurately encode and decode textual data, aiming to measure how well the tokenizer can reproduce the original text after a round-trip of tokenization and detokenization. Here's a detailed explanation of how the loss function works and the significance of the evaluation scores:

Loss Function Breakdown

  1. Tokenization: Each sentence in the dataset is encoded to token IDs using tokenizer.encode(example).ids. This step converts text into a sequence of tokens that the model can understand.

  2. Detokenization: The token IDs are then decoded back into text using tokenizer.decode(tokenizer.encode(original).ids). This step attempts to reconstruct the original text from the token IDs.

  3. Distance Calculation: For texts that do not match, the Levenshtein distance (a measure of the difference between two sequences) is calculated between the original and detokenized text, normalized by the original text length. This distance provides a quantitative measure of how much the texts differ.

  4. Loss: The overall loss is computed as the average of these normalized distances, i.e., the total distance divided by the number of round-trip errors, providing a single metric that reflects the tokenizer's accuracy in reproducing the original text.
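
Putting the steps above together, the evaluation loss can be sketched as follows (using a plain dynamic-programming edit distance; the actual script may instead rely on a dedicated Levenshtein library):

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def round_trip_loss(tokenizer, sentences):
    """Average normalized edit distance over sentences that do not round-trip exactly."""
    total_distance, round_trip_errors = 0.0, 0
    for original in sentences:
        decoded = tokenizer.decode(tokenizer.encode(original).ids)
        if decoded != original:
            round_trip_errors += 1
            total_distance += levenshtein(original, decoded) / len(original)
    return total_distance / round_trip_errors if round_trip_errors else 0.0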

Evaluation Results

The evaluation results after training and testing the tokenizer with 5,000 random sentences not included in the training corpus are summarized in the table below:

Version | Vocab Size | Loss                | Training Time (seconds)
v1.1    | 32,000     | 0.00791045752872809 | 9188.8694

Interpreting the Evaluation Score

  • Vocab Size: Indicates the tokenizer's vocabulary size. A larger vocabulary can potentially capture more nuances in the text but might also increase the risk of overfitting or inefficiency. The current vocabulary size is compatible with existing Llama2-based models.

  • Loss: The average normalized Levenshtein distance across all errors. Evaluating on sentences not included in the training corpus and achieving such a low loss value highlights the tokenizer's strong generalization capability.

Mathematical Significance of the Evaluation Score

By the definition of the normalized Levenshtein distance, this means that, on average, the edits needed to recover the original text from the detokenized output account for about 0.79% of the length of the original texts.

The loss value of 0.00791045752872809 suggests that the tokenizer performs well in maintaining the integrity of the text through the tokenization process and sustains a high level of fidelity. Mathematically, this low loss score signifies a high degree of similarity between the original and detokenized texts, demonstrating the tokenizer's effectiveness. The process of detokenization, converting tokenized representations back into their original text form, does not always guarantee a 1:1 exact match to the original text. While the goal of detokenization is to reconstruct the original text as closely as possible, minor differences can occur. These variances are generally acceptable and sometimes inevitable. Most NLP models and applications can tolerate some level of discrepancy between original and processed texts.

This approach represents a significant advancement over regular tokenization methods, offering a more adaptable, efficient, and accurate solution for preparing text data for machine learning models, especially those compatible with Llama2.

Credits and Dataset Acknowledgments

When utilizing datasets from the Hugging Face 🤗 Datasets library for training models, it's crucial to acknowledge the contributions of the authors and organizations that have made these resources available. Below is a formatted credits section recognizing the datasets used:

Dataset Acknowledgments

  • OSCAR Dataset:

    • Source: OSCAR "unshuffled_deduplicated_bg"
    • Description: A large-scale corpus obtained by language classification and filtering of the Common Crawl corpus.
    • Authors: The OSCAR team from INRIA.
  • Bulgarian Poems Dataset:

  • BG OPUS100 Processed Dataset:

  • Reasoning BG Dataset:

    • Source: reasoning_bg "philosophy-12th"
    • Description: Dataset containing philosophical questions for reasoning tasks.
    • Authors: Momchil Hardalov, Ivan Koychev, Preslav Nakov.
  • Clickbait News BG Dataset:

    • Source: clickbait_news_bg
    • Description: Dataset for detecting clickbait and fake news in Bulgarian.
    • Authors: Bulgarian Association of PR Agencies.

Training Corpus

  • Records: Trained on over 700,000 Bulgarian sentences
  • Total tokens: ~21,000,000 tokens

Training Code Repository

Evaluation Corpus

  • Records: Evaluated on over 5,000 Bulgarian sentences
  • Total tokens: ~15,000 tokens

Acknowledging the sources of datasets and estimating the volume of training data are crucial steps in ensuring transparency and reproducibility in machine learning projects. These acknowledgments not only give credit where it's due but also provide insights into the scale and nature of the data used for model training.
