Tokenizer v1.1
- README.md +3 -4
- tokenizer.json +0 -0
- tokenizer_config.json +47 -8
README.md
CHANGED
@@ -127,7 +127,6 @@ The tokenizer training approach showcases a sophisticated and advanced methodology
 
 - **Dynamic Dropout**: The tokenizer training process doesn't use a predefined `dropout`; instead, it calculates a specifically tailored value on the fly, based on the current training dataset. This helps the tokenizer model generalize better by putting more weight on context rather than specifics, which pays off at a later stage when fine-tuning an LLM with this tokenizer. (One possible derivation is sketched after this hunk.)
 - **Dynamic Adaptation**: The ability to dynamically adjust tokenization parameters (like `min_frequency`) based on dataset analysis ensures that the tokenizer remains effective across different text domains.
-- **Dynamic Tokens**: The dataset is divided into chunks, and each chunk is analyzed to count the occurrences of each token. This is done across all chunks in parallel, and the results are aggregated. A threshold (e.g., `0.0005`) is applied to identify tokens that constitute a small fraction of the total words in the dataset. Tokens below this threshold are considered rare or dynamic. From these dynamic tokens, the top `k` tokens with the highest counts (but still under the threshold) are selected and added manually to the tokenizer's vocabulary so that the tokenizer can focus on the most relevant rare tokens. Dynamic tokens often include terminology, names, or concepts specific to a dataset's domain. Their inclusion in the tokenizer's vocabulary allows the LLM to capture and understand these unique elements more effectively, leading to improved performance in tasks requiring deep domain knowledge or contextual nuance.
 - **Sophisticated Evaluation**: The inclusion of a detailed evaluation mechanism enables continuous assessment and improvement of the tokenizer's performance, ensuring high accuracy and reliability.
 - **Number Bucketing**: Numbers in the text are categorized into predefined "buckets" based on their value. The bucketing process divides the number space into several ranges (buckets) and assigns each number to a specific bucket. Each bucket is represented by its own token that follows a specific convention. Common years (e.g., 1900-2025) and ages (e.g., 1-100) are exceptions to this rule and are represented the way they are written. This reduces sparsity and improves generalization without overfitting to specific values.
 - **URL Replacement**: URLs in the text are identified using a regular expression for common URL patterns and replaced with a special token `<url>`. Replacing varied URLs with a single token prevents the model from overfitting to specific web addresses, which are usually not relevant to understanding the text's general context. URLs can introduce a vast number of unique tokens into the vocabulary, so replacing them with a single token significantly simplifies the model's vocabulary. By abstracting away the specifics of URLs, models can focus more on the actual textual content. (Both this step and the number bucketing step are illustrated in the preprocessing sketch after this hunk.)
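
The README does not spell out how the dropout value or `min_frequency` are actually computed, so the following is only a minimal sketch of the idea: derive both from simple dataset statistics and pass them to a Hugging Face `tokenizers` BPE trainer. The type/token-ratio heuristic, the 0.01-0.1 dropout range, and the corpus-size scaling of `min_frequency` are assumptions made for illustration, not the formulas used for this tokenizer.

```python
# Hypothetical sketch only: derive a dataset-dependent BPE `dropout` and
# `min_frequency` and pass them to a Hugging Face `tokenizers` BPE trainer.
# The heuristics below (type/token ratio -> dropout, corpus size -> frequency
# floor) are illustrative assumptions, not the formulas used by this repo.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def derive_params(texts):
    words = [w for t in texts for w in t.split()]
    total, unique = len(words), len(set(words))
    ttr = unique / max(total, 1)                      # type/token ratio of the dataset
    dropout = min(0.1, max(0.01, round(ttr / 5, 3)))  # keep BPE dropout in a sane range
    min_frequency = max(2, total // 1_000_000)        # scale the frequency floor with size
    return dropout, min_frequency


def train_tokenizer(texts, vocab_size=32_000):
    dropout, min_frequency = derive_params(texts)
    tokenizer = Tokenizer(BPE(unk_token="<unk>", dropout=dropout))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["<unk>", "<s>", "</s>", "<url>"],
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```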
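For the number bucketing and URL replacement steps above, a minimal preprocessing sketch could look like the following. The `<url>` token, the 1900-2025 year range, and the 1-100 age range come from the README; the exact URL regex, the power-of-ten bucket boundaries, and the `<num_1eN>` token naming are assumptions made for this example.

```python
# Illustrative preprocessing for number bucketing and URL replacement. The
# `<url>` token and the year/age exceptions come from the README; the URL
# regex, the power-of-ten buckets, and the `<num_1eN>` names are assumptions.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUM_RE = re.compile(r"\b\d+\b")


def bucket_number(match: re.Match) -> str:
    value = int(match.group())
    # Exceptions: common years and ages are kept the way they are written.
    if 1900 <= value <= 2025 or 1 <= value <= 100:
        return match.group()
    # Otherwise map the number to a power-of-ten bucket token (assumed scheme).
    return f"<num_1e{len(str(value)) - 1}>"


def preprocess(text: str) -> str:
    text = URL_RE.sub("<url>", text)        # collapse every URL into one token
    return NUM_RE.sub(bucket_number, text)  # replace remaining numbers with bucket tokens


print(preprocess("Visit https://example.com today: 250000 users since 1999, ages 20-35."))
# -> "Visit <url> today: <num_1e5> users since 1999, ages 20-35."
```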
@@ -152,7 +151,7 @@ The evaluation results after training and testing the tokenizer with 5,000 random
 
 | Version | Vocab Size | Loss | Training Time (seconds) |
 |---------|------------| ----------------|-------------------------|
-| v1
+| v1.1 | 32,000 | 0.00791045752872809 | 9188.8694 |
 
 ### **Interpreting the Evaluation Score**
 
@@ -162,9 +161,9 @@ The evaluation results after training and testing the tokenizer with 5,000 random
 
 ### **Mathematical Significance of the Evaluation Score**
 
-From `Levenshtein distance` definition => **On average, the necessary edits to recover the original text from the detokenized output account for `
+From the `Levenshtein distance` definition => **On average, the edits needed to recover the original text from the detokenized output account for `0.79%` of the length of the original texts.**
 
-The loss value of `0.
+The loss value of `0.00791045752872809` suggests that the tokenizer performs well in maintaining the integrity of the text through the tokenization process and sustains a high level of fidelity. Mathematically, this low loss score signifies a high degree of similarity between the original and detokenized texts, demonstrating the tokenizer's effectiveness. Detokenization, i.e., converting tokenized representations back into their original text form, does not always guarantee a 1:1 exact match to the original text. While the goal of detokenization is to reconstruct the original text as closely as possible, minor differences can occur; these variances are generally acceptable and sometimes inevitable. (A sketch of this round-trip evaluation follows the README diff.)
 
 Most NLP models and applications can tolerate some level of discrepancy between original and processed texts.
 
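The loss above is defined via the Levenshtein distance between original and detokenized text. Below is a minimal sketch of such a round-trip evaluation, assuming a `tokenizers`-style object with `encode`/`decode` and normalizing each edit distance by the original text length before averaging; the exact normalization and sampling details of the real evaluation are not given in the README. Note that `0.00791045752872809 * 100 ≈ 0.79`, which is where the `0.79%` figure comes from.

```python
# Sketch of a Levenshtein-based round-trip loss: the average edit distance
# between original and detokenized text, normalized by the original length.
# The normalization/averaging details are assumptions; the README only states
# that the loss is derived from the Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def round_trip_loss(tokenizer, texts) -> float:
    # Assumes a `tokenizers.Tokenizer`-style object exposing encode()/decode().
    distances = []
    for text in texts:
        decoded = tokenizer.decode(tokenizer.encode(text).ids)
        distances.append(levenshtein(text, decoded) / max(len(text), 1))
    return sum(distances) / len(distances)

# A loss of 0.00791045752872809 means the edits needed to recover the original
# text amount to about 0.79% of its length on average (0.0079... * 100 ≈ 0.79).
```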
tokenizer.json
CHANGED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
CHANGED
@@ -1,9 +1,48 @@
 {
-
-
-
-
-
-
-
-
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<unk>",
+    "<s>",
+    "</s>"
+  ],
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "legacy": true,
+  "loss_score": 0.00791045752872809,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<s>",
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false,
+  "use_fast": true,
+  "version": 1.1
+}
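
To sanity-check the updated `tokenizer_config.json`, the tokenizer can be loaded with `transformers`; the repo path below is a placeholder. With `"add_bos_token": true` and `"add_eos_token": false`, encoded sequences should start with `<s>` and not end with `</s>`, and `<s>` doubles as the pad token. The non-standard keys (`loss_score`, `version`) appear to be project-specific metadata rather than settings the library consumes.

```python
# Quick sanity check of the new config with `transformers`; "path/to/this/repo"
# is a placeholder for the actual repo id or a local clone.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/this/repo", use_fast=True)

print(tok.bos_token, tok.eos_token, tok.unk_token, tok.pad_token)  # <s> </s> <unk> <s>

ids = tok("hello world").input_ids
# "add_bos_token": true / "add_eos_token": false -> the sequence starts with
# <s> and does not end with </s>.
print(ids[0] == tok.bos_token_id, ids[-1] != tok.eos_token_id)  # True True
```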