kl3m-005-multi-word-example-32k tokenizer

The kl3m-005-multi-word-example-32k tokenizer is an experimental domain-specific tokenizer that introduces multi-word token learning by using random whitespace pre-tokenization during training. This allows the tokenizer to learn complete multi-word expressions as single tokens, improving compression and semantic retention for domain-specific terminology.

This tokenizer was trained on a stratified sample of nearly 4M documents across general, legal, and financial domains from the kl3m-data project, including American English, British English, Spanish, German, French, Italian, and other common EU languages.

Model Details

Summary

  • Vocabulary: 32,768
  • Tokenizer type: BPE with multi-word capability
  • Special token support: Both causal and masked language modeling
  • Language(s) (NLP): Primarily English, Spanish, German, and French, with a small percentage of other EU languages.
  • Data Sources: See kl3m-data repository.
  • Developed by: ALEA Institute.
  • License: CC-BY 4.0

Model Description

The kl3m-005-multi-word-example-32k tokenizer introduces a novel technique for multi-word token learning that avoids the complexity of previous multi-word tokenization approaches. Instead of post-processing or complex token merging strategies, this tokenizer uses specialized pre-tokenization during training that randomly decides whether to split on whitespace or not.

This tokenizer is notable for a number of reasons:

Multi-Word Token Learning

The key innovation in this tokenizer is the implementation of random whitespace pre-tokenization during training. This technique:

  • Uses the RandomWhitespaceSplit pre-tokenizer, which probabilistically decides whether to split on whitespace
  • Enables learning of multi-word units as single tokens (e.g., "of the", "in the", "United States")
  • Improves compression and semantic coherence for common multi-word expressions
  • Doesn't require complex hyperparameter transitions or multi-phase training

This implementation is based on the new pre-tokenizers added to the Hugging Face tokenizers library that enable multi-word token learning. For more information, see Hugging Face PR #1753.
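
The snippet below is a conceptual sketch of this idea rather than the actual KL3M training code: a plain-Python stand-in for the pre-tokenization step that, at each space, randomly decides whether to split or to keep the neighboring words together, so a BPE trainer would occasionally see multi-word chunks. The function name and the keep_prob parameter are illustrative assumptions; the real tokenizer relies on the RandomWhitespaceSplit pre-tokenizer in the tokenizers library.

import random

def random_whitespace_pretokenize(text, keep_prob=0.3, seed=None):
    """Split text on spaces, but randomly keep some spaces so that
    adjacent words stay in the same pre-token chunk."""
    rng = random.Random(seed)
    words = text.split(" ")
    chunks = [words[0]]
    for word in words[1:]:
        if rng.random() < keep_prob:
            # Keep this boundary: the space stays inside the chunk, so a
            # BPE trainer would see a multi-word unit such as " of the".
            chunks[-1] += " " + word
        else:
            chunks.append(" " + word)
    return chunks

# Each run can group different neighbors, e.g. [' Supreme Court', ' of the', ...]
print(random_whitespace_pretokenize("The Supreme Court of the United States", keep_prob=0.5))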

Domain Specific

As with previous KL3M tokenizers, this tokenizer was trained on a large corpus of financial and legal text. It has not seen common general-purpose pretraining sources such as Wikipedia or Common Crawl, making it highly specialized for its target domains.

Large Added Token Set

Similar to other KL3M tokenizers, we included a large number of deterministic "whole" tokens in the vocabulary:

  • HTML tags like <span
  • Common Markdown elements like # and ##
  • Legal enumerations like (a)
  • Academic and legal citations
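
As a quick, hedged check on which of these entries appear in the released vocabulary, the sketch below looks a few of them up with token_to_id, which returns None when a string is not a single vocabulary entry. The candidate strings are illustrative; the actual hits depend on the published vocabulary file.

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# token_to_id returns None when a string is not a single vocabulary entry,
# so the exact hits depend on the released vocabulary.
for candidate in ["<span", "##", "(a)"]:
    print(repr(candidate), "->", tokenizer.token_to_id(candidate))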

Special Tokens

For both training and inference efficiency, we included special tokens suitable for both causal and masked language modeling tasks:

  • <|start|>: 0
  • <|end|>: 1
  • <|pad|>: 2
  • <|unk|>: 3
  • <|sep|>: 4
  • <|cls|>: 5
  • <|mask|>: 6
  • <|system|>: 7
  • </|system|>: 8
  • <|user|>: 9
  • </|user|>: 10
  • <|instruction|>: 11
  • </|instruction|>: 12
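
The sketch below resolves these reserved IDs with token_to_id and encodes a simple prompt that uses the chat/instruction markers. The prompt layout itself is an assumption for illustration; the card does not prescribe a specific template.

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

# Look up the reserved special-token IDs listed above.
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|mask|>"]:
    print(token, "->", tokenizer.token_to_id(token))

# Hypothetical causal-LM prompt using the system/user markers.
prompt = "<|start|><|system|>You are a legal assistant.</|system|><|user|>Summarize the clause.</|user|><|end|>"
print(tokenizer.encode(prompt).ids)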

Examples

Here's an example of how this tokenizer produces different token sequences compared to standard tokenizers:

Original text: The Supreme Court of the United States has ruled that free speech is protected under the First Amendment.

Standard BPE tokenization:
["The", " Supreme", " Court", " of", " the", " United", " States", " has", " ruled", " that", " free", " speech", " is", " protected", " under", " the", " First", " Amendment", "."]

kl3m-005-multi-word-example-32k:
["The", " Supreme Court", " of the", " United States", " has", " ruled", " that", " free speech", " is", " protected", " under the", " First Amendment", "."]

Notice how the multi-word tokenizer captures complete phrases like "Supreme Court", "of the", "United States", "free speech", and "First Amendment" as single tokens, improving compression and preserving semantic units.
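
To reproduce a comparison like this yourself, the sketch below tokenizes the same sentence with this tokenizer and with kl3m-004-128k-cased (mentioned under Recommendations) as a more traditional baseline; exact token strings and counts will depend on the released vocabularies.

from tokenizers import Tokenizer

text = ("The Supreme Court of the United States has ruled that free speech "
        "is protected under the First Amendment.")

for repo in ("alea-institute/kl3m-005-multi-word-example-32k",
             "alea-institute/kl3m-004-128k-cased"):
    tokens = Tokenizer.from_pretrained(repo).encode(text).tokens
    print(repo, "->", len(tokens), "tokens")
    print(tokens)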

Replication

The entire data collection and preprocessing pipeline is being made available as part of the ALEA Institute KL3M project.

The source code used to train the tokenizer is available on GitHub at: https://github.com/alea-institute/kl3m-tokenizers

Uses

This tokenizer is intended for English, Spanish, German, or French language text in professional contexts such as legal and financial documents. It's particularly useful for applications where preserving multi-word expressions is important for semantic understanding and generation.

Recommendations

The kl3m-005-multi-word-example-32k tokenizer is recommended for:

  • Legal or financial document processing where multi-word terms are common
  • Applications where token compression is critical
  • Research into multi-word token approaches
  • Tasks requiring better semantic coherence in tokenization

For more traditional tokenization, consider kl3m-004-128k-cased or the other KL3M tokenizers.

How to Get Started with the Model

Use the code below to get started with the model:

from tokenizers import Tokenizer

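# Load the tokenizer directly from the Hugging Face Hub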
tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-005-multi-word-example-32k')

# Example showing multi-word tokens
text = "The Supreme Court of the United States has ruled that free speech is protected under the First Amendment."
encoded = tokenizer.encode(text)
tokens = encoded.tokens

print(f"Token count: {len(tokens)}")
print("Tokens:", tokens)

Citation

Tokenizer and dataset publications are pending.

Contact

For any questions, please contact the ALEA Institute at hello@aleainstitute.ai, or open an issue on this repository or on GitHub.
