kl3m-005-multi-word-example-32k tokenizer
The kl3m-005-multi-word-example-32k tokenizer is an experimental domain-specific tokenizer that introduces multi-word token learning by using random whitespace pre-tokenization during training. This allows the tokenizer to learn complete multi-word expressions as single tokens, improving compression and semantic retention for domain-specific terminology.
This tokenizer was trained on a stratified sample of nearly 4M documents across general, legal, and financial domains from the kl3m-data project, including American English, British English, Spanish, German, French, Italian, and other common EU languages.
Model Details
Summary
- Vocabulary: 32,768
- Tokenizer type: BPE with multi-word capability
- Special token support: Both causal and masked language modeling
- Language(s) (NLP): Primarily English, Spanish, German, French, with a small percentage of other EU languages.
- Data Sources: See the kl3m-data repository.
- Developed by: ALEA Institute
- License: CC-BY 4.0
Model Description
The kl3m-005-multi-word-example-32k tokenizer introduces a novel technique for multi-word token learning that avoids the complexity of previous multi-word tokenization approaches. Instead of post-processing or complex token merging strategies, this tokenizer uses specialized pre-tokenization during training that randomly decides whether to split on whitespace or not.
This tokenizer is notable for a number of reasons:
Multi-Word Token Learning
The key innovation in this tokenizer is the implementation of random whitespace pre-tokenization during training. This technique:
- Uses the RandomWhitespaceSplit pre-tokenizer, which probabilistically decides whether to split on whitespace
- Enables learning of multi-word units as single tokens (e.g., "of the", "in the", "United States")
- Improves compression and semantic coherence for common multi-word expressions
- Doesn't require complex hyperparameter transitions or multi-phase training
This implementation is based on the new pre-tokenizers added to the Hugging Face tokenizers library that enable multi-word token learning. For more information, see Hugging Face PR #1753.
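To make the mechanism concrete, here is a minimal sketch of random whitespace pre-tokenization written against the Python custom pre-tokenizer hook in the tokenizers library. It is only an illustration of the idea, not the actual RandomWhitespaceSplit component from the PR above; the class name RandomWhitespacePreTokenizer and the split_probability parameter are invented for this example.

```python
# Minimal sketch of random whitespace pre-tokenization using the Python
# custom pre-tokenizer hook in the `tokenizers` library. This approximates
# the idea behind RandomWhitespaceSplit; the class name and the
# `split_probability` parameter are illustrative, not the actual implementation.
import random
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class RandomWhitespacePreTokenizer:
    """Split on whitespace with probability `split_probability`, so some
    adjacent words stay together and can be learned as multi-word tokens."""

    def __init__(self, split_probability: float = 0.5):
        self.split_probability = split_probability

    def _random_split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        text = str(normalized)
        # Character offsets of whitespace characters we could split on.
        boundaries = [m.start() for m in re.finditer(r"\s", text)]
        # Keep only a random subset of boundaries as actual split points;
        # unsplit whitespace stays inside a multi-word pre-token.
        cuts = [b for b in boundaries if random.random() < self.split_probability]
        spans, start = [], 0
        for cut in cuts:
            spans.append(normalized[start:cut])
            start = cut  # leading whitespace stays attached to the next span
        spans.append(normalized[start:len(text)])
        return [s for s in spans if len(str(s)) > 0]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self._random_split)


# Attach before training so BPE merges can cross unsplit word boundaries:
# tokenizer.pre_tokenizer = PreTokenizer.custom(RandomWhitespacePreTokenizer(0.7))
```

Note that a custom Python pre-tokenizer like this cannot be serialized with the saved tokenizer, which is one practical reason a native component such as the one referenced above is preferable in practice.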
Domain Specific
As with previous KL3M tokenizers, this tokenizer was trained on a large corpus of financial and legal text. It has not seen common general-purpose pretraining sources such as Wikipedia or Common Crawl, making it highly specialized for its target domains.
Large Added Token Set
Similar to other KL3M tokenizers, we included a large number of deterministic "whole" tokens in the vocabulary:
- HTML tags like <span
- Common Markdown elements like # and ##
- Legal enumerations like (a)
- Academic and legal citations
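As a rough sketch of how such deterministic "whole" tokens can be baked into a vocabulary with the tokenizers library, the snippet below registers a handful of entries mirroring the examples above. The actual kl3m-005 token list is much larger and is defined in the kl3m-tokenizers repository; the tokenizer object and token list here are purely illustrative.

```python
# Illustrative only: registering deterministic "whole" tokens with the
# `tokenizers` library. The real kl3m-005 added-token list is much larger.
from tokenizers import Tokenizer, AddedToken, models

# A fresh BPE tokenizer used purely for illustration.
tok = Tokenizer(models.BPE(unk_token="<|unk|>"))

whole_tokens = [
    AddedToken("<span", normalized=False),  # HTML tag fragment
    AddedToken("#", normalized=False),      # Markdown heading markers
    AddedToken("##", normalized=False),
    AddedToken("(a)", normalized=False),    # legal enumeration
]
tok.add_tokens(whole_tokens)
```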
Special Tokens
For both training and inference efficiency, we included special tokens suitable for both causal and masked language modeling tasks:
- <|start|>: 0
- <|end|>: 1
- <|pad|>: 2
- <|unk|>: 3
- <|sep|>: 4
- <|cls|>: 5
- <|mask|>: 6
- <|system|>: 7
- </|system|>: 8
- <|user|>: 9
- </|user|>: 10
- <|instruction|>: 11
- </|instruction|>: 12
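You can verify these IDs against the published tokenizer with the token_to_id method:

```python
# Check the special-token IDs listed above against the published tokenizer.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")

special_tokens = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|sep|>",
                  "<|cls|>", "<|mask|>", "<|system|>", "</|system|>",
                  "<|user|>", "</|user|>", "<|instruction|>", "</|instruction|>"]

for token in special_tokens:
    print(token, tokenizer.token_to_id(token))
```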
Examples
Here's an example of how this tokenizer produces different token sequences compared to standard tokenizers:
Original text: The Supreme Court of the United States has ruled that free speech is protected under the First Amendment.
Standard BPE tokenization:
["The", " Supreme", " Court", " of", " the", " United", " States", " has", " ruled", " that", " free", " speech", " is", " protected", " under", " the", " First", " Amendment", "."]
kl3m-005-multi-word-example-32k:
["The", " Supreme Court", " of the", " United States", " has", " ruled", " that", " free speech", " is", " protected", " under the", " First Amendment", "."]
Notice how the multi-word tokenizer captures complete phrases like "Supreme Court", "of the", "United States", "free speech", and "First Amendment" as single tokens, improving compression and preserving semantic units.
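To reproduce a comparison like this yourself, you can encode the same sentence with this tokenizer and a conventional single-word KL3M tokenizer such as kl3m-004-128k-cased (mentioned under Recommendations below). The repository id assumed for the latter is alea-institute/kl3m-004-128k-cased, and the exact token boundaries you see may differ from the illustration above.

```python
# Compare token counts for the example sentence using the multi-word tokenizer
# and a conventional single-word KL3M tokenizer. The kl3m-004 repository id
# below is assumed for illustration.
from tokenizers import Tokenizer

text = ("The Supreme Court of the United States has ruled that "
        "free speech is protected under the First Amendment.")

multi_word = Tokenizer.from_pretrained("alea-institute/kl3m-005-multi-word-example-32k")
single_word = Tokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")

for name, tok in [("kl3m-005-multi-word-example-32k", multi_word),
                  ("kl3m-004-128k-cased", single_word)]:
    tokens = tok.encode(text).tokens
    print(f"{name}: {len(tokens)} tokens")
    print(tokens)
```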
Replication
The entire data collection and preprocessing pipeline is being made available as part of the ALEA Institute KL3M project.
The source code used to train the tokenizer is available on GitHub at: https://github.com/alea-institute/kl3m-tokenizers
Uses
This tokenizer is intended for English, Spanish, German, or French language text in professional contexts such as legal and financial documents. It's particularly useful for applications where preserving multi-word expressions is important for semantic understanding and generation.
Recommendations
The kl3m-005-multi-word-example-32k tokenizer is recommended for:
- Legal or financial document processing where multi-word terms are common
- Applications where token compression is critical
- Research into multi-word token approaches
- Tasks requiring better semantic coherence in tokenization
For more traditional tokenization, consider the kl3m-004-128k-cased or other KL3M tokenizers.
How to Get Started with the Model
Use the code below to get started with the model:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-005-multi-word-example-32k')

# Example showing multi-word tokens
text = "The Supreme Court of the United States has ruled that free speech is protected under the First Amendment."
encoded = tokenizer.encode(text)
tokens = encoded.tokens

print(f"Token count: {len(tokens)}")
print("Tokens:", tokens)
```
Citation
Tokenizer and dataset publications are pending.
Contact
For any questions, please contact the ALEA Institute at hello@aleainstitute.ai or open an issue on this repository or on GitHub.