toke BPE Tokenizer
A purpose-built 16K BPE tokenizer for the toke programming language, achieving 52% average token reduction vs cl100k_base across 42 benchmark programs.
Key Facts
| Property | Value |
|---|---|
| Vocab size | 16,384 |
| Training data | 25,953 toke programs + 698 loke production modules |
| Average reduction | 52% vs cl100k_base |
| Best case | 76% reduction (simple loop) |
| String handling | Contents replaced with placeholder before training |
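The string-handling row can be illustrated with a small preprocessing sketch. It assumes string literals are double-quoted and use backslash escapes, and it uses a hypothetical placeholder `STR`; the placeholder and literal syntax actually used during training are not stated here, and the sample line is illustrative rather than guaranteed-valid toke.

```python
import re

# Hypothetical placeholder; the real one used in training is not documented here.
PLACEHOLDER = "STR"

# Assumption: literals are double-quoted and may contain backslash escapes.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')


def mask_strings(source: str) -> str:
    """Replace the contents of every double-quoted literal with PLACEHOLDER."""
    return STRING_LITERAL.sub(f'"{PLACEHOLDER}"', source)


# Illustrative input only; not taken from the training corpus.
sample = 'm=app;f=main():i64{print("hello, world");<0};'
print(mask_strings(sample))  # m=app;f=main():i64{print("STR");<0};
```

Masking literals this way keeps free-form text from consuming vocabulary slots, so the learned merges concentrate on language syntax rather than string contents.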
This is NOT the model's tokenizer
The toke code generation model (karwalski/toke) uses Qwen's 151K-vocabulary tokenizer internally. This tokenizer is a measurement tool instead: it shows how efficiently toke code could be tokenized by a future toke-native model.
Usage
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer_v03.json")  # load the trained BPE vocabulary and merges
code = 'm=fib;f=fib(n:i64):i64{if(n<2){<n};<fib(n-1)+fib(n-2)};'
result = tok.encode(code)
print(f"{len(result.ids)} tokens")  # 19 tokens (vs 49 with cl100k_base)
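The reduction percentages quoted on this card can be related to raw token counts with a small helper. This is a sketch that assumes "reduction" means the fraction of baseline tokens saved; the function name `token_reduction` is illustrative, and whether the 52% average is a mean of per-program values or an aggregate over all programs is not stated here.

```python
def token_reduction(custom_tokens: int, baseline_tokens: int) -> float:
    """Return the percentage of baseline tokens saved by the custom tokenizer."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (1.0 - custom_tokens / baseline_tokens)


# Counts from the usage example above: 19 tokens here vs 49 with cl100k_base.
print(f"{token_reduction(19, 49):.1f}% reduction")  # 61.2% reduction
```

Under this definition the fib example sits above the 52% average and below the 76% best case.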
Interactive Demo
Try the tokenizer in your browser at tokelang.dev/tokenizer: see token boundaries highlighted with colours, side-by-side with cl100k.
Links
- tokelang.dev: Project website
- Token comparison (42 examples)
- GitHub: toke (compiler and spec)