Fast ByteLevel BPE Tokenizer
A fast ByteLevel BPE tokenizer trained on Modal using a mixed English, Spanish, code, Wikipedia, and educational web corpus.
Tokenizer trained in 3.50 minutes.
Overview
This tokenizer was trained from scratch using Hugging Face tokenizers with a ByteLevel BPE setup.
Training Stats
| Field | Value |
|---|---|
| Tokenizer type | ByteLevel BPE |
| Vocab size | 32,000 |
| Target texts | 700,000 |
| Elapsed minutes | 3.50 |
| Training platform | Modal |
| Output file | tokenizer.json |
Dataset Mix
| Share | Dataset | Config | Split | Column |
|---|---|---|---|---|
| 50% | allenai/c4 |
en |
train |
text |
| 20% | HuggingFaceFW/fineweb-edu |
None |
train |
text |
| 10% | wikimedia/wikipedia |
20231101.en |
train |
text |
| 10% | codeparrot/codeparrot-clean |
None |
train |
content |
| 10% | allenai/c4 |
es |
train |
text |
Files
Expected files:
tokenizer-bpe-32k/
├── tokenizer.json
├── metadata.json
├── README.md
├── vocab.json # optional, if exported
└── merges.txt # optional, if exported
tokenizer.json is the main all-in-one tokenizer file.
vocab.json and merges.txt are optional classic BPE files. Some older GPT-2/RoBERTa-style tools may ask for them.
Install
pip install tokenizers
Load Tokenizer
from tokenizers import Tokenizer
tok = Tokenizer.from_file("./tokenizer-bpe-32k/tokenizer.json")
enc = tok.encode("Hello!")
print(enc.tokens)
print(enc.ids)
Example output:
['ĠHello', '!']
[25831, 5]
Token Examples
English
Input: Hello!
Tokens: ['ĠHello', '!']
IDs: [25831, 5]
Count: 2
Spanish
Input: Hola amigo, el tokenizer funciona muy bien.
Tokens: ['ĠHol', 'a', 'Ġamigo', ',', 'Ġel', 'Ġtoken', 'izer', 'Ġfunciona', 'Ġmuy', 'Ġbien', '.']
Count: 11
Code
Input:
import torch
print(torch.__version__)
Tokens:
['Ġimport', 'Ġtor', 'ch', 'Ċ', 'print', '(', 'tor', 'ch', '.__', 'version', '__)']
Count: 11
Meme / Emoji Text
Input: BROOOOOOOOOOOOOO 💀💀💀🔥🔥🔥
Count: 24
Emoji-heavy text may split into many byte-level pieces. That is normal for ByteLevel BPE.
Notes
What does Ġ mean?
Ġ marks a space before a token. For example:
ĠHello
means the token represents Hello with a leading space behavior.
What does Ċ mean?
Ċ represents a newline in byte-level tokenization.
Why does decoding add a leading space?
This tokenizer was trained with ByteLevel behavior that adds a prefix space. So encoding Hello! can decode as:
" Hello!"
This is normal for this tokenizer style.
Export vocab.json and merges.txt
If you only have tokenizer.json, you can export classic BPE files like this:
from tokenizers import Tokenizer
out = "./tokenizer-bpe-32k"
tok = Tokenizer.from_file(f"{out}/tokenizer.json")
tok.model.save(out)
This should create:
vocab.json
merges.txt
Modal Download Command
To download the tokenizer folder from the Modal Volume:
modal volume get tokenizer-outputs /tokenizer-bpe-32k .
Quick Test
python - <<'PY'
from tokenizers import Tokenizer
tok = Tokenizer.from_file("./tokenizer-bpe-32k/tokenizer.json")
tests = [
"Hello!",
"The quick brown fox jumps over the lazy dog.",
"Hola amigo, el tokenizer funciona muy bien.",
"def hello_world(): print('hi')",
"BROOOOOOOOOOOOOO 💀🔥"
]
for text in tests:
enc = tok.encode(text)
print("\nTEXT:", text)
print("TOKENS:", enc.tokens)
print("IDS:", enc.ids)
print("COUNT:", len(enc.ids))
PY
License
This tokenizer is released under the Apache License 2.0.
The tokenizer artifacts include:
tokenizer.jsonvocab.jsonmerges.txtmetadata.json
The license applies to the tokenizer files in this repository.
Dataset Attribution
This tokenizer was trained on a mixture of public datasets:
allenai/c4HuggingFaceFW/fineweb-eduwikimedia/wikipediacodeparrot/codeparrot-cleanallenai/c4Spanish config
Users should respect the licenses and terms of the original datasets. The allenai/c4 dataset card lists its license as odc-by, so attribution is especially important.
Final Verdict
This tokenizer is strong for:
- English
- Spanish
- Python/code-like text
- URLs and emails
- General web text
It is weaker for:
- Emoji-heavy text
- CJK scripts
- Korean
- Arabic
- Very meme-specific strings
For a 3.50 minute Modal run, this is a clean good-tier tokenizer.