Fast ByteLevel BPE Tokenizer

A fast ByteLevel BPE tokenizer trained on Modal using a mixed English, Spanish, code, Wikipedia, and educational web corpus.

Tokenizer trained in 3.50 minutes.

Overview

This tokenizer was trained from scratch using Hugging Face tokenizers with a ByteLevel BPE setup.
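
A setup of this kind might look roughly like the sketch below. It is not the script used for this tokenizer: the toy corpus and the tiny `vocab_size` are assumptions chosen so the example runs in seconds, and the actual run may have used special tokens or other options that this README does not list.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Toy corpus (hypothetical); the README reports 700,000 target texts.
corpus = [
    "Hello! This is a tiny English sample.",
    "Hola amigo, el tokenizer funciona muy bien.",
    "import torch\nprint(torch.__version__)",
] * 50

tok = Tokenizer(models.BPE())
# ByteLevel pre-tokenization with a prefix space, consistent with the
# Ġ-prefixed tokens shown in this README.
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=400,  # assumption for the demo; the README reports 32,000
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # all 256 byte symbols
    show_progress=False,
)
tok.train_from_iterator(corpus, trainer=trainer)

enc = tok.encode("Hello!")
print(enc.tokens)
print(repr(tok.decode(enc.ids)))
```

Seeding the trainer with the full byte alphabet means any input can be encoded without an unknown token, which is the main reason to choose a ByteLevel setup.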

Training Stats

Field	Value
Tokenizer type	ByteLevel BPE
Vocab size	32,000
Target texts	700,000
Elapsed minutes	3.50
Training platform	Modal
Output file	`tokenizer.json`

Dataset Mix

Share	Dataset	Config	Split	Column
50%	`allenai/c4`	`en`	`train`	`text`
20%	`HuggingFaceFW/fineweb-edu`	`None`	`train`	`text`
10%	`wikimedia/wikipedia`	`20231101.en`	`train`	`text`
10%	`codeparrot/codeparrot-clean`	`None`	`train`	`content`
10%	`allenai/c4`	`es`	`train`	`text`
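
If the shares are applied to the 700,000 target texts listed under Training Stats (an assumption; the README does not say whether the split is by text count, bytes, or tokens), the per-dataset counts work out as in this sketch. The `mix` list simply restates the table.

```python
TARGET_TEXTS = 700_000  # "Target texts" from Training Stats

# (share in percent, dataset, config) restated from the Dataset Mix table
mix = [
    (50, "allenai/c4", "en"),
    (20, "HuggingFaceFW/fineweb-edu", None),
    (10, "wikimedia/wikipedia", "20231101.en"),
    (10, "codeparrot/codeparrot-clean", None),
    (10, "allenai/c4", "es"),
]

# Integer arithmetic avoids floating-point rounding surprises.
counts = [(name, config, TARGET_TEXTS * share // 100) for share, name, config in mix]

for name, config, n in counts:
    print(f"{name:30} {str(config):12} {n:>8,}")
```

Under that assumption, English C4 contributes 350,000 texts, FineWeb-Edu 140,000, and each of the remaining three sources 70,000.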

Files

Expected files:

tokenizer-bpe-32k/
├── tokenizer.json
├── metadata.json
├── README.md
├── vocab.json      # optional, if exported
└── merges.txt      # optional, if exported

tokenizer.json is the main all-in-one tokenizer file.

vocab.json and merges.txt are optional classic BPE files. Some older GPT-2/RoBERTa-style tools may ask for them.
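
For context, `vocab.json` maps each token string to its integer ID, and `merges.txt` lists merge rules in priority order. The sketch below shows how ranked merges turn a sequence of symbols into tokens. The merge list is made up for illustration and is not taken from this tokenizer's `merges.txt`; `apply_merges` is a hypothetical helper, not a library function.

```python
def apply_merges(symbols, merges):
    """Repeatedly apply the highest-priority merge present in the sequence."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = higher priority
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, in priority order.
toy_merges = [("c", "h"), ("t", "o"), ("to", "r"), ("Ġ", "tor")]
result = apply_merges(list("Ġtorch"), toy_merges)
print(result)  # ['Ġtor', 'ch']
```

With these toy rules, `Ġtorch` ends up as `Ġtor` plus `ch`, the same shape as the split shown in the Code example under Token Examples. The real tokenizer reaches its split through its own learned merges.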

Install

pip install tokenizers

Load Tokenizer

from tokenizers import Tokenizer

tok = Tokenizer.from_file("./tokenizer-bpe-32k/tokenizer.json")

enc = tok.encode("Hello!")
print(enc.tokens)
print(enc.ids)

Example output:

['ĠHello', '!']
[25831, 5]

Token Examples

English

Input:  Hello!
Tokens: ['ĠHello', '!']
IDs:    [25831, 5]
Count:  2

Spanish

Input:  Hola amigo, el tokenizer funciona muy bien.
Tokens: ['ĠHol', 'a', 'Ġamigo', ',', 'Ġel', 'Ġtoken', 'izer', 'Ġfunciona', 'Ġmuy', 'Ġbien', '.']
Count:  11

Code

Input:
import torch
print(torch.__version__)

Tokens:
['Ġimport', 'Ġtor', 'ch', 'Ċ', 'print', '(', 'tor', 'ch', '.__', 'version', '__)']

Count: 11

Meme / Emoji Text

Input:  BROOOOOOOOOOOOOO 💀💀💀🔥🔥🔥
Count:  24

Emoji-heavy text may split into many byte-level pieces. That is normal for ByteLevel BPE.
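
The reason is that a ByteLevel tokenizer works on UTF-8 bytes, and emoji take several bytes each. Since every token covers at least one byte, the UTF-8 byte count is an upper bound on how many tokens a string can produce (plus one if a prefix space is added). A small sketch using only the standard library; `utf8_len` is a helper name introduced here:

```python
def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))

samples = ["A", "\u00f1", "💀", "🔥", "💀💀💀🔥🔥🔥"]
for s in samples:
    print(f"{s!r}: {len(s)} character(s), {utf8_len(s)} byte(s)")
```

Each of these emoji is four bytes long, so without a learned merge that covers it, a single emoji can cost up to four tokens.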

Notes

What does `Ġ` mean?

Ġ marks a space before a token. For example:

ĠHello

means the token covers a leading space followed by Hello.

What does `Ċ` mean?

Ċ represents a newline in byte-level tokenization.
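
Both symbols come from the byte-to-character table that GPT-2 style ByteLevel tokenizers use to make every byte printable: printable bytes map to themselves, and the rest are shifted to code points from U+0100 upward. The sketch below reimplements that table in plain Python, on the assumption that this tokenizer uses the standard GPT-2 mapping; `bytes_to_unicode` here is a local function, not an import from `tokenizers`.

```python
def bytes_to_unicode() -> dict:
    """Build the GPT-2 style byte -> printable character table."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in keep}
    # Remaining bytes (control characters, space, ...) get 256, 257, ...
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table

table = bytes_to_unicode()
print(table[ord(" ")])   # Ġ (U+0120)
print(table[ord("\n")])  # Ċ (U+010A)
```

Space is byte 32 and is the 33rd non-printable byte in order, so it lands on code point 256 + 32 = 288, which is Ġ. Newline is byte 10 and lands on 266, which is Ċ.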

Why does decoding add a leading space?

This tokenizer was trained with the ByteLevel prefix-space option enabled, so a space is added to the start of the input before encoding. As a result, encoding Hello! and then decoding the IDs can give:

" Hello!"

This is normal for this tokenizer style.
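
The effect can be imitated with a toy model in plain Python. This is only an illustration of the prefix-space idea, not the `tokenizers` implementation: `toy_encode` and `toy_decode` are hypothetical helpers, and the rule of adding a space only when the text does not already start with one is stated here as an assumption about the library's behavior.

```python
def toy_encode(text: str, add_prefix_space: bool = True) -> str:
    """Imitate ByteLevel pre-processing: optional prefix space, spaces shown as Ġ."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return text.replace(" ", "Ġ")

def toy_decode(symbols: str) -> str:
    """Map Ġ back to a space; the added prefix space is not removed."""
    return symbols.replace("Ġ", " ")

encoded = toy_encode("Hello!")
decoded = toy_decode(encoded)
print(encoded)        # ĠHello!
print(repr(decoded))  # ' Hello!'

# Optional clean-up when the input is known not to start with a space.
cleaned = decoded[1:] if decoded.startswith(" ") else decoded
print(repr(cleaned))  # 'Hello!'
```

If the exact original string matters, dropping one leading space after decoding is a simple workaround, provided the input never starts with a space of its own.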

Export `vocab.json` and `merges.txt`

If you only have tokenizer.json, you can export classic BPE files like this:

from tokenizers import Tokenizer

out = "./tokenizer-bpe-32k"
tok = Tokenizer.from_file(f"{out}/tokenizer.json")

tok.model.save(out)

This should create:

vocab.json
merges.txt

Modal Download Command

To download the tokenizer folder from the Modal Volume:

modal volume get tokenizer-outputs /tokenizer-bpe-32k .

Quick Test

python - <<'PY'
from tokenizers import Tokenizer

tok = Tokenizer.from_file("./tokenizer-bpe-32k/tokenizer.json")

tests = [
    "Hello!",
    "The quick brown fox jumps over the lazy dog.",
    "Hola amigo, el tokenizer funciona muy bien.",
    "def hello_world(): print('hi')",
    "BROOOOOOOOOOOOOO 💀🔥"
]

for text in tests:
    enc = tok.encode(text)
    print("\nTEXT:", text)
    print("TOKENS:", enc.tokens)
    print("IDS:", enc.ids)
    print("COUNT:", len(enc.ids))
PY

License

This tokenizer is released under the Apache License 2.0.

The tokenizer artifacts include:

tokenizer.json
vocab.json
merges.txt
metadata.json

The license applies to the tokenizer files in this repository.

Dataset Attribution

This tokenizer was trained on a mixture of public datasets:

allenai/c4
HuggingFaceFW/fineweb-edu
wikimedia/wikipedia
codeparrot/codeparrot-clean
allenai/c4 Spanish config

Users should respect the licenses and terms of the original datasets. The allenai/c4 dataset card lists its license as odc-by, so attribution is especially important.

Final Verdict

This tokenizer is strong for:

English
Spanish
Python/code-like text
URLs and emails
General web text

It is weaker for:

Emoji-heavy text
CJK scripts (Chinese, Japanese, Korean)
Arabic
Very meme-specific strings

For a 3.50-minute Modal run, this is a clean, good-quality tokenizer.
