OpenEuroLLM Tokenizer v2 (256k)
SentencePiece BPE tokenizer for the OpenEuroLLM flagship models. v2 is a full retrain of the v1 256k tokenizer on a larger, cleaner corpus, with the language list driven from the canonical training-data-catalogue/languages file (no hardcoded list this time; see v1's Georgian-gap incident).
Highlights vs SOTA: multi-domain eval
Evaluation is a held-out 5-domain suite (8,600 samples total), designed to test more than Wikipedia prose. Each column is mean tokens-per-whitespace-word (lower = better).
| Tokenizer | Vocab | Overall | FLORES-200 (36 langs parallel) | Code (Python) | Math (LaTeX+GSM8K) | Chat (ChatML) | PDFs (5 langs) |
|---|---|---|---|---|---|---|---|
| OpenEuroLLM v2 256k (this model) | 262,144 | 1.90 🥇 | 1.79 🥇 | 3.24 | 1.91 | 1.44 🥇 | 2.31 |
| GPT-OSS 20B | 200,000 | 2.07 | 2.07 | 2.62 | 1.64 | 1.54 | 2.01 |
| OpenEuroLLM v2 128k | 131,072 | 2.09 | 2.00 | 3.32 | 1.92 | 1.52 | 2.43 |
| Gemma 3 4B | 256,000 | 2.19 | 2.13 | 3.20 | 1.92 | 1.59 | 2.27 |
| Mistral Nemo | 131,072 | 2.23 | 2.20 | 2.84 | 1.92 | 1.62 | 2.26 |
| EuroLLM 9B | 128,000 | 2.30 | 2.21 | 3.79 | 2.02 | 1.57 | 2.48 |
| OpenEuroLLM v1 256k (predecessor) | 262,144 | 2.45 | 2.43 | 3.33 | 1.95 | 1.68 | 2.29 |
| DeepSeek V3 | 128,000 | 2.47 | 2.51 | 2.83 | 1.65 | 1.65 | 2.13 |
| OpenEuroLLM v1 128k | 131,072 | 2.62 | 2.62 | 3.42 | 1.98 | 1.76 | 2.40 |
| Llama 3.1 8B | 128,256 | 2.68 | 2.78 | 2.60 🥇 | 1.65 🥇 | 1.65 | 2.18 |
| Qwen 3 8B | 151,936 | 2.70 | 2.78 | 2.64 | 1.90 | 1.51 | 2.35 |
Eval composition: 7,200 FLORES (36 langs × 200 parallel sentences, held-out) · 500 Python (codeparrot) · 200 MATH+GSM8K · 200 OpenAssistant chat (ChatML-wrapped) · 500 FinePDFs (5 langs).
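The fertility metric used in the table (tokens per whitespace-delimited word) can be sketched as follows. This is a minimal illustration, not the project's evaluation script: it assumes the ratio is computed per sample and then averaged, and `toy_encode` is a hypothetical stand-in for a real tokenizer's encode function.

```python
from typing import Callable, Iterable, Sequence


def fertility(texts: Iterable[str], encode: Callable[[str], Sequence[int]]) -> float:
    """Mean tokens per whitespace-delimited word (lower is better).

    Assumption: the ratio is computed per sample and then averaged;
    a corpus-level ratio would weight long samples more heavily.
    """
    ratios = []
    for text in texts:
        n_words = len(text.split())
        if n_words == 0:
            continue  # skip empty samples rather than divide by zero
        ratios.append(len(encode(text)) / n_words)
    if not ratios:
        raise ValueError("no non-empty samples")
    return sum(ratios) / len(ratios)


def toy_encode(text: str) -> list[int]:
    """Hypothetical stand-in tokenizer: one token per non-space character."""
    return [ord(ch) for ch in text if not ch.isspace()]


score = fertility(["ab cd", "abcd"], toy_encode)
```

With a real tokenizer, `encode` could be something like `lambda s: tok.encode(s, add_special_tokens=False)`.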
Summary of v2-256k vs the field:
- #1 overall (1.90): beats every SOTA tokenizer in the comparison on the multi-domain average.
- #1 on multilingual prose by a wide margin (FLORES 1.79 vs Gemma 2.13, GPT-OSS 2.07).
- #1 on chat (1.44, ChatML tokens working as intended).
- Competitive on math/PDF (within 0.3 of the leader).
- Loses on Python code (3.24 vs Llama 2.60). Llama 3's tiktoken-based BPE is more code-aggressive; even the v2 whitespace tokens don't fully close that gap.
v2 vs v1: per-language deltas on FLORES-200
FLORES-200 has parallel sentences across all languages (semantically equivalent translations), so fertility differences here are pure tokenizer effect (no content drift). Both vocabulary sizes (256k and 128k) are compared; Δ = v2 − v1, so a negative Δ (bold) means v2 is better. English is listed first, then the remaining languages alphabetically.
| Language | v1 256k | v2 256k | Δ256k | v1 128k | v2 128k | Δ128k |
|---|---|---|---|---|---|---|
| English (en) | 1.24 | 1.21 | **−0.03** | 1.29 | 1.23 | **−0.06** |
| Albanian (sq) | 2.26 | 1.59 | **−0.67** | 2.44 | 1.76 | **−0.68** |
| Basque (eu) | 2.05 | 1.90 | **−0.15** | 2.28 | 2.12 | **−0.17** |
| Bosnian (bs) | 1.66 | 1.60 | **−0.07** | 1.84 | 1.78 | **−0.06** |
| Bulgarian (bg) | 1.78 | 1.87 | +0.09 | 1.95 | 2.13 | +0.18 |
| Catalan (ca) | 1.64 | 1.57 | **−0.07** | 1.77 | 1.70 | **−0.07** |
| Croatian (hr) | 1.72 | 1.63 | **−0.10** | 1.91 | 1.82 | **−0.09** |
| Czech (cs) | 1.55 | 1.79 | +0.24 | 1.72 | 2.04 | +0.31 |
| Danish (da) | 1.62 | 1.54 | **−0.07** | 1.76 | 1.69 | **−0.07** |
| Dutch (nl) | 1.62 | 1.53 | **−0.09** | 1.77 | 1.68 | **−0.09** |
| Estonian (et) | 2.15 | 2.04 | **−0.11** | 2.41 | 2.30 | **−0.11** |
| Finnish (fi) | 2.42 | 2.30 | **−0.11** | 2.71 | 2.59 | **−0.13** |
| French (fr) | 1.60 | 1.53 | **−0.06** | 1.74 | 1.67 | **−0.07** |
| Galician (gl) | 1.50 | 1.44 | **−0.06** | 1.64 | 1.58 | **−0.06** |
| Georgian (ka) | 22.93 | 2.83 | **−20.10** | 22.93 | 3.30 | **−19.63** |
| German (de) | 1.48 | 1.68 | +0.19 | 1.61 | 1.86 | +0.25 |
| Greek (el) | 2.24 | 2.12 | **−0.11** | 2.63 | 2.44 | **−0.20** |
| Hungarian (hu) | 2.15 | 2.06 | **−0.09** | 2.44 | 2.33 | **−0.12** |
| Icelandic (is) | 2.01 | 1.84 | **−0.17** | 2.21 | 2.05 | **−0.16** |
| Irish (ga) | 1.71 | 1.60 | **−0.11** | 1.91 | 1.79 | **−0.12** |
| Italian (it) | 1.35 | 1.51 | +0.17 | 1.45 | 1.66 | +0.21 |
| Latvian (lv) | 3.01 | 1.94 | **−1.07** | 3.18 | 2.20 | **−0.98** |
| Lithuanian (lt) | 2.04 | 1.99 | **−0.05** | 2.30 | 2.27 | **−0.03** |
| Macedonian (mk) | 1.81 | 1.89 | +0.08 | 1.99 | 2.13 | +0.13 |
| Maltese (mt) | 2.34 | 2.22 | **−0.12** | 2.59 | 2.47 | **−0.12** |
| Norwegian (no) | 1.55 | 1.52 | **−0.03** | 1.69 | 1.65 | **−0.04** |
| Polish (pl) | 1.74 | 1.90 | +0.17 | 1.95 | 2.16 | +0.21 |
| Portuguese (pt) | 1.52 | 1.45 | **−0.07** | 1.66 | 1.60 | **−0.06** |
| Romanian (ro) | 1.76 | 1.58 | **−0.17** | 1.92 | 1.75 | **−0.17** |
| Serbian (sr) | 2.03 | 1.97 | **−0.06** | 2.20 | 2.23 | +0.02 |
| Slovak (sk) | 1.86 | 1.91 | +0.05 | 2.04 | 2.12 | +0.08 |
| Slovene (sl) | 1.78 | 1.74 | **−0.04** | 1.97 | 1.93 | **−0.04** |
| Spanish (es) | 1.47 | 1.41 | **−0.06** | 1.60 | 1.54 | **−0.05** |
| Swedish (sv) | 1.70 | 1.66 | **−0.04** | 1.85 | 1.81 | **−0.04** |
| Turkish (tr) | 2.12 | 1.92 | **−0.20** | 2.40 | 2.16 | **−0.24** |
| Ukrainian (uk) | 2.16 | 2.13 | **−0.03** | 2.42 | 2.46 | +0.04 |
| Average (36 catalogue langs) | 2.43 | 1.79 | **−0.64** | 2.62 | 2.00 | **−0.62** |
v2-256k improves on 29/36 languages. Biggest wins: Georgian −20.10 (v1 was full byte-fallback; v2 has real script subwords), Latvian −1.07, Albanian −0.67. The seven regressions (bg/cs/de/it/mk/pl/sk) are small (+0.05 to +0.24) and reflect that v2 spreads vocabulary across more code/whitespace coverage. Note: lb/ru/cy aren't tested here; they're not in the catalogue and were dropped from v2's training scope.
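The 256k deltas can be recomputed from the table with a few lines of Python. This is only a sketch that works from the rounded two-decimal values shown above, so individual deltas may differ from the table by ±0.01 (the table was presumably computed from unrounded fertilities).

```python
# (v1, v2) fertility for the 256k models, copied from the FLORES-200 table.
FERTILITY_256K = {
    "en": (1.24, 1.21), "sq": (2.26, 1.59), "eu": (2.05, 1.90), "bs": (1.66, 1.60),
    "bg": (1.78, 1.87), "ca": (1.64, 1.57), "hr": (1.72, 1.63), "cs": (1.55, 1.79),
    "da": (1.62, 1.54), "nl": (1.62, 1.53), "et": (2.15, 2.04), "fi": (2.42, 2.30),
    "fr": (1.60, 1.53), "gl": (1.50, 1.44), "ka": (22.93, 2.83), "de": (1.48, 1.68),
    "el": (2.24, 2.12), "hu": (2.15, 2.06), "is": (2.01, 1.84), "ga": (1.71, 1.60),
    "it": (1.35, 1.51), "lv": (3.01, 1.94), "lt": (2.04, 1.99), "mk": (1.81, 1.89),
    "mt": (2.34, 2.22), "no": (1.55, 1.52), "pl": (1.74, 1.90), "pt": (1.52, 1.45),
    "ro": (1.76, 1.58), "sr": (2.03, 1.97), "sk": (1.86, 1.91), "sl": (1.78, 1.74),
    "es": (1.47, 1.41), "sv": (1.70, 1.66), "tr": (2.12, 1.92), "uk": (2.16, 2.13),
}

# Delta = v2 - v1; negative means v2 needs fewer tokens per word.
deltas = {lang: round(v2 - v1, 2) for lang, (v1, v2) in FERTILITY_256K.items()}

# Languages where v2 is worse than v1, and the three largest improvements.
regressions = sorted(lang for lang, d in deltas.items() if d > 0)
biggest_wins = sorted(deltas, key=deltas.get)[:3]

print("regressions:", regressions)
print("biggest wins:", biggest_wins)
```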
Language coverage
36 catalogue languages: bg, bs, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, is, it, ka (new in v2), lt, lv, mk, mt, nl, no, pl, pt, ro, sk, sl, sq, sr, sv, tr, uk.
Languages removed from v2 vs v1: lb (Luxembourgish), ru (Russian), cy (Welsh); none of them is in the OpenEuroLLM catalogue.
Training details
- Algorithm: SentencePiece BPE
- Vocab size: 262,144 (2^18)
- Normalization: identity (lossless)
- Byte fallback: enabled
- Corpus: 500 GB streamed/sampled from the OpenEuroLLM "baby" cycle release shards on LUMI (10 May 2026 packer), spanning dclm, nemotron-cc, finepdfs, finepdfs-edu, olmo-mix (wiki/arxiv/pes2o), starcoder, finemath-4plus, megamath (text-code-block, web-pro), hplt-3.0, nemotron-cc-opus-1.1, nemotron-cc-tower+-0.1.
- Mix: ~70% English / ~7% code / ~5% math / ~18% other-langs (35 catalogue non-English languages, equal allocation).
- Character coverage: 0.9995
- Max piece length: 16
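As a rough illustration, the settings above map onto SentencePiece trainer options as sketched below. This is not the project's training script: the corpus path, model prefix and symbol list are hypothetical placeholders, and `remove_extra_whitespaces=False` is an assumption (the source does not list it, but whitespace-run symbols would otherwise be collapsed). The second helper works out what "equal allocation" implies for each non-English language, assuming the ~18% share applies to the full 500 GB.

```python
def trainer_kwargs(corpus_path: str, model_prefix: str, user_symbols: list[str]) -> dict:
    """Keyword arguments for spm.SentencePieceTrainer.train (sketch only)."""
    return dict(
        input=corpus_path,                    # hypothetical path to sampled text
        model_prefix=model_prefix,            # hypothetical output prefix
        model_type="bpe",
        vocab_size=262_144,                   # 2**18
        normalization_rule_name="identity",   # lossless, no NFKC folding
        byte_fallback=True,
        character_coverage=0.9995,
        max_sentencepiece_length=16,
        unk_id=0, bos_id=1, eos_id=2, pad_id=3,
        unk_piece="<unk>", bos_piece="<bos>", eos_piece="<eos>", pad_piece="<pad>",
        user_defined_symbols=user_symbols,
        remove_extra_whitespaces=False,       # assumption, see lead-in
    )


def per_language_budget_gb(total_gb: float = 500.0, other_share: float = 0.18,
                           n_languages: int = 35) -> float:
    """Corpus size per non-English language under equal allocation."""
    return total_gb * other_share / n_languages


def train(corpus_path: str, model_prefix: str, user_symbols: list[str]) -> None:
    """Run the trainer; imported lazily so the helpers work without sentencepiece."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**trainer_kwargs(corpus_path, model_prefix, user_symbols))


kwargs = trainer_kwargs("corpus.txt", "oellm-256k-v2", ["<think>", "</think>"])
budget = per_language_budget_gb()  # roughly 2.57 GB per language
```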
Special tokens
Core (locked at fixed IDs, in-vocab):
| Token | ID |
|---|---|
| `<unk>` | 0 |
| `<bos>` | 1 |
| `<eos>` | 2 |
| `<pad>` | 3 |
This fixes a v1 bug where <pad> was tacked on at vocab_size+0 (262144), out-of-vocab for downstream code that asserts pad_token_id < vocab_size.
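A downstream sanity check for this can be sketched as below. The helper name is hypothetical; it only assumes a token-to-ID mapping and a vocabulary size, and the two example mappings mirror the v1 and v2 layouts described above.

```python
EXPECTED_CORE_IDS = {"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 3}


def check_core_special_ids(token_to_id: dict[str, int], vocab_size: int) -> list[str]:
    """Return a list of problems with the core special-token IDs (empty if OK)."""
    problems = []
    for token, expected_id in EXPECTED_CORE_IDS.items():
        actual = token_to_id.get(token)
        if actual is None:
            problems.append(f"{token} is missing")
        elif actual >= vocab_size:
            problems.append(f"{token} has ID {actual}, outside vocab of size {vocab_size}")
        elif actual != expected_id:
            problems.append(f"{token} has ID {actual}, expected {expected_id}")
    return problems


# v2 layout: all four core tokens sit at fixed low IDs.
v2_problems = check_core_special_ids({"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 3}, 262_144)

# v1 layout as described above: <pad> appended at vocab_size + 0.
v1_problems = check_core_special_ids({"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 262_144}, 262_144)
```

With a Hugging Face tokenizer, the mapping could come from `tok.get_vocab()` and the size from `tok.vocab_size`.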
User-defined symbols (204 total), new for v2
Whitespace family: better code efficiency (45 tokens)
Multi-character whitespace runs are reserved as single tokens so code with deep indentation doesn't burn a token per space. None of these were in v1. Pretrained tokenizers without such tokens can spend 16 or more tokens on a 32-space indent; v2 encodes it as one.
| Bucket | Tokens | Reserved IDs |
|---|---|---|
| Multi-space indents | `"  "` (2 spaces), `"   "` (3), … `" "*32` (32) | 31 tokens |
| Tabs | `"\t"`, `"\t\t"`, … `"\t"*8` | 8 tokens |
| Multi-newline | `"\n\n"`, `"\n\n\n"`, `"\n\n\n\n"` | 3 tokens |
| Common code combos | `"\r\n"`, `"\t\n"`, `"    \n"` (4-space-then-newline) | 3 tokens |
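The four buckets can be enumerated programmatically, which also confirms the total of 45. This is a sketch of how such a list might be generated; the function name is hypothetical, and the order of the tokens is an assumption since the actual ID assignment is not specified here.

```python
def whitespace_family() -> list[str]:
    """Build the whitespace-run symbols described in the table."""
    spaces = [" " * n for n in range(2, 33)]     # 2..32 spaces -> 31 tokens
    tabs = ["\t" * n for n in range(1, 9)]       # 1..8 tabs    -> 8 tokens
    newlines = ["\n" * n for n in range(2, 5)]   # 2..4 newlines -> 3 tokens
    combos = ["\r\n", "\t\n", " " * 4 + "\n"]    # common code combos -> 3 tokens
    return spaces + tabs + newlines + combos


tokens = whitespace_family()
print(len(tokens))  # 45
```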
Example:
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k-v2")
>>> python_code = "def f():\n" + " " * 8 + "return 1"  # 8-space indent
>>> n_tokens = len(tok.encode(python_code, add_special_tokens=False))
>>> # v2: the 8-space indent is encoded as ONE reserved whitespace token
>>> # tokenizers without whitespace-run tokens split the same indent into several tokens
StarCoder-style code corpus markers (16 tokens)
Reserved for code-corpus-formatted inputs (<filename>foo.py\n... <file_sep>\n<reponame>OpenEuroLLM/x\n...), arguably the biggest single contributor to StarCoder2's code quality:
<filename>, <reponame>, <file_sep>, <gh_stars>, <empty_output>, <issue_start>, <issue_comment>, <issue_closed>, <jupyter_start>, <jupyter_text>, <jupyter_code>, <jupyter_output>, <jupyter_script>, <commit_before>, <commit_msg>, <commit_after>.
v1 had none of these.
Fill-in-the-middle (FIM) for code completion (4 tokens)
Full 4-token FIM set for code-completion training. v1 had 3 (missing <fim_pad>).
<fim_prefix>, <fim_middle>, <fim_suffix>, <fim_pad>
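A FIM training example is typically built by cutting a document into prefix, middle and suffix and reordering the parts around these tokens. The sketch below assumes the common prefix-suffix-middle (PSM) ordering used by StarCoder-style models; the ordering OpenEuroLLM will actually train with, and how `<fim_pad>` is used, are not specified here. Both function names are hypothetical.

```python
def split_for_fim(code: str, start: int, end: int) -> tuple[str, str, str]:
    """Cut a document into (prefix, middle, suffix) at character offsets."""
    return code[:start], code[start:end], code[end:]


def fim_psm(prefix: str, middle: str, suffix: str) -> str:
    """Format one example in prefix-suffix-middle order (assumed layout)."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"


source = "def add(a, b):\n    return a + b\n"
start = source.index("return")
prefix, middle, suffix = split_for_fim(source, start, start + len("return a + b"))
example = fim_psm(prefix, middle, suffix)
```

At inference time the prompt would stop after `<fim_middle>`, and the model would generate the missing span.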
Chat formats (4 tokens)
ChatML as the modern default; Gemma-style retained from v1 for compatibility.
| Format | Tokens | Note |
|---|---|---|
| ChatML (primary) | `<\|im_start\|>`, `<\|im_end\|>` | New in v2 |
| Gemma-style | `<start_of_turn>`, `<end_of_turn>` | Carried over from v1 |
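For reference, a conversation is wrapped in ChatML roughly as follows. This is a minimal sketch of the widely used ChatML layout (`<|im_start|>role`, newline, content, `<|im_end|>`); the function name is hypothetical, and the chat template shipped with the OpenEuroLLM models may differ in details such as trailing newlines.

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} messages as a ChatML string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
])
```

Because `<|im_start|>` and `<|im_end|>` are single reserved tokens, the wrapper costs two tokens per turn plus the role name and newlines.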
Tool use (2 tokens)
<tool_call>, </tool_call> (carried over from v1).
Reasoning / chain-of-thought (2 tokens), new for v2
<think>, </think>: the DeepSeek-R1 / Qwen3 convention. Reserved up-front so future thinking-style post-training doesn't need to retokenize.
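On the consumer side, output that follows this convention can be split into reasoning and final answer with a small helper. The sketch below is illustrative only: the function name is hypothetical, and it assumes well-formed, non-nested `<think>...</think>` spans in decoded text.

```python
import re

# Non-greedy match so several separate reasoning spans are handled individually.
_THINK_SPAN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[list[str], str]:
    """Return (reasoning spans, remaining answer text)."""
    thoughts = [span.strip() for span in _THINK_SPAN.findall(text)]
    answer = _THINK_SPAN.sub("", text).strip()
    return thoughts, answer


thoughts, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
```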
Multimodal (3 tokens)
<start_of_image>, <end_of_image>, <image_soft_token> (carried over from v1).
Reserved for future (128 tokens)
<unused_0> … <unused_127>: forward-compat slots. v1 had 100; v2 expands to 128. Llama 3 reserves 256, so this is a middle ground at ~0.05% of the vocab.
v1 vs v2 special-tokens summary
| Category | v1 | v2 | Δ |
|---|---|---|---|
| Core specials in-vocab (unk/bos/eos/pad) | ❌ pad at vocab+0 | ✅ pad=3 | fixed |
| Whitespace family | ❌ none | ✅ 45 tokens | new |
| StarCoder code markers | ❌ none | ✅ 16 tokens | new |
| FIM | 3 tokens | 4 tokens | + `<fim_pad>` |
| ChatML | ❌ | ✅ | new |
| Gemma chat | ✅ | ✅ | kept |
| Reasoning | ❌ | ✅ `<think>`/`</think>` | new |
| Tool / multimodal | ✅ | ✅ | kept |
| Reserved slots | 100 | 128 | +28 |
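The per-category counts given in the sections above add up to the stated 204 user-defined symbols, and the reserved slots come to just under 0.05% of the vocabulary. A quick tally, with dictionary keys chosen here purely for illustration:

```python
VOCAB_SIZE = 262_144

# Token counts per category, as listed in the sections above.
USER_DEFINED_COUNTS = {
    "whitespace": 45,
    "starcoder_markers": 16,
    "fim": 4,
    "chat": 4,          # 2 ChatML + 2 Gemma-style
    "tool_use": 2,
    "reasoning": 2,
    "multimodal": 3,
    "reserved": 128,
}

total_user_defined = sum(USER_DEFINED_COUNTS.values())
reserved_names = [f"<unused_{i}>" for i in range(USER_DEFINED_COUNTS["reserved"])]
reserved_share_pct = 100 * USER_DEFINED_COUNTS["reserved"] / VOCAB_SIZE

print(total_user_defined)            # 204
print(round(reserved_share_pct, 3))  # 0.049
```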
Usage
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k-v2", use_fast=True)
Or with SentencePiece directly:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")
ids = sp.EncodeAsIds("გამარჯობა მსოფლიო")  # "Hello world" in Georgian
Files
- tokenizer.model: SentencePiece BPE model (4.4 MB)
- tokenizer.vocab: vocabulary listing
- special_tokens_map.json: HF special tokens map
- tokenizer_config.json: HF tokenizer config
Citation
Built for the OpenEuroLLM project (Horizon Europe). Source repo: https://github.com/OpenEuroLLM/tokenizer.