OpenEuroLLM Tokenizer v2 (256k)

SentencePiece BPE tokenizer for the OpenEuroLLM flagship models. v2 is a full retrain of the v1 256k tokenizer on a larger, cleaner corpus with the language list driven from the canonical training-data-catalogue/languages file (no hardcoded list this time — see v1's Georgian-gap incident).

Highlights vs SOTA — multi-domain eval

Evaluation is a held-out 5-domain suite (8,600 samples total), designed to test more than Wikipedia prose. Each column is mean tokens-per-whitespace-word (lower = better).

Tokenizer	Vocab	Overall	FLORES-200 (36 langs parallel)	Code (Python)	Math (LaTeX+GSM8K)	Chat (ChatML)	PDFs (5 langs)
OpenEuroLLM v2 256k (this model)	262,144	1.90 🥇	1.79 🥇	3.24	1.91	1.44 🥇	2.31
GPT-OSS 20B	200,000	2.07	2.07	2.62	1.64 🥇	1.54	2.01 🥇
OpenEuroLLM v2 128k	131,072	2.09	2.00	3.32	1.92	1.52	2.43
Gemma 3 4B	256,000	2.19	2.13	3.20	1.92	1.59	2.27
Mistral Nemo	131,072	2.23	2.20	2.84	1.92	1.62	2.26
EuroLLM 9B	128,000	2.30	2.21	3.79	2.02	1.57	2.48
OpenEuroLLM v1 256k (predecessor)	262,144	2.45	2.43	3.33	1.95	1.68	2.29
DeepSeek V3	128,000	2.47	2.51	2.83	1.65	1.65	2.13
OpenEuroLLM v1 128k	131,072	2.62	2.62	3.42	1.98	1.76	2.40
Llama 3.1 8B	128,256	2.68	2.78	2.60 🥇	1.65	1.65	2.18
Qwen 3 8B	151,936	2.70	2.78	2.64	1.90	1.51	2.35

Eval composition: 7,200 FLORES (36 langs × 200 parallel sentences, held-out) · 500 Python (codeparrot) · 200 MATH+GSM8K · 200 OpenAssistant chat (ChatML-wrapped) · 500 FinePDFs (5 langs).
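
The tokens-per-whitespace-word metric can be sketched as follows. This is a minimal illustration, not the project's evaluation script: the names `fertility` and `toy_encode` are hypothetical, and pooling token and word counts over all samples (rather than averaging per-sample ratios) is an assumption, since this card does not say which variant is used.

```python
from typing import Callable, Iterable, Sequence


def fertility(encode: Callable[[str], Sequence[int]], texts: Iterable[str]) -> float:
    """Tokens per whitespace-separated word, pooled over all texts (assumed)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())  # str.split() splits on any whitespace run
    if total_words == 0:
        raise ValueError("no whitespace-delimited words in input")
    return total_tokens / total_words


# Toy encoder for illustration only: one "token" per non-space character.
def toy_encode(text: str) -> list:
    return [ord(ch) for ch in text if not ch.isspace()]


print(fertility(toy_encode, ["ab cd", "efg"]))
```

With a real tokenizer, `encode` could be something like `lambda s: tok.encode(s, add_special_tokens=False)`.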

Summary of v2-256k vs the field:

#1 overall (1.90) — beats every SOTA tokenizer on the multi-domain average.
#1 on multilingual prose by a wide margin (FLORES 1.79 vs Gemma 2.13, GPT-OSS 2.07).
#1 on chat (1.44, ChatML tokens working as intended).
Competitive on math/PDF (within 0.3 of the leader).
Loses on Python code (3.24 vs Llama 2.60). Llama 3's tiktoken-based BPE is more code-aggressive; even the v2 whitespace tokens don't fully close that gap.
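
The Overall column looks consistent with a sample-weighted mean of the five domain columns, although this card does not state the formula. The sketch below checks that reading for the v2 256k row; the weighting scheme and the name `weighted_overall` are assumptions.

```python
# Sample counts from the eval composition line; scores from the v2 256k row.
SAMPLES = {"flores": 7200, "code": 500, "math": 200, "chat": 200, "pdfs": 500}
V2_256K = {"flores": 1.79, "code": 3.24, "math": 1.91, "chat": 1.44, "pdfs": 2.31}


def weighted_overall(scores: dict, samples: dict) -> float:
    """Mean of per-domain scores, weighted by each domain's sample count."""
    total = sum(samples.values())
    return sum(scores[d] * samples[d] for d in samples) / total


overall = weighted_overall(V2_256K, SAMPLES)
flores_share = SAMPLES["flores"] / sum(SAMPLES.values())
print(f"overall ~ {overall:.2f}, FLORES share of samples = {flores_share:.0%}")
```

Under this reading FLORES supplies 7,200 of the 8,600 samples, so the Overall ranking is driven mostly by multilingual prose; the per-domain columns are the better guide for code or math workloads.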

v2 vs v1: per-language deltas on FLORES-200

FLORES-200 provides parallel sentences (semantically equivalent translations) in every language, so fertility differences here reflect the tokenizer alone, not content drift. The 256k and 128k models are the same as in the table above; a negative Δ means v2 is better. English is listed first, then the rest alphabetically.

Language	v1 256k	v2 256k	Δ256k	v1 128k	v2 128k	Δ128k
English (en)	1.24	1.21	−0.03	1.29	1.23	−0.06
Albanian (sq)	2.26	1.59	−0.67	2.44	1.76	−0.68
Basque (eu)	2.05	1.90	−0.15	2.28	2.12	−0.17
Bosnian (bs)	1.66	1.60	−0.07	1.84	1.78	−0.06
Bulgarian (bg)	1.78	1.87	+0.09	1.95	2.13	+0.18
Catalan (ca)	1.64	1.57	−0.07	1.77	1.70	−0.07
Croatian (hr)	1.72	1.63	−0.10	1.91	1.82	−0.09
Czech (cs)	1.55	1.79	+0.24	1.72	2.04	+0.31
Danish (da)	1.62	1.54	−0.07	1.76	1.69	−0.07
Dutch (nl)	1.62	1.53	−0.09	1.77	1.68	−0.09
Estonian (et)	2.15	2.04	−0.11	2.41	2.30	−0.11
Finnish (fi)	2.42	2.30	−0.11	2.71	2.59	−0.13
French (fr)	1.60	1.53	−0.06	1.74	1.67	−0.07
Galician (gl)	1.50	1.44	−0.06	1.64	1.58	−0.06
Georgian (ka)	22.93	2.83	−20.10	22.93	3.30	−19.63
German (de)	1.48	1.68	+0.19	1.61	1.86	+0.25
Greek (el)	2.24	2.12	−0.11	2.63	2.44	−0.20
Hungarian (hu)	2.15	2.06	−0.09	2.44	2.33	−0.12
Icelandic (is)	2.01	1.84	−0.17	2.21	2.05	−0.16
Irish (ga)	1.71	1.60	−0.11	1.91	1.79	−0.12
Italian (it)	1.35	1.51	+0.17	1.45	1.66	+0.21
Latvian (lv)	3.01	1.94	−1.07	3.18	2.20	−0.98
Lithuanian (lt)	2.04	1.99	−0.05	2.30	2.27	−0.03
Macedonian (mk)	1.81	1.89	+0.08	1.99	2.13	+0.13
Maltese (mt)	2.34	2.22	−0.12	2.59	2.47	−0.12
Norwegian (no)	1.55	1.52	−0.03	1.69	1.65	−0.04
Polish (pl)	1.74	1.90	+0.17	1.95	2.16	+0.21
Portuguese (pt)	1.52	1.45	−0.07	1.66	1.60	−0.06
Romanian (ro)	1.76	1.58	−0.17	1.92	1.75	−0.17
Serbian (sr)	2.03	1.97	−0.06	2.20	2.23	+0.02
Slovak (sk)	1.86	1.91	+0.05	2.04	2.12	+0.08
Slovene (sl)	1.78	1.74	−0.04	1.97	1.93	−0.04
Spanish (es)	1.47	1.41	−0.06	1.60	1.54	−0.05
Swedish (sv)	1.70	1.66	−0.04	1.85	1.81	−0.04
Turkish (tr)	2.12	1.92	−0.20	2.40	2.16	−0.24
Ukrainian (uk)	2.16	2.13	−0.03	2.42	2.46	+0.04
Average (36 catalogue langs)	2.43	1.79	−0.64	2.62	2.00	−0.62

v2-256k improves on 29/36 languages. Biggest wins: Georgian −20.10 (v1 was full byte-fallback; v2 has real script subwords), Latvian −1.07, Albanian −0.67. The seven regressions (bg/cs/de/it/mk/pl/sk) are small (+0.05 to +0.24) and reflect that v2 spends more of its vocabulary on code and whitespace coverage. Note: lb/ru/cy aren't tested here; they're not in the catalogue and were dropped from v2's training scope.
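
The counts and averages can be re-derived from the Δ256k column. The sketch below copies that column by hand and tallies it; the variable names are illustrative, and the values carry the table's two-decimal rounding.

```python
from statistics import mean

# Δ256k column (v2 minus v1), copied from the table above.
DELTA_256K = {
    "en": -0.03, "sq": -0.67, "eu": -0.15, "bs": -0.07, "bg": +0.09, "ca": -0.07,
    "hr": -0.10, "cs": +0.24, "da": -0.07, "nl": -0.09, "et": -0.11, "fi": -0.11,
    "fr": -0.06, "gl": -0.06, "ka": -20.10, "de": +0.19, "el": -0.11, "hu": -0.09,
    "is": -0.17, "ga": -0.11, "it": +0.17, "lv": -1.07, "lt": -0.05, "mk": +0.08,
    "mt": -0.12, "no": -0.03, "pl": +0.17, "pt": -0.07, "ro": -0.17, "sr": -0.06,
    "sk": +0.05, "sl": -0.04, "es": -0.06, "sv": -0.04, "tr": -0.20, "uk": -0.03,
}

improved = [lang for lang, d in DELTA_256K.items() if d < 0]
regressed = sorted(lang for lang, d in DELTA_256K.items() if d > 0)
mean_all = mean(DELTA_256K.values())
mean_without_ka = mean(d for lang, d in DELTA_256K.items() if lang != "ka")

print(len(improved), "of", len(DELTA_256K), "improved; regressed:", regressed)
print(round(mean_all, 2), round(mean_without_ka, 2))
```

Because Georgian moves by about 20 tokens per word, it dominates the average; with Georgian excluded, the mean 256k delta is roughly −0.09.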

Language coverage

36 catalogue languages: bg, bs, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, is, it, ka (new in v2), lt, lv, mk, mt, nl, no, pl, pt, ro, sk, sl, sq, sr, sv, tr, uk.

Languages removed from v2 vs v1: lb (Luxembourgish), ru (Russian), cy (Welsh) — not in the OpenEuroLLM catalogue.

Training details

Algorithm: SentencePiece BPE
Vocab size: 262,144 (2^18)
Normalization: identity (lossless)
Byte fallback: enabled
Corpus: 500 GB streamed/sampled from the OpenEuroLLM "baby" cycle release shards on LUMI (10 May 2026 packer), spanning dclm, nemotron-cc, finepdfs, finepdfs-edu, olmo-mix (wiki/arxiv/pes2o), starcoder, finemath-4plus, megamath (text-code-block, web-pro), hplt-3.0, nemotron-cc-opus-1.1, nemotron-cc-tower+-0.1.
Mix: ~70% English / ~7% code / ~5% math / ~18% other-langs (35 catalogue non-English languages, equal allocation).
Character coverage: 0.9995
Max piece length: 16
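
For orientation, the settings above map onto SentencePiece trainer options roughly as follows. This is a sketch, not the project's training script: `build_trainer_kwargs`, the corpus path and the model prefix are hypothetical, and `remove_extra_whitespaces=False` is an assumption that is not stated in this card.

```python
def build_trainer_kwargs(corpus_path: str, user_symbols: list) -> dict:
    """Keyword arguments for sentencepiece.SentencePieceTrainer.train()."""
    return {
        "input": corpus_path,                    # hypothetical path to sampled text
        "model_prefix": "tokenizer",             # writes tokenizer.model / tokenizer.vocab
        "model_type": "bpe",
        "vocab_size": 2 ** 18,                   # 262,144
        "normalization_rule_name": "identity",   # lossless: no Unicode normalization
        "byte_fallback": True,
        "character_coverage": 0.9995,
        "max_sentencepiece_length": 16,
        "user_defined_symbols": user_symbols,
        # Assumption: keeps runs of spaces intact so multi-space tokens can match.
        "remove_extra_whitespaces": False,
    }


kwargs = build_trainer_kwargs("corpus.txt", ["<think>", "</think>"])
print(kwargs["vocab_size"])

# To actually train (requires the sentencepiece package and a corpus file):
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(**kwargs)
```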

Special tokens

Core (locked at fixed IDs, in-vocab):

Token	ID
`<unk>`	0
`<bos>`	1
`<eos>`	2
`<pad>`	3

This fixes a v1 bug where <pad> was tacked on at vocab_size+0 (262144), out-of-vocab for downstream code that asserts pad_token_id < vocab_size.
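
A quick sanity check for this layout might look like the sketch below. The helper name `check_core_ids` is hypothetical; it assumes `sp` is a loaded `sentencepiece.SentencePieceProcessor` and uses that class's `unk_id()`, `bos_id()`, `eos_id()`, `pad_id()` and `get_piece_size()` accessors.

```python
def check_core_ids(sp) -> None:
    """Assert the core specials sit at IDs 0..3 and inside the vocabulary."""
    expected = {"unk": 0, "bos": 1, "eos": 2, "pad": 3}
    actual = {
        "unk": sp.unk_id(),
        "bos": sp.bos_id(),
        "eos": sp.eos_id(),
        "pad": sp.pad_id(),
    }
    assert actual == expected, f"unexpected core IDs: {actual}"
    vocab_size = sp.get_piece_size()
    assert all(0 <= i < vocab_size for i in actual.values()), "special ID outside vocab"


# Usage, assuming tokenizer.model is in the working directory:
#   import sentencepiece as spm
#   check_core_ids(spm.SentencePieceProcessor(model_file="tokenizer.model"))
```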

User-defined symbols (204 total) — new for v2

Whitespace family — better code efficiency (45 tokens)

Multi-character whitespace runs are reserved as single tokens so code with deep indentation doesn't burn a token per space. None of these were in v1. Pretrained tokenizers without them tokenize a 32-space indent as 16+ tokens; v2 does it in 1.

Bucket	Tokens	Count
Multi-space indents	`"  "` (2 spaces), `"   "` (3), … `" "*32` (32)	31 tokens
Tabs	`"\t"`, `"\t\t"`, … `"\t"*8`	8 tokens
Multi-newline	`"\n\n"`, `"\n\n\n"`, `"\n\n\n\n"`	3 tokens
Common code combos	`"\r\n"`, `"\t\n"`, `"    \n"` (4-space-then-newline)	3 tokens
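
The four buckets can be enumerated programmatically. The sketch below rebuilds the list from the table's descriptions and checks that it adds up to 45; the function name `whitespace_family` is illustrative, and the order of tokens here says nothing about their actual IDs.

```python
def whitespace_family() -> list:
    """Whitespace tokens as described in the table, grouped by bucket."""
    spaces = [" " * n for n in range(2, 33)]     # 2..32 spaces  -> 31 tokens
    tabs = ["\t" * n for n in range(1, 9)]       # 1..8 tabs     -> 8 tokens
    newlines = ["\n" * n for n in range(2, 5)]   # 2..4 newlines -> 3 tokens
    combos = ["\r\n", "\t\n", "    \n"]          # CRLF, tab+LF, 4 spaces+LF
    return spaces + tabs + newlines + combos


tokens = whitespace_family()
print(len(tokens), len(set(tokens)))
```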

Example:

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k-v2")
>>> python_code = "def f():\n        return 1"  # 8-space indent
>>> ids = tok.encode(python_code, add_special_tokens=False)
>>> len(ids)
# v2: tokenizes the 8-space indent as ONE token
# Llama 3.1: splits the same indent into several tokens (about one per space-pair)

StarCoder-style code corpus markers (16 tokens)

Reserved for inputs formatted like a code corpus (<filename>foo.py\n... <file_sep>\n<reponame>OpenEuroLLM/x\n...). This repository-context formatting is reportedly among the biggest contributors to StarCoder2's code quality:

<filename>, <reponame>, <file_sep>, <gh_stars>, <empty_output>, <issue_start>, <issue_comment>, <issue_closed>, <jupyter_start>, <jupyter_text>, <jupyter_code>, <jupyter_output>, <jupyter_script>, <commit_before>, <commit_msg>, <commit_after>.

v1 had none of these.

Fill-in-the-middle (FIM) — code completion (4 tokens)

Full 4-token FIM set for code-completion training. v1 had 3 (missing <fim_pad>).

<fim_prefix>, <fim_middle>, <fim_suffix>, <fim_pad>
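
As an illustration of how these tokens are typically combined, the sketch below builds a FIM string in prefix-suffix-middle (PSM) order. The PSM ordering is an assumption borrowed from the StarCoder-family convention; this card does not say which ordering the project trains with, and both function names are hypothetical.

```python
def fim_training_example(prefix: str, middle: str, suffix: str) -> str:
    """Full training string in PSM order: the middle span comes last."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"


def fim_prompt(prefix: str, suffix: str) -> str:
    """Inference prompt: the model is expected to generate the middle span."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


print(fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))"))
```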

Chat formats (4 tokens)

ChatML as the modern default; Gemma-style retained from v1 for compatibility.

Format	Tokens	Note
ChatML (primary)	`<|im_start|>`, `<|im_end|>`	New in v2
Gemma-style	`<start_of_turn>`, `<end_of_turn>`	Carried over from v1
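
A ChatML-wrapped conversation can be assembled as in the sketch below. It assumes the standard ChatML spellings `<|im_start|>` and `<|im_end|>`; the helper `to_chatml` is hypothetical, and the project's actual chat template is not given in this card.

```python
def to_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Wrap {'role', 'content'} messages in ChatML turn markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # open turn for the model to fill
    return "".join(parts)


chat = to_chatml([{"role": "user", "content": "Tere!"}])
print(chat)
```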

Tool use (2 tokens)

<tool_call>, </tool_call> (carried over from v1).

Reasoning / chain-of-thought (2 tokens) — new for v2

<think>, </think> — DeepSeek-R1 / Qwen3 convention. Reserved up-front so future thinking-style post-training doesn't need to retokenize.
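
Downstream code would typically separate the reasoning span from the final answer. A minimal sketch, assuming a model that emits one `<think>…</think>` block per response (no such model is described in this card, and `split_reasoning` is a hypothetical helper):

```python
def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer); reasoning is empty if no complete block exists."""
    start, end = "<think>", "</think>"
    i = text.find(start)
    j = text.find(end, i + len(start)) if i != -1 else -1
    if i == -1 or j == -1:
        return "", text.strip()
    reasoning = text[i + len(start):j].strip()
    answer = (text[:i] + text[j + len(end):]).strip()
    return reasoning, answer


print(split_reasoning("<think>2 + 2 = 4</think>The answer is 4."))
```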

Multimodal (3 tokens)

<start_of_image>, <end_of_image>, <image_soft_token> (carried over from v1).

Reserved for future (128 tokens)

<unused_0> … <unused_127>: forward-compat slots. v1 had 100; v2 expands to 128. Llama 3 reserves 256, so this is a middle ground at ~0.05% of vocab.
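
The category sizes listed in this section can be tallied against the stated total of 204, and the reserved names generated, with a short script. The dictionary keys below are illustrative labels, not identifiers from the tokenizer files.

```python
# Token counts per category, as given in the section headings above.
USER_DEFINED = {
    "whitespace": 45,
    "starcoder_markers": 16,
    "fim": 4,
    "chat": 4,
    "tool_use": 2,
    "reasoning": 2,
    "multimodal": 3,
    "reserved": 128,
}

reserved = [f"<unused_{i}>" for i in range(USER_DEFINED["reserved"])]
total = sum(USER_DEFINED.values())
share = len(reserved) / 2 ** 18  # fraction of the 262,144-entry vocabulary

print(total, reserved[0], reserved[-1], f"{share:.3%}")
```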

v1 vs v2 special-tokens summary

Category	v1	v2	Δ
Core specials in-vocab (unk/bos/eos/pad)	❌ pad at vocab+0	✅ pad=3	fixed
Whitespace family	❌ none	✅ 45 tokens	new
StarCoder code markers	❌ none	✅ 16 tokens	new
FIM	3 tokens	4 tokens	+ `<fim_pad>`
ChatML	❌	✅	new
Gemma chat	✅	✅	kept
Reasoning	❌	✅ `<think>`/`</think>`	new
Tool / multimodal	✅	✅	kept
Reserved slots	100	128	+28

Usage

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k-v2", use_fast=True)

Or with SentencePiece directly:

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")
ids = sp.EncodeAsIds("გამარჯობა მსოფლიო")  # "Hello world" in Georgian

Files

tokenizer.model — SentencePiece BPE model (4.4 MB)
tokenizer.vocab — vocabulary listing
special_tokens_map.json — HF special tokens map
tokenizer_config.json — HF tokenizer config

Citation

Built for the OpenEuroLLM project (Horizon Europe). Source repo: https://github.com/OpenEuroLLM/tokenizer.
