OpenEuroLLM Tokenizer v2 (256k)

SentencePiece BPE tokenizer for the OpenEuroLLM flagship models. v2 is a full retrain of the v1 256k tokenizer on a larger, cleaner corpus, with the language list driven by the canonical training-data-catalogue/languages file (no hardcoded list this time; see v1's Georgian-gap incident).

Highlights vs SOTA: multi-domain eval

Evaluation is a held-out 5-domain suite (8,600 samples total), designed to test more than Wikipedia prose. Each column is mean tokens-per-whitespace-word (lower = better).

Tokenizer                          Vocab    Overall  FLORES-200           Code      Math           Chat      PDFs
                                                     (36 langs parallel)  (Python)  (LaTeX+GSM8K)  (ChatML)  (5 langs)
OpenEuroLLM v2 256k (this model)   262,144  1.90 🥇  1.79 🥇              3.24      1.91           1.44 🥇   2.31
GPT-OSS 20B                        200,000  2.07     2.07                 2.62      1.64           1.54      2.01
OpenEuroLLM v2 128k                131,072  2.09     2.00                 3.32      1.92           1.52      2.43
Gemma 3 4B                         256,000  2.19     2.13                 3.20      1.92           1.59      2.27
Mistral Nemo                       131,072  2.23     2.20                 2.84      1.92           1.62      2.26
EuroLLM 9B                         128,000  2.30     2.21                 3.79      2.02           1.57      2.48
OpenEuroLLM v1 256k (predecessor)  262,144  2.45     2.43                 3.33      1.95           1.68      2.29
DeepSeek V3                        128,000  2.47     2.51                 2.83      1.65           1.65      2.13
OpenEuroLLM v1 128k                131,072  2.62     2.62                 3.42      1.98           1.76      2.40
Llama 3.1 8B                       128,256  2.68     2.78                 2.60 🥇   1.65 🥇        1.65      2.18
Qwen 3 8B                          151,936  2.70     2.78                 2.64      1.90           1.51      2.35

Eval composition: 7,200 FLORES (36 langs × 200 parallel sentences, held-out) · 500 Python (codeparrot) · 200 MATH+GSM8K · 200 OpenAssistant chat (ChatML-wrapped) · 500 FinePDFs (5 langs).
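The metric in the table above, tokens per whitespace-delimited word, could be computed roughly as in the sketch below. Two things here are assumptions rather than details from the evaluation itself: the score is aggregated at corpus level (total tokens divided by total words), and `byte_encode` is a hypothetical stand-in encoder used only to keep the example self-contained.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Corpus-level fertility: total tokens / total whitespace-delimited words.

    Assumption: corpus-level aggregation; a per-sample mean is another valid reading.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())  # split() with no args splits on any whitespace
    if total_words == 0:
        raise ValueError("no whitespace-delimited words in input")
    return total_tokens / total_words


def byte_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


samples = ["def f(): return 1", "hello world"]
score = tokens_per_word(samples, byte_encode)
print(f"{score:.2f} tokens per word")
```

To score a real tokenizer, pass something like `lambda s: tok.encode(s, add_special_tokens=False)` as `encode` instead of the byte-level stand-in.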

Summary of v2-256k vs the field:

  • #1 overall (1.90): beats every other tokenizer in the comparison on the multi-domain average.
  • #1 on multilingual prose by a wide margin (FLORES 1.79 vs Gemma 2.13, GPT-OSS 2.07).
  • #1 on chat (1.44, ChatML tokens working as intended).
  • Competitive on math/PDF (within 0.3 of the leader).
  • Loses on Python code (3.24 vs Llama 2.60). Llama 3's tiktoken-based BPE is more code-aggressive; even the v2 whitespace tokens don't fully close that gap.

v2 vs v1: per-language deltas on FLORES-200

FLORES-200 has parallel sentences across all languages (semantically equivalent translations), so fertility differences here are a pure tokenizer effect (no content drift). Both vocabulary sizes (256k and 128k) are compared; a negative delta means v2 is better. English is listed first, then the remaining languages alphabetically.

Language                      v1 256k   v2 256k   Δ256k     v1 128k   v2 128k   Δ128k
English (en)                  1.24      1.21      −0.03     1.29      1.23      −0.06
Albanian (sq)                 2.26      1.59      −0.67     2.44      1.76      −0.68
Basque (eu)                   2.05      1.90      −0.15     2.28      2.12      −0.17
Bosnian (bs)                  1.66      1.60      −0.07     1.84      1.78      −0.06
Bulgarian (bg)                1.78      1.87      +0.09     1.95      2.13      +0.18
Catalan (ca)                  1.64      1.57      −0.07     1.77      1.70      −0.07
Croatian (hr)                 1.72      1.63      −0.10     1.91      1.82      −0.09
Czech (cs)                    1.55      1.79      +0.24     1.72      2.04      +0.31
Danish (da)                   1.62      1.54      −0.07     1.76      1.69      −0.07
Dutch (nl)                    1.62      1.53      −0.09     1.77      1.68      −0.09
Estonian (et)                 2.15      2.04      −0.11     2.41      2.30      −0.11
Finnish (fi)                  2.42      2.30      −0.11     2.71      2.59      −0.13
French (fr)                   1.60      1.53      −0.06     1.74      1.67      −0.07
Galician (gl)                 1.50      1.44      −0.06     1.64      1.58      −0.06
Georgian (ka)                 22.93     2.83      −20.10    22.93     3.30      −19.63
German (de)                   1.48      1.68      +0.19     1.61      1.86      +0.25
Greek (el)                    2.24      2.12      −0.11     2.63      2.44      −0.20
Hungarian (hu)                2.15      2.06      −0.09     2.44      2.33      −0.12
Icelandic (is)                2.01      1.84      −0.17     2.21      2.05      −0.16
Irish (ga)                    1.71      1.60      −0.11     1.91      1.79      −0.12
Italian (it)                  1.35      1.51      +0.17     1.45      1.66      +0.21
Latvian (lv)                  3.01      1.94      −1.07     3.18      2.20      −0.98
Lithuanian (lt)               2.04      1.99      −0.05     2.30      2.27      −0.03
Macedonian (mk)               1.81      1.89      +0.08     1.99      2.13      +0.13
Maltese (mt)                  2.34      2.22      −0.12     2.59      2.47      −0.12
Norwegian (no)                1.55      1.52      −0.03     1.69      1.65      −0.04
Polish (pl)                   1.74      1.90      +0.17     1.95      2.16      +0.21
Portuguese (pt)               1.52      1.45      −0.07     1.66      1.60      −0.06
Romanian (ro)                 1.76      1.58      −0.17     1.92      1.75      −0.17
Serbian (sr)                  2.03      1.97      −0.06     2.20      2.23      +0.02
Slovak (sk)                   1.86      1.91      +0.05     2.04      2.12      +0.08
Slovene (sl)                  1.78      1.74      −0.04     1.97      1.93      −0.04
Spanish (es)                  1.47      1.41      −0.06     1.60      1.54      −0.05
Swedish (sv)                  1.70      1.66      −0.04     1.85      1.81      −0.04
Turkish (tr)                  2.12      1.92      −0.20     2.40      2.16      −0.24
Ukrainian (uk)                2.16      2.13      −0.03     2.42      2.46      +0.04
Average (36 catalogue langs)  2.43      1.79      −0.64     2.62      2.00      −0.62

v2-256k improves on 29/36 languages. Biggest wins: Georgian −20.10 (v1 was full byte-fallback; v2 has real script subwords), Latvian −1.07, Albanian −0.67. The seven regressions (bg/cs/de/it/mk/pl/sk) are small (+0.05 to +0.24) and reflect that v2 spreads vocabulary across more code/whitespace coverage. Note: lb/ru/cy aren't tested here; they're not in the catalogue and were dropped from v2's training scope.
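The size of the Georgian gap follows from simple arithmetic, sketched below. The assumption (labelled in the code) is that a tokenizer with no Georgian pieces emits about one byte-fallback token per UTF-8 byte, ignoring any word-boundary marker; Georgian letters each take 3 bytes in UTF-8.

```python
def byte_fallback_token_count(word: str) -> int:
    """Tokens emitted if every character falls back to one token per UTF-8 byte.

    Assumption: full byte fallback, no word-boundary token counted.
    """
    return len(word.encode("utf-8"))


word = "გამარჯობა"  # Georgian "hello", 9 letters
letters = len(word)
fallback_tokens = byte_fallback_token_count(word)
print(letters, fallback_tokens)  # 9 27
```

At roughly three tokens per letter, a v1 fertility above 20 tokens per word is what you would expect for words of seven or eight letters.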

Language coverage

36 catalogue languages: bg, bs, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, is, it, ka (new in v2), lt, lv, mk, mt, nl, no, pl, pt, ro, sk, sl, sq, sr, sv, tr, uk.

Languages removed from v2 vs v1: lb (Luxembourgish), ru (Russian), cy (Welsh), none of which are in the OpenEuroLLM catalogue.

Training details

  • Algorithm: SentencePiece BPE
  • Vocab size: 262,144 (2^18)
  • Normalization: identity (lossless)
  • Byte fallback: enabled
  • Corpus: 500 GB streamed/sampled from the OpenEuroLLM "baby" cycle release shards on LUMI (10 May 2026 packer), spanning dclm, nemotron-cc, finepdfs, finepdfs-edu, olmo-mix (wiki/arxiv/pes2o), starcoder, finemath-4plus, megamath (text-code-block, web-pro), hplt-3.0, nemotron-cc-opus-1.1, nemotron-cc-tower+-0.1.
  • Mix: ~70% English / ~7% code / ~5% math / ~18% other-langs (35 catalogue non-English languages, equal allocation).
  • Character coverage: 0.9995
  • Max piece length: 16
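The settings listed above map onto SentencePiece trainer options roughly as follows. This is a sketch under stated assumptions, not the project's actual training script: the input path and model prefix are hypothetical placeholders, and corpus sampling, the data mix, and the special-token setup are left out.

```python
from typing import Any, Dict


def trainer_kwargs(input_path: str, model_prefix: str) -> Dict[str, Any]:
    """Collect the documented settings as SentencePieceTrainer keyword arguments."""
    return {
        "input": input_path,                   # hypothetical corpus file
        "model_prefix": model_prefix,          # hypothetical output prefix
        "model_type": "bpe",
        "vocab_size": 262_144,                 # 2**18
        "normalization_rule_name": "identity", # lossless: no Unicode normalization
        "byte_fallback": True,
        "character_coverage": 0.9995,
        "max_sentencepiece_length": 16,
    }


def train(input_path: str, model_prefix: str) -> None:
    """Run training; not called here because it needs a large corpus on disk."""
    import sentencepiece as spm  # lazy import; requires `pip install sentencepiece`

    spm.SentencePieceTrainer.train(**trainer_kwargs(input_path, model_prefix))


kwargs = trainer_kwargs("corpus.txt", "tokenizer")
print(kwargs["vocab_size"] == 2 ** 18)  # True
```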

Special tokens

Core (locked at fixed IDs, in-vocab):

Token ID
<unk> 0
<bos> 1
<eos> 2
<pad> 3

This fixes a v1 bug where <pad> was tacked on at vocab_size+0 (262144), out-of-vocab for downstream code that asserts pad_token_id < vocab_size.
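The downstream check that v1 tripped can be sketched as below. The helper name is hypothetical; the v2 IDs and the v1 pad ID are the ones stated above, and the other v1 IDs are omitted because they are not listed here.

```python
from typing import Dict, List


def out_of_vocab_specials(special_ids: Dict[str, int], vocab_size: int) -> List[str]:
    """Return the special tokens whose ID falls outside [0, vocab_size)."""
    return [name for name, idx in special_ids.items() if not 0 <= idx < vocab_size]


VOCAB_SIZE = 262_144
v2_ids = {"<unk>": 0, "<bos>": 1, "<eos>": 2, "<pad>": 3}
v1_ids = {"<pad>": 262_144}  # v1 appended <pad> at vocab_size + 0

print(out_of_vocab_specials(v2_ids, VOCAB_SIZE))  # []
print(out_of_vocab_specials(v1_ids, VOCAB_SIZE))  # ['<pad>']
```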

User-defined symbols (204 total): new for v2

Whitespace family: better code efficiency (45 tokens)

Multi-character whitespace runs are reserved as single tokens so code with deep indentation doesn't burn a token per space. None of these were in v1. Pretrained tokenizers without them tokenize a 32-space indent as 16+ tokens, whereas v2 does it in 1.

Bucket               Tokens                                               Count
Multi-space indents  "  " (2 spaces), "   " (3), … " "*32 (32)            31 tokens
Tabs                 "\t", "\t\t", … "\t"*8                               8 tokens
Multi-newline        "\n\n", "\n\n\n", "\n\n\n\n"                         3 tokens
Common code combos   "\r\n", "\t\n", "    \n" (4-space-then-newline)      3 tokens

Example:

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k-v2")
>>> python_code = "def f():\n        return 1"  # 8-space indent
>>> # v2 encodes the 8-space indent as ONE token; tokenizers without
>>> # whitespace-run tokens spend several tokens on the same indent.
>>> len(tok.encode(python_code, add_special_tokens=False))
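The 45-token whitespace family can be enumerated programmatically, which makes the bucket counts easy to verify. The function name is hypothetical, and the sketch only builds the strings; it makes no claim about how they were registered with the trainer.

```python
from typing import List


def whitespace_family() -> List[str]:
    """Build the four whitespace buckets described in the table."""
    spaces = [" " * n for n in range(2, 33)]    # runs of 2..32 spaces -> 31 tokens
    tabs = ["\t" * n for n in range(1, 9)]      # runs of 1..8 tabs    -> 8 tokens
    newlines = ["\n" * n for n in range(2, 5)]  # runs of 2..4 newlines -> 3 tokens
    combos = ["\r\n", "\t\n", "    \n"]         # CRLF, tab+newline, 4 spaces+newline
    return spaces + tabs + newlines + combos


tokens = whitespace_family()
print(len(tokens))  # 45
```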

StarCoder-style code corpus markers (16 tokens)

Reserved for code-corpus-formatted inputs (<filename>foo.py\n... <file_sep>\n<reponame>OpenEuroLLM/x\n...), the biggest single contributor to StarCoder2's code quality:

<filename>, <reponame>, <file_sep>, <gh_stars>, <empty_output>, <issue_start>, <issue_comment>, <issue_closed>, <jupyter_start>, <jupyter_text>, <jupyter_code>, <jupyter_output>, <jupyter_script>, <commit_before>, <commit_msg>, <commit_after>.

v1 had none of these.
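As an illustration of how such markers wrap a repository, here is a minimal sketch. The exact layout is an assumption made for this example (repo name first, then each file introduced by <filename> and separated by <file_sep>); it is not a specification of the OpenEuroLLM or StarCoder2 data format, and `format_repo` is a hypothetical helper.

```python
from typing import List, Tuple


def format_repo(repo: str, files: List[Tuple[str, str]]) -> str:
    """Render (path, code) pairs into one marker-delimited string.

    Assumed layout, for illustration only.
    """
    rendered = [f"<filename>{path}\n{code}" for path, code in files]
    return f"<reponame>{repo}\n" + "<file_sep>\n".join(rendered)


doc = format_repo(
    "OpenEuroLLM/x",
    [("foo.py", "print('a')\n"), ("bar.py", "print('b')\n")],
)
print(doc)
```

Because each marker is a single reserved token, this metadata costs one token per marker rather than several subword pieces.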

Fill-in-the-middle (FIM): code completion (4 tokens)

Full 4-token FIM set for code-completion training. v1 had 3 (missing <fim_pad>).

<fim_prefix>, <fim_middle>, <fim_suffix>, <fim_pad>
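A common way to use these tokens is the prefix-suffix-middle (PSM) ordering popularised by StarCoder-style models, sketched below. Whether OpenEuroLLM training uses PSM or another ordering is not stated here, so treat the layout as an assumption; `fim_prompt` is a hypothetical helper.

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Build a PSM-ordered prompt; the model is expected to generate the middle."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


prompt = fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")
print(prompt)
```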

Chat formats (4 tokens)

ChatML as the modern default; Gemma-style retained from v1 for compatibility.

Format            Tokens                          Note
ChatML (primary)  <|im_start|>, <|im_end|>        New in v2
Gemma-style       <start_of_turn>, <end_of_turn>  Carried over from v1
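ChatML conventionally wraps each turn as <|im_start|>role, a newline, the content, then <|im_end|>. The sketch below renders a conversation that way; the trailing newline after each turn and the generation-prompt suffix follow common practice and are assumptions, not something this tokenizer mandates.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} messages in ChatML form."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to answer next
    return "".join(parts)


chat = to_chatml([{"role": "user", "content": "Tere!"}])
print(chat)
```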

Tool use (2 tokens)

<tool_call>, </tool_call> (carried over from v1).

Reasoning / chain-of-thought (2 tokens): new for v2

<think>, </think>: DeepSeek-R1 / Qwen3 convention. Reserved up front so future thinking-style post-training doesn't need to retokenize.
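Downstream code usually needs to separate the reasoning span from the final answer. A minimal sketch, assuming at most one <think>...</think> block per response (the helper name is hypothetical):

```python
import re
from typing import Tuple

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is present."""
    match = _THINK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>\nThe answer is 4.")
print(reasoning, "|", answer)
```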

Multimodal (3 tokens)

<start_of_image>, <end_of_image>, <image_soft_token> (carried over from v1).

Reserved for future (128 tokens)

<unused_0> … <unused_127>: forward-compat slots. v1 had 100; v2 expands to 128 (Llama 3 reserves 256; this is a middle ground at ~0.05% of vocab).
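The per-category counts above can be tallied to confirm the 204-symbol total and the ~0.05% figure. The dictionary keys are informal labels chosen for this sketch.

```python
VOCAB_SIZE = 262_144

# Token counts per category, as stated in the sections above.
USER_DEFINED = {
    "whitespace": 45,
    "starcoder_markers": 16,
    "fim": 4,
    "chat": 4,
    "tool_use": 2,
    "reasoning": 2,
    "multimodal": 3,
    "reserved": 128,
}

total = sum(USER_DEFINED.values())
reserved_names = [f"<unused_{i}>" for i in range(USER_DEFINED["reserved"])]
reserved_share_pct = 100 * USER_DEFINED["reserved"] / VOCAB_SIZE

print(total)                         # 204
print(round(reserved_share_pct, 2))  # 0.05
```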

v1 vs v2 special-tokens summary

Category                                  v1                 v2                    Δ
Core specials in-vocab (unk/bos/eos/pad)  ❌ pad at vocab+0  ✅ pad=3              fixed
Whitespace family                         ❌ none            ✅ 45 tokens          new
StarCoder code markers                    ❌ none            ✅ 16 tokens          new
FIM                                       3 tokens           4 tokens              + <fim_pad>
ChatML                                    ❌                 ✅                    new
Gemma chat                                ✅                 ✅                    kept
Reasoning                                 ❌                 ✅ <think>/</think>   new
Tool / multimodal                         ✅                 ✅                    kept
Reserved slots                            100                128                   +28

Usage

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openeurollm/tokenizer-256k-v2", use_fast=True)

Or with SentencePiece directly:

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")
ids = sp.EncodeAsIds("გამარჯობა მსოფლიო")  # "Hello world" in Georgian

Files

  • tokenizer.model: SentencePiece BPE model (4.4 MB)
  • tokenizer.vocab: vocabulary listing
  • special_tokens_map.json: HF special tokens map
  • tokenizer_config.json: HF tokenizer config

Citation

Built for the OpenEuroLLM project (Horizon Europe). Source repo: https://github.com/OpenEuroLLM/tokenizer.
