---
language:
  - en
  - de
  - fr
  - es
  - pt
  - it
  - nl
  - pl
  - ro
  - cs
  - sv
  - da
  - 'no'
  - fi
  - hu
  - hr
  - bg
  - tr
  - ca
  - ru
  - uk
  - sr
  - zh
  - ja
  - ko
  - ar
  - fa
  - he
  - hi
  - bn
  - th
  - vi
  - ka
  - hy
  - el
  - yi
  - ur
  - ta
  - te
  - gu
  - pa
  - ml
  - kn
  - am
  - si
  - my
  - km
  - mr
  - ne
  - or
  - bo
  - dv
  - eu
  - gl
  - gd
  - et
  - sk
  - lt
  - sl
  - lv
  - af
  - sq
  - sw
  - is
  - tl
  - cy
  - ga
  - br
  - la
  - mk
  - id
  - code
license: apache-2.0
library_name: tokenizers
tags:
  - tokenizer
  - bpe
  - multilingual
  - code
  - quartz
  - aenea
  - coding
  - python
  - flores
pipeline_tag: text-generation
---

# QT_V.2 Code 114K — Multilingual Coding Tokenizer

The lowest total token count on our 66-test field benchmark of any tokenizer at any vocabulary size. A 114,688-token vocabulary optimised for multilingual coding models, trained with doubled code weight (37% of the corpus), including 450K high-quality Python functions from CodeSearchNet. Beats Llama 3, Tekken, and Qwen 2.5 on total tokens while using 10–37% less vocabulary. Validated on FLORES-200 across 204 languages.

Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.

## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT Code 114K | QT 96K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 13,007,924 | 12,961,617 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 43.3× | 31.6× | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.03 | 3.94 | 4.18 | 5.72 | 5.34 | 4.91 |

QT Code 114K uses 22.4% fewer tokens than Llama 3 and 9.8% fewer than Tekken across all 204 FLORES languages — with 10–37% less vocabulary.
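Fertility (tokens emitted per word) and the equity ratio (how much worse the worst-served language fares than the best-served one) can be reproduced from per-language token counts. A minimal sketch of the standard computation; the token and word counts below are illustrative placeholders, not measured FLORES values:

```python
# Sketch: computing per-language fertility and the equity ratio.
# The counts here are made up for illustration only.

def fertility(token_count: int, word_count: int) -> float:
    """Tokens emitted per whitespace-delimited word."""
    return token_count / word_count

# Hypothetical (tokens, words) totals for three languages
corpora = {
    "english": (22_000, 10_000),
    "japanese": (35_000, 10_000),
    "tibetan": (46_000, 10_000),
}

fertilities = {lang: fertility(t, w) for lang, (t, w) in corpora.items()}

# Equity ratio: highest fertility divided by lowest fertility
equity_ratio = max(fertilities.values()) / min(fertilities.values())
print(fertilities, round(equity_ratio, 2))
```

A lower equity ratio means tokenization cost is spread more evenly across languages, which is what the table above is measuring.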

## Key FLORES Languages (tok/word)

| Language | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Japanese | 32.1 | 38.9 | 41.3 | 35.8 |
| Tibetan | 46.5 | 149.8 | 168.4 | 98.0 |
| Sinhala | 3.58 | 11.37 | 16.60 | 9.17 |
| Amharic | 3.40 | 11.95 | 11.98 | 6.45 |
| Georgian | 3.46 | 15.47 | 3.93 | 8.33 |
| Odia | 4.10 | 16.90 | 18.30 | 13.65 |

## Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| Total tokens | 3,314 (lowest of any tokenizer) |
| vs Llama 3 (128K) | 41.2% fewer tokens |
| vs Tekken (131K) | 23.8% fewer tokens |
| vs Qwen 2.5 (152K) | 36.1% fewer tokens |

### Code Performance

| Language | QT Code | QT 96K | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|---|---|
| Python | 110 | 115 | 125 | 97 | 112 | 105 |
| JavaScript | 67 | 71 | 71 | 65 | 69 | 64 |
| Rust | 111 | 113 | 117 | 108 | 111 | 107 |

Python compression improved from 125 (64K) to 115 (96K) to 110 (Code 114K) — closing the gap versus Llama 3's 97 from 28.9% to 13.4%.
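The quoted gap percentages follow from the table directly; a quick arithmetic check (variable names are mine, the token counts are from the table above):

```python
# Verify the compression-gap figures quoted above.
llama3_python = 97   # Llama 3 token count on the Python tests
qt_64k = 125         # QT_V.2 64K
qt_code_114k = 110   # QT_V.2 Code 114K

def gap(ours: int, baseline: int) -> float:
    """Relative excess over the baseline, in percent."""
    return (ours - baseline) / baseline * 100

print(round(gap(qt_64k, llama3_python), 1))        # 28.9
print(round(gap(qt_code_114k, llama3_python), 1))  # 13.4
```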

### Category Totals (lower is better)

| Category | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Natural Languages (20) | 1,033 | 1,599 | 1,038 | 1,535 |
| V1 Expansion (14) | 662 | 1,758 | 1,092 | 1,509 |
| V2 New Scripts (3) | 188 | 692 | 740 | 523 |
| Celtic / Brythonic (8) | 312 | 391 | 341 | 384 |
| Code (3) | 288 | 270 | 292 | 276 |
| **TOTAL (66 tests)** | **3,314** | **5,639** | **4,347** | **5,183** |

## When to Use This Variant

QT_V.2 Code 114K is designed for multilingual coding assistants and code generation models. It wins Natural Languages outright (1,033 — beating Tekken's 1,038) while offering competitive code compression. Ideal for models that must serve both code and diverse natural language users.

Also available: QT_V.2 64K (smallest embedding) · QT_V.2 96K (best all-round)

## Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer file shipped with this repository
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode(
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n-1) + fibonacci(n-2)"
)
print(encoded.tokens)
```

## Specifications

| Spec | Value |
|---|---|
| Vocabulary | 114,688 |
| Languages | 71 natural + 15 code (incl. CodeSearchNet) |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.24 chars |
| Compression | 3.60 chars/token |
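Single-digit splitting means runs of digits are broken apart before BPE merges apply, so numbers tokenize one digit at a time regardless of their value. A minimal sketch of the idea; this regex is an illustration, not the exact Llama 3 pretokenizer pattern:

```python
import re

# Sketch of the "single-digit splitting" pretokenization rule:
# each digit becomes its own piece, while runs of non-digits stay intact.
def split_digits(text: str) -> list[str]:
    return re.findall(r"\d|\D+", text)

print(split_digits("sum=1234"))  # ['sum=', '1', '2', '3', '4']
```

Keeping digits separate gives arithmetic strings a predictable token count instead of depending on which multi-digit chunks happen to exist in the vocabulary.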

## Training

Byte-level BPE with Llama 3 regex pretokenizer. Code-heavy corpus:

| Category | Share | Sources |
|---|---|---|
| Wikipedia | 37.3% | 71 languages (wiki_ultra_clean v7.3) |
| Code | 37.4% | 14 languages + CodeSearchNet Python (450K functions) |
| Stack Exchange | 25.3% | 49 sites (se_ultra_clean v1) |

## Files

`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`

## Contact

- Open-source: quartzopensource@gmail.com
- Commercial licensing & enterprise: commercial@aeneaglobal.com

## License

Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```