---
language:
  - en
  - de
  - fr
  - es
  - pt
  - it
  - nl
  - pl
  - ro
  - cs
  - sv
  - da
  - 'no'
  - fi
  - hu
  - hr
  - bg
  - tr
  - ca
  - ru
  - uk
  - sr
  - zh
  - ja
  - ko
  - ar
  - fa
  - he
  - hi
  - bn
  - th
  - vi
  - ka
  - hy
  - el
  - yi
  - ur
  - ta
  - te
  - gu
  - pa
  - ml
  - kn
  - am
  - si
  - my
  - km
  - mr
  - ne
  - or
  - bo
  - dv
  - eu
  - gl
  - gd
  - et
  - sk
  - lt
  - sl
  - lv
  - af
  - sq
  - sw
  - is
  - tl
  - cy
  - ga
  - br
  - la
  - mk
  - id
  - code
license: apache-2.0
library_name: tokenizers
tags:
  - tokenizer
  - bpe
  - multilingual
  - code
  - quartz
  - aenea
  - coding
  - python
  - flores
pipeline_tag: text-generation
---

# QT_V.2 Code 114K — Multilingual Coding Tokenizer

The lowest total token count on our 66-test field benchmark of any tokenizer at any vocabulary size. A 114,688-token vocabulary optimised for multilingual coding models, trained with doubled code weight (37% of the corpus), including 450K high-quality Python functions from CodeSearchNet. Beats Llama 3, Tekken, and Qwen 2.5 on total tokens while using 10–37% less vocabulary. Validated on FLORES-200 across 204 languages.

Part of the QT_V.2 tokenizer family by Quartz Data Infrastructure, the open data layer behind AENEA.

## FLORES-200 Results (204 Languages · 1,012 Parallel Sentences)

| Metric | QT Code 114K | QT 96K | QT 64K | Llama 3 (128K) | Tekken (131K) | Qwen 2.5 (152K) |
|---|---|---|---|---|---|---|
| Total tokens | 13,007,924 | 12,961,617 | 13,592,357 | 16,764,198 | 14,421,539 | 15,425,680 |
| Equity ratio | 43.3× | 31.6× | 41.0× | 118.6× | 127.9× | 77.7× |
| Mean fertility | 4.03 | 3.94 | 4.18 | 5.72 | 5.34 | 4.91 |

QT Code 114K uses 22.4% fewer tokens than Llama 3 and 9.8% fewer than Tekken across all 204 FLORES languages — with 10–37% less vocabulary.
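Fertility (tokens emitted per word) and the equity ratio (how much worse the worst-served language fares than the best-served one) can be reproduced from per-language token counts. A minimal sketch of the standard computation; the token and word counts below are illustrative placeholders, not measured FLORES values:

```python
# Sketch: computing per-language fertility and the equity ratio.
# The counts here are made up for illustration only.

def fertility(token_count: int, word_count: int) -> float:
    """Tokens emitted per whitespace-delimited word."""
    return token_count / word_count

# Hypothetical (tokens, words) totals for three languages
corpora = {
    "english": (22_000, 10_000),
    "japanese": (35_000, 10_000),
    "tibetan": (46_000, 10_000),
}

fertilities = {lang: fertility(t, w) for lang, (t, w) in corpora.items()}

# Equity ratio: highest fertility divided by lowest fertility
equity_ratio = max(fertilities.values()) / min(fertilities.values())
print(fertilities, round(equity_ratio, 2))
```

A lower equity ratio means tokenization cost is spread more evenly across languages, which is what the table above is measuring.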

## Key FLORES Languages (tok/word)

| Language | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Japanese | 32.1 | 38.9 | 41.3 | 35.8 |
| Tibetan | 46.5 | 149.8 | 168.4 | 98.0 |
| Sinhala | 3.58 | 11.37 | 16.60 | 9.17 |
| Amharic | 3.40 | 11.95 | 11.98 | 6.45 |
| Georgian | 3.46 | 15.47 | 3.93 | 8.33 |
| Odia | 4.10 | 16.90 | 18.30 | 13.65 |

## Field Benchmark (66 Tests)

| Metric | Value |
|---|---|
| Total tokens | 3,314 (lowest of any tokenizer) |
| vs Llama 3 (128K) | 41.2% fewer tokens |
| vs Tekken (131K) | 23.8% fewer tokens |
| vs Qwen 2.5 (152K) | 36.1% fewer tokens |

### Code Performance

| Language | QT Code | QT 96K | QT 64K | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|---|---|
| Python | 110 | 115 | 125 | 97 | 112 | 105 |
| JavaScript | 67 | 71 | 71 | 65 | 69 | 64 |
| Rust | 111 | 113 | 117 | 108 | 111 | 107 |

Python compression improved from 125 (64K) to 115 (96K) to 110 (Code 114K) — closing the gap versus Llama 3's 97 from 28.9% to 13.4%.
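The quoted gap percentages follow from the table directly; a quick arithmetic check (variable names are mine, the token counts are from the table above):

```python
# Verify the compression-gap figures quoted above.
llama3_python = 97   # Llama 3 token count on the Python tests
qt_64k = 125         # QT_V.2 64K
qt_code_114k = 110   # QT_V.2 Code 114K

def gap(ours: int, baseline: int) -> float:
    """Relative excess over the baseline, in percent."""
    return (ours - baseline) / baseline * 100

print(round(gap(qt_64k, llama3_python), 1))        # 28.9
print(round(gap(qt_code_114k, llama3_python), 1))  # 13.4
```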

### Category Totals (lower is better)

| Category | QT Code | Llama 3 | Tekken | Qwen 2.5 |
|---|---|---|---|---|
| Natural Languages (20) | 1,033 | 1,599 | 1,038 | 1,535 |
| V1 Expansion (14) | 662 | 1,758 | 1,092 | 1,509 |
| V2 New Scripts (3) | 188 | 692 | 740 | 523 |
| Celtic / Brythonic (8) | 312 | 391 | 341 | 384 |
| Code (3) | 288 | 270 | 292 | 276 |
| **TOTAL (66 tests)** | **3,314** | **5,639** | **4,347** | **5,183** |

## When to Use This Variant

QT_V.2 Code 114K is designed for multilingual coding assistants and code generation models. It wins Natural Languages outright (1,033 — beating Tekken's 1,038) while offering competitive code compression. Ideal for models that must serve both code and diverse natural language users.

Also available: QT_V.2 64K (smallest embedding) · QT_V.2 96K (best all-round)

## Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer file shipped with this repository
tok = Tokenizer.from_file("tokenizer.json")

encoded = tok.encode(
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n-1) + fibonacci(n-2)"
)
print(encoded.tokens)
```

## Specifications

| Spec | Value |
|---|---|
| Vocabulary | 114,688 |
| Languages | 71 natural + 15 code (incl. CodeSearchNet) |
| Script families | 26 |
| Pretokenizer | Llama 3 regex |
| Arithmetic | Single-digit splitting |
| Max token length | 15 chars |
| Avg token length | 6.24 chars |
| Compression | 3.60 chars/token |
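Single-digit splitting means runs of digits are broken apart before BPE merges apply, so numbers tokenize one digit at a time regardless of their value. A minimal sketch of the idea; this regex is an illustration, not the exact Llama 3 pretokenizer pattern:

```python
import re

# Sketch of the "single-digit splitting" pretokenization rule:
# each digit becomes its own piece, while runs of non-digits stay intact.
def split_digits(text: str) -> list[str]:
    return re.findall(r"\d|\D+", text)

print(split_digits("sum=1234"))  # ['sum=', '1', '2', '3', '4']
```

Keeping digits separate gives arithmetic strings a predictable token count instead of depending on which multi-digit chunks happen to exist in the vocabulary.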

## Training

Byte-level BPE with Llama 3 regex pretokenizer. Code-heavy corpus:

| Category | Share | Sources |
|---|---|---|
| Wikipedia | 37.3% | 71 languages (wiki_ultra_clean v7.3) |
| Code | 37.4% | 14 languages + CodeSearchNet Python (450K functions) |
| Stack Exchange | 25.3% | 49 sites (se_ultra_clean v1) |

## Files

`tokenizer.json` · `vocab.json` · `merges.txt` · `training_report.json`

## Contact

- Open-source: quartzopensource@gmail.com
- Commercial licensing & enterprise: commercial@aeneaglobal.com

## License

Apache 2.0 — Copyright 2025-2026 AENEA Global Ltd

```bibtex
@misc{qt_v2_2026,
  title={QT_V.2: A Multilingual BPE Tokenizer Family},
  author={AENEA Global Ltd},
  year={2026},
  url={https://quartz.host},
}
```