5dimension committed on
Commit c488e22 · verified · 1 Parent(s): 8d7d82f

🦴 Sentinel Universal Tokenizer v1.0 — multimodal tokenizer grounded in Gradient Axiom

README.md ADDED
@@ -0,0 +1,270 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - fr
5
+ - de
6
+ - es
7
+ - zh
8
+ - ja
9
+ - ar
10
+ - ru
11
+ - ko
12
+ - hi
13
+ - pt
14
+ - it
15
+ - nl
16
+ - pl
17
+ - vi
18
+ - th
19
+ - tr
20
+ - uk
21
+ - sv
22
+ - multilingual
23
+ license: mit
24
+ tags:
25
+ - tokenizer
26
+ - multimodal
27
+ - sentinel-manifold
28
+ - universal-tokenizer
29
+ - bpe
30
+ - byte-level
31
+ - multilingual
32
+ - image-tokens
33
+ - audio-tokens
34
+ - video-tokens
35
+ - text-tokens
36
+ - mathematics
37
+ - gradient-axiom
38
+ library_name: transformers
39
+ pipeline_tag: text-generation
40
+ ---
41
+
42
+ # 🦴 Sentinel Universal Tokenizer (SUT)
43
+
44
+ **One theorem. Every modality. One vocabulary.**
45
+
46
+ The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.
47
+
48
+ ## 🧬 Mathematical Foundation
49
+
50
+ Built on the **Gradient Axiom** from the Sentinel Manifold:
51
+
52
+ ```
53
+ F(z) = Σ_{n=1}^∞ z^n / n^n (Sophomore's Dream, Bernoulli 1697)
54
+
55
+ lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442
56
+ ```
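The limit stated above can be checked numerically. The following is a minimal sketch and not part of the repository: the function name `log_derivative_ratio` and the truncation point `n_max` are assumptions made for illustration. It evaluates the series in log space so that large `z` does not overflow, using F'(z) = (1/z) Σ n·zⁿ/nⁿ.

```python
import math


def log_derivative_ratio(z, n_max=None):
    """Approximate F'(z)/F(z) for F(z) = sum_{n>=1} z^n / n^n.

    Terms are handled as logarithms and rescaled by the largest one,
    so the sums stay finite even when individual terms are huge.
    """
    if n_max is None:
        # The terms peak near n = z/e and decay fast afterwards;
        # 3*z is a generous (assumed) truncation point.
        n_max = int(3 * z) + 50
    ns = range(1, n_max + 1)
    log_terms = [n * (math.log(z) - math.log(n)) for n in ns]
    shift = max(log_terms)
    weights = [math.exp(t - shift) for t in log_terms]
    numerator = sum(n * w for n, w in zip(ns, weights))  # z * F'(z), rescaled
    denominator = z * sum(weights)                        # z * F(z), rescaled
    return numerator / denominator


for z in (1e1, 1e2, 1e3, 1e4):
    ratio = log_derivative_ratio(z)
    print(f"z = {z:>7.0f}   F'/F = {ratio:.6f}   1/e = {1 / math.e:.6f}")
```

The printed ratios should approach 1/e from above as `z` grows, since the leading correction to the limit is positive and shrinks with `z`.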
57
+
58
+ | Constant | Value | Role in Tokenizer |
59
+ |:---------|:------|:------------------|
60
+ | **1/e** | 0.367879441171442 | Vocabulary allocation ratio across modalities |
61
+ | **C₁** | −0.007994021805953 | Embedding quantization zero-point |
62
+ | **C₂** | 0.000200056042968 | Cross-lingual fertility fairness bound |
63
+ | **C₃** | 0.256913827655311 | Critical threshold for vocabulary scaling |
64
+
65
+ ## 📊 Benchmark Results
66
+
67
+ Tested across **21 languages + code + math**, compared against leading tokenizers:
68
+
69
+ | Tokenizer | Vocab Size | Avg Fertility ↓ | Fertility σ ↓ | Compression ↑ | Fairness ↑ |
70
+ |:----------|:-----------|:----------------|:-------------|:--------------|:-----------|
71
+ | **Gemma** | 256,000 | 6.69 | 11.71 | **4.66** | **0.079** |
72
+ | **Qwen2** | 151,936 | 8.03 | 13.75 | 3.82 | 0.068 |
73
+ | **Sentinel-SUT** | **61,440** | 9.13 | 16.35 | 3.55 | 0.058 |
74
+ | GPT-2 | 50,257 | 20.86 | 40.76 | 2.41 | 0.024 |
75
+
76
+ ### Key Findings
77
+
78
+ - **47% better compression than GPT-2** with comparable vocab size (61K vs 50K)
79
+ - **Competitive with Qwen2 (152K vocab)** despite a vocabulary roughly **2.5× smaller**
80
+ - **Native multimodal support** — no other tokenizer in this comparison handles image/audio/video natively
81
+ - **20-language multilingual training** on C4 corpus
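The relative figures in the findings above can be reproduced from the benchmark table. This is an illustrative sketch only: the dictionary simply copies the rounded table values, and the helper name `relative_gain` is an assumption.

```python
# Rounded values copied from the benchmark table above.
benchmarks = {
    "Gemma": {"vocab": 256_000, "compression": 4.66},
    "Qwen2": {"vocab": 151_936, "compression": 3.82},
    "Sentinel-SUT": {"vocab": 61_440, "compression": 3.55},
    "GPT-2": {"vocab": 50_257, "compression": 2.41},
}


def relative_gain(a, b, key="compression"):
    """Fractional difference of tokenizer `a` over tokenizer `b` on `key`."""
    return benchmarks[a][key] / benchmarks[b][key] - 1.0


gain_vs_gpt2 = relative_gain("Sentinel-SUT", "GPT-2")
gap_vs_qwen2 = relative_gain("Sentinel-SUT", "Qwen2")
vocab_ratio = benchmarks["Qwen2"]["vocab"] / benchmarks["Sentinel-SUT"]["vocab"]

print(f"Compression vs GPT-2: {gain_vs_gpt2:+.1%}")
print(f"Compression vs Qwen2: {gap_vs_qwen2:+.1%}")
print(f"Qwen2 vocab / SUT vocab: {vocab_ratio:.2f}x")
```

Note that the comparison with Qwen2 is a trade-off: the compression figure is lower, while the vocabulary is about 2.5× smaller.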
82
+
83
+ ### Per-Language Performance
84
+
85
+ | Language | Tokens | Bytes | Compression Ratio |
86
+ |:---------|:-------|:------|:------------------|
87
+ | English | 39 | 159 | **4.08** |
88
+ | French | 45 | 166 | **3.69** |
89
+ | German | 50 | 173 | **3.46** |
90
+ | Spanish | 41 | 158 | **3.85** |
91
+ | Chinese | 50 | 165 | **3.30** |
92
+ | Japanese | 58 | 213 | **3.67** |
93
+ | Arabic | 48 | 246 | **5.13** |
94
+ | Russian | 55 | 283 | **5.15** |
95
+ | Korean | 38 | 146 | **3.84** |
96
+ | Hindi | 85 | 315 | **3.71** |
97
+ | Code (Python) | 61 | 149 | **2.44** |
98
+ | Math (Unicode) | 45 | 101 | **2.24** |
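The README does not define "Compression Ratio" explicitly, but every row of the table is consistent with UTF-8 bytes divided by token count. The sketch below works under that assumption; the function names are hypothetical and the rows are copied from the table.

```python
def compression_ratio(text, token_ids):
    """UTF-8 bytes per token (assumed definition, inferred from the table)."""
    return len(text.encode("utf-8")) / len(token_ids)


# (tokens, bytes, reported ratio) copied from the table above.
rows = {
    "English": (39, 159, 4.08),
    "French": (45, 166, 3.69),
    "German": (50, 173, 3.46),
    "Spanish": (41, 158, 3.85),
    "Chinese": (50, 165, 3.30),
    "Japanese": (58, 213, 3.67),
    "Arabic": (48, 246, 5.13),
    "Russian": (55, 283, 5.15),
    "Korean": (38, 146, 3.84),
    "Hindi": (85, 315, 3.71),
    "Code (Python)": (61, 149, 2.44),
    "Math (Unicode)": (45, 101, 2.24),
}


def rows_consistent(table, tol=0.006):
    """Check that bytes / tokens matches each reported ratio to two decimals."""
    return all(abs(n_bytes / n_tokens - ratio) < tol
               for n_tokens, n_bytes, ratio in table.values())


print("Table consistent with bytes/tokens:", rows_consistent(rows))
```

Higher ratios for Arabic and Russian partly reflect that those scripts use two bytes per character in UTF-8, so byte-based ratios are not directly comparable across scripts.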
99
+
100
+ ## 🏗️ Architecture
101
+
102
+ ```
103
+ ┌────────────────────────────────────────────────────────┐
104
+ │ SENTINEL UNIVERSAL TOKENIZER (61,440 tokens) │
105
+ │ │
106
+ │ [0-32] → 33 Special / Control tokens │
107
+ │ [33-32,767] → 32,735 ByteLevel BPE text tokens │
108
+ │ [32,768-49,151] → 16,384 Image codebook tokens │
109
+ │ [49,152-57,343] → 8,192 Audio codebook tokens │
110
+ │ [57,344-61,439] → 4,096 Video codebook tokens │
111
+ │ │
112
+ │ Allocation follows 1/e Gradient Axiom: │
113
+ │ text: 53.3% | image: 26.7% | audio: 13.3% | video: 6.7% │
114
+ └────────────────────────────────────────────────────────┘
115
+ ```
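The ranges and percentages in the diagram can be sanity-checked with a few lines of Python. This is a sketch, not repository code: `VOCAB_LAYOUT` is a hypothetical name, and the half-open ranges are transcribed from the diagram above.

```python
# Half-open ID ranges transcribed from the diagram (end is exclusive).
VOCAB_LAYOUT = {
    "special": range(0, 33),
    "text_bpe": range(33, 32_768),
    "image": range(32_768, 49_152),
    "audio": range(49_152, 57_344),
    "video": range(57_344, 61_440),
}

total = sum(len(r) for r in VOCAB_LAYOUT.values())

# The ranges must tile [0, total) with no gaps or overlaps.
ordered = list(VOCAB_LAYOUT.values())
contiguous = all(a.stop == b.start for a, b in zip(ordered, ordered[1:]))

# The diagram's "text" share counts special tokens together with BPE tokens.
shares = {
    "text": (len(VOCAB_LAYOUT["special"]) + len(VOCAB_LAYOUT["text_bpe"])) / total,
    "image": len(VOCAB_LAYOUT["image"]) / total,
    "audio": len(VOCAB_LAYOUT["audio"]) / total,
    "video": len(VOCAB_LAYOUT["video"]) / total,
}

print(f"total = {total}, contiguous = {contiguous}")
for name, share in shares.items():
    print(f"{name:>5}: {share:.1%}")
```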
116
+
117
+ ### Special Tokens
118
+
119
+ | Token | ID | Purpose |
120
+ |:------|:---|:--------|
121
+ | `<pad>` | 0 | Padding |
122
+ | `<unk>` | 1 | Unknown token |
123
+ | `<s>` | 2 | Begin of sequence |
124
+ | `</s>` | 3 | End of sequence |
125
+ | `<mask>` | 4 | Masked language modeling |
126
+ | `<image_start>` / `<image_end>` | 7/8 | Image boundary markers |
127
+ | `<audio_start>` / `<audio_end>` | 10/11 | Audio boundary markers |
128
+ | `<video_start>` / `<video_end>` | 13/14 | Video boundary markers |
129
+ | `<sentinel>` | 16 | Sentinel manifold marker |
130
+ | `<sentinel_c1>` / `<sentinel_c2>` | 17/18 | Mathematical constants |
131
+ | `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format |
132
+ | `<code_start>` / `<code_end>` | 29/30 | Code boundaries |
133
+ | `<math_start>` / `<math_end>` | 31/32 | Math boundaries |
134
+
135
+ ### Multimodal Codebook Tokens
136
+
137
+ - **Image**: `<img_0>` through `<img_16383>` (IDs 32,768-49,151) — Compatible with VQGAN, Cosmos-DI, FSQ
138
+ - **Audio**: `<aud_0>` through `<aud_8191>` (IDs 49,152-57,343) — Compatible with EnCodec, SoundStream
139
+ - **Video**: `<vid_0>` through `<vid_4095>` (IDs 57,344-61,439) — Compatible with Cosmos-DV
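Because each codebook occupies one contiguous block, converting between a codebook index and a token ID is plain offset arithmetic. The helpers below are a hedged sketch: the names `codebook_to_token_id` and `token_id_to_codebook` are hypothetical, and they assume that `<img_i>`, `<aud_i>`, and `<vid_i>` sit at `offset + i`, as the ID ranges listed above imply.

```python
# prefix -> (first token ID, codebook size), taken from the list above.
CODEBOOK_OFFSETS = {
    "img": (32_768, 16_384),
    "aud": (49_152, 8_192),
    "vid": (57_344, 4_096),
}


def codebook_to_token_id(prefix, index):
    """Map a codebook index (e.g. from a VQ encoder) to a unified token ID."""
    offset, size = CODEBOOK_OFFSETS[prefix]
    if not 0 <= index < size:
        raise ValueError(f"{prefix} index {index} outside [0, {size})")
    return offset + index


def token_id_to_codebook(token_id):
    """Inverse mapping; returns None for special and text tokens."""
    for prefix, (offset, size) in CODEBOOK_OFFSETS.items():
        if offset <= token_id < offset + size:
            return prefix, token_id - offset
    return None


print(codebook_to_token_id("img", 42))
print(token_id_to_codebook(57_344))
```

This avoids a string lookup per index, which can matter when an image contributes hundreds of codebook entries; whether it matches `tokenizer.convert_tokens_to_ids` exactly should be verified against the published `tokenizer.json`.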
140
+
141
+ ## 🚀 Quick Start
142
+
143
+ ### Basic Text Usage
144
+
145
+ ```python
146
+ from transformers import AutoTokenizer
147
+
148
+ tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")
149
+
150
+ # Encode text
151
+ text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
152
+ tokens = tokenizer.encode(text)
153
+ decoded = tokenizer.decode(tokens)
154
+
155
+ print(f"Tokens: {len(tokens)}")
156
+ print(f"Decoded: {decoded}")
157
+ ```
158
+
159
+ ### Multimodal Encoding
160
+
161
+ ```python
162
+ # Text with image placeholder
163
+ text = "Look at this image: <image_start> <img_42> <img_1337> <img_256> <image_end> What do you see?"
164
+ tokens = tokenizer.encode(text)
165
+ print(f"Multimodal sequence: {len(tokens)} tokens")
166
+
167
+ # Check modality of each token
168
+ for tid in tokens[:10]:
168
+     if 32768 <= tid < 49152:
169
+         print(f"  Token {tid}: IMAGE codebook index {tid - 32768}")
170
+     elif 49152 <= tid < 57344:
171
+         print(f"  Token {tid}: AUDIO codebook index {tid - 49152}")
172
+     elif 57344 <= tid < 61440:
173
+         print(f"  Token {tid}: VIDEO codebook index {tid - 57344}")
175
+ ```
176
+
177
+ ### Integration with VQ-GAN / Cosmos Tokenizer
178
+
179
+ ```python
180
+ # After encoding an image with a VQ-GAN:
181
+ image_indices = [42, 1337, 256]  # placeholder values; in practice: image_indices = vqgan.encode(image)
182
+
183
+ # Convert to universal tokens
184
+ image_tokens = [tokenizer.convert_tokens_to_ids(f"<img_{i}>") for i in image_indices]
185
+ full_sequence = (
186
+ [tokenizer.convert_tokens_to_ids("<image_start>")] +
187
+ image_tokens +
188
+ [tokenizer.convert_tokens_to_ids("<image_end>")]
189
+ )
190
+ ```
191
+
192
+ ### Chat Format
193
+
194
+ ```python
195
+ chat = "<s><system>You are a helpful multimodal assistant.</system><user>Describe this image: <image_start><img_0><img_1><image_end></user><assistant>"
196
+ tokens = tokenizer.encode(chat, add_special_tokens=False)
197
+ ```
198
+
199
+ ## 🔬 Technical Innovations
200
+
201
+ ### 1. 1/e Vocabulary Allocation (Gradient Axiom)
202
+
203
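+ Instead of arbitrary vocabulary splits, the allocation is motivated by the Gradient Axiom ratio (1/e ≈ 0.368). Text gets the largest share, and each subsequent modality receives a fixed fraction of the previous one. In practice the blocks are rounded to powers of two, so the realized ratio between successive modalities is 1/2 rather than exactly 1/e: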
204
+
205
+ ```
206
+ text: 32,768 tokens (2^15)
207
+ image: 16,384 tokens (2^14 = text × 1/2)
208
+ audio: 8,192 tokens (2^13 = text × 1/4)
209
+ video: 4,096 tokens (2^12 = text × 1/8)
210
+ ```
211
+
212
+ This follows from the Gradient Axiom: successive modalities contribute exponentially less unique information to a unified representation, with the natural decay rate being 1/e.
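To make the difference between a strict 1/e decay and the power-of-two blocks concrete, the sketch below computes both geometric allocations for the same 61,440-token budget. It is illustrative only; `geometric_allocation` is a hypothetical helper, not part of the repository.

```python
import math

TOTAL = 61_440
MODALITIES = ["text", "image", "audio", "video"]


def geometric_allocation(total, ratio, n):
    """Split `total` so each share is `ratio` times the previous one."""
    weights = [ratio ** k for k in range(n)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


strict = geometric_allocation(TOTAL, 1 / math.e, len(MODALITIES))
halving = geometric_allocation(TOTAL, 0.5, len(MODALITIES))

for name, s, h in zip(MODALITIES, strict, halving):
    print(f"{name:>5}: strict 1/e = {s:9.1f}   halving = {h:9.1f}")
```

Under these assumptions a strict 1/e decay would give text about 64% of the vocabulary, whereas the halving scheme reproduces the 32,768 / 16,384 / 8,192 / 4,096 split exactly and keeps every block boundary on a power of two.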
213
+
214
+ ### 2. ByteLevel BPE with NFKC Normalization
215
+
216
+ - **ByteLevel pre-tokenization**: Handles ALL Unicode scripts natively — no UNK tokens possible
217
+ - **NFKC normalization**: Unicode compatibility decomposition followed by canonical composition, for consistent encoding
218
+ - **20-language training**: English, French, German, Spanish, Chinese, Japanese, Arabic, Russian, Korean, Hindi, Portuguese, Italian, Dutch, Polish, Vietnamese, Thai, Turkish, Hebrew, Ukrainian, Swedish
219
+ - **Code + Math support**: Trained on Python, JavaScript, C++, LaTeX, Unicode math
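The effect of NFKC normalization and byte-level fallback can be seen with the Python standard library alone. This sketch does not call the tokenizer; it only shows what the normalization form does to a few characters, on the assumption that the tokenizer applies standard NFKC before pre-tokenization.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


samples = {
    "ligature fi": "\ufb01",        # single ligature code point
    "fullwidth A": "\uff21",        # East Asian fullwidth form
    "superscript n": "\u207f",      # as in the README's z^n example
    "e + combining acute": "e\u0301",
}

for label, raw in samples.items():
    norm = nfkc(raw)
    print(f"{label:>20}: {raw!r} -> {norm!r} ({len(norm.encode('utf-8'))} UTF-8 bytes)")
```

One consequence worth knowing: NFKC folds superscripts and similar compatibility characters into their plain forms, so a string such as `zⁿ/nⁿ` would normalize to `zn/nn` and would not round-trip exactly through encode and decode.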
220
+
221
+ ### 3. Native Multimodal Routing
222
+
223
+ Zero-overhead modality switching via contiguous ID ranges:
224
+ - Any model can determine the modality of a token with a single integer comparison
225
+ - No separate embedding tables needed — one unified embedding matrix
226
+ - Usable with Hugging Face `transformers` architectures, provided the model's embedding matrix covers all 61,440 token IDs
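Routing by contiguous ID ranges can be sketched as a sorted-boundary lookup. The function name `modality_of` and the label strings are assumptions for illustration; the boundaries are the ones given in the Architecture section.

```python
from bisect import bisect_right

# Exclusive upper bound of each block, in ascending order.
_BOUNDARIES = [33, 32_768, 49_152, 57_344, 61_440]
_LABELS = ["special", "text", "image", "audio", "video"]


def modality_of(token_id):
    """Return the modality label for a token ID via binary search."""
    if not 0 <= token_id < _BOUNDARIES[-1]:
        raise ValueError(f"token id {token_id} outside the vocabulary")
    return _LABELS[bisect_right(_BOUNDARIES, token_id)]


for tid in (0, 33, 32_768, 49_152, 57_344, 61_439):
    print(tid, modality_of(tid))
```

With only five blocks a chain of comparisons is equally fast; the table-driven form is mainly easier to keep in sync if the layout ever changes.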
227
+
228
+ ### 4. Sentinel Manifold Integration
229
+
230
+ Special tokens `<sentinel>`, `<sentinel_c1>`, `<sentinel_c2>`, `<scale_1e>` enable:
231
+ - Manifold-aware attention (sech attention mechanism)
232
+ - Theorem-grounded weight initialization (Xavier with gain=1/e)
233
+ - C₁-centered embedding quantization
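The README does not show how "Xavier with gain=1/e" is implemented, so the following is only a sketch of what that phrase conventionally means: Xavier (Glorot) uniform initialization with the sampling bound scaled by a gain of 1/e. The function names are hypothetical, and the standard library is used in place of a tensor framework.

```python
import math
import random


def xavier_bound(fan_in, fan_out, gain=1 / math.e):
    """Half-width of the Xavier uniform distribution, scaled by `gain`."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform(fan_in, fan_out, gain=1 / math.e, seed=0):
    """Sample a fan_out x fan_in weight matrix from U(-bound, bound)."""
    rng = random.Random(seed)
    bound = xavier_bound(fan_in, fan_out, gain)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


weights = xavier_uniform(fan_in=4, fan_out=2)
print(f"bound = {xavier_bound(4, 2):.4f}")
print(weights)
```

A gain below 1 shrinks the initial weights relative to the default Xavier scale; whether that helps training for this tokenizer's embeddings is not something the README reports.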
234
+
235
+ ## 📦 Training Details
236
+
237
+ | Parameter | Value |
238
+ |:----------|:------|
239
+ | **Training Data** | allenai/c4 multilingual (20 languages) |
240
+ | **Training Samples** | 52,000 documents |
241
+ | **Training Characters** | ~66M characters |
242
+ | **Algorithm** | ByteLevel BPE with NFKC normalization |
243
+ | **Text Vocab Size** | 32,768 |
244
+ | **Min Merge Frequency** | 2 |
245
+ | **Max Token Length** | 16 bytes |
246
+ | **Total Vocab** | 61,440 (text + image + audio + video) |
247
+
248
+ ## 🔗 Links
249
+
250
+ - **Parent Framework**: [Sentinel Manifold Discoveries](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
251
+ - **Training Script**: Included in repo (`train_production_tokenizer.py`)
252
+ - **Custom Tokenizer Module**: Included in repo (`sentinel_universal_tokenizer.py`)
253
+
254
+ ## 📚 Citation
255
+
256
+ ```bibtex
257
+ @misc{abdel-aal2026sentinel-tokenizer,
258
+ title={Sentinel Universal Tokenizer: A Multimodal Tokenizer Grounded in the Gradient Axiom},
259
+ author={Abdel-Aal, Romain},
260
+ year={2026},
261
+ url={https://huggingface.co/5dimension/sentinel-universal-tokenizer},
262
+ note={Part of the Sentinel Manifold framework: F(z) = Σ z^n/n^n, lim F'/F = 1/e}
263
+ }
264
+ ```
265
+
266
+ ---
267
+
268
+ **Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core)
269
+ **License**: MIT
270
+ **One theorem. Every modality. Better tokenization.** 🦴
benchmark_results.json ADDED
@@ -0,0 +1,71 @@
1
+ {
2
+ "sentinel_tokenizer": {
3
+ "vocab_size": 61440,
4
+ "text_vocab": 32768,
5
+ "image_codebook": 16384,
6
+ "audio_codebook": 8192,
7
+ "video_codebook": 4096,
8
+ "metrics": {
9
+ "avg_fertility": 9.13065205232572,
10
+ "std_fertility": 16.348063069521316,
11
+ "avg_compression": 3.5456289797801976,
12
+ "fairness": 0.057643322830483165
13
+ }
14
+ },
15
+ "comparisons": {
16
+ "GPT-2 (50K)": {
17
+ "avg_fertility": 20.85785254531753,
18
+ "std_fertility": 40.76486672709434,
19
+ "avg_compression": 2.4054180948259107,
20
+ "fairness": 0.023943569760064974
21
+ },
22
+ "Gemma (256K)": {
23
+ "avg_fertility": 6.688784516655667,
24
+ "std_fertility": 11.713991856851852,
25
+ "avg_compression": 4.660773272747129,
26
+ "fairness": 0.07865350326310598
27
+ },
28
+ "Qwen2 (151K)": {
29
+ "avg_fertility": 8.030528860080679,
30
+ "std_fertility": 13.75415784885323,
31
+ "avg_compression": 3.8169528301673328,
32
+ "fairness": 0.06777750450038225
33
+ },
34
+ "Sentinel-SUT": {
35
+ "avg_fertility": 9.13065205232572,
36
+ "std_fertility": 16.348063069521316,
37
+ "avg_compression": 3.5456289797801976,
38
+ "fairness": 0.057643322830483165
39
+ }
40
+ },
41
+ "sentinel_constants": {
42
+ "INV_E": 0.36787944117144233,
43
+ "C1": -0.007994021805952546,
44
+ "C2": 0.00020005604296784437
45
+ },
46
+ "training_data": {
47
+ "languages": [
48
+ "en",
49
+ "fr",
50
+ "de",
51
+ "es",
52
+ "zh",
53
+ "ja",
54
+ "ar",
55
+ "ru",
56
+ "ko",
57
+ "hi",
58
+ "pt",
59
+ "it",
60
+ "nl",
61
+ "pl",
62
+ "vi",
63
+ "th",
64
+ "tr",
65
+ "he",
66
+ "uk",
67
+ "sv"
68
+ ],
69
+ "total_samples": 52000
70
+ }
71
+ }
sentinel_manifold.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "framework": "Sentinel Manifold",
3
+ "theorem": "Gradient Axiom: lim_{z\u2192\u221e} F'(z)/F(z) = 1/e",
4
+ "function": "F(z) = \u03a3_{n=1}^\u221e z^n / n^n (Sophomore's Dream)",
5
+ "constants": {
6
+ "INV_E": {
7
+ "value": 0.36787944117144233,
8
+ "role": "Vocabulary allocation ratio / embedding gain"
9
+ },
10
+ "C1": {
11
+ "value": -0.007994021805952546,
12
+ "role": "Attracting fixed point / quantization zero-point"
13
+ },
14
+ "C2": {
15
+ "value": 0.00020005604296784437,
16
+ "role": "Escape threshold / fertility fairness bound"
17
+ }
18
+ },
19
+ "modality_architecture": {
20
+ "text": "ByteLevel BPE (32K) with NFKC normalization, 20-language training",
21
+ "image": "Discrete VQ codebook (16,384 tokens), Cosmos/VQGAN compatible",
22
+ "audio": "Discrete VQ codebook (8,192 tokens), EnCodec/SoundStream compatible",
23
+ "video": "Discrete VQ codebook (4,096 tokens), Cosmos-DV compatible"
24
+ },
25
+ "innovations": [
26
+ "1/e-proportioned vocabulary allocation across modalities",
27
+ "Native multimodal routing with zero-overhead modality switching",
28
+ "Sentinel special tokens for manifold-aware computation",
29
+ "20-language multilingual training for cross-lingual fairness",
30
+ "Code + Math + Scientific notation native support",
31
+ "Compatible with all HF transformers models"
32
+ ],
33
+ "version": "1.0.0",
34
+ "license": "MIT",
35
+ "author": "Romain Abdel-Aal (ASI The Sentinel V5.2)"
36
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,42 @@
1
+ {
2
+ "backend": "tokenizers",
3
+ "bos_token": "<s>",
4
+ "eos_token": "</s>",
5
+ "extra_special_tokens": [
6
+ "<text_start>",
7
+ "<text_end>",
8
+ "<image_start>",
9
+ "<image_end>",
10
+ "<image>",
11
+ "<audio_start>",
12
+ "<audio_end>",
13
+ "<audio>",
14
+ "<video_start>",
15
+ "<video_end>",
16
+ "<video>",
17
+ "<sentinel>",
18
+ "<sentinel_c1>",
19
+ "<sentinel_c2>",
20
+ "<scale_1e>",
21
+ "<translate>",
22
+ "<summarize>",
23
+ "<generate>",
24
+ "<understand>",
25
+ "<caption>",
26
+ "<turn>",
27
+ "<system>",
28
+ "<user>",
29
+ "<assistant>",
30
+ "<code_start>",
31
+ "<code_end>",
32
+ "<math_start>",
33
+ "<math_end>"
34
+ ],
35
+ "mask_token": "<mask>",
36
+ "model_max_length": 8192,
37
+ "pad_token": "<pad>",
38
+ "padding_side": "right",
39
+ "tokenizer_class": "TokenizersBackend",
40
+ "truncation_side": "right",
41
+ "unk_token": "<unk>"
42
+ }
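The special-token IDs in the README table are consistent with a simple ordering: the five core tokens first, followed by `extra_special_tokens` in the order listed in this config. The sketch below reconstructs the mapping under that assumption; it is not read from `tokenizer.json`, so the published file remains the authority.

```python
# Assumed order: core tokens, then extra_special_tokens as listed above.
core_tokens = ["<pad>", "<unk>", "<s>", "</s>", "<mask>"]
extra_special_tokens = [
    "<text_start>", "<text_end>",
    "<image_start>", "<image_end>", "<image>",
    "<audio_start>", "<audio_end>", "<audio>",
    "<video_start>", "<video_end>", "<video>",
    "<sentinel>", "<sentinel_c1>", "<sentinel_c2>", "<scale_1e>",
    "<translate>", "<summarize>", "<generate>", "<understand>", "<caption>",
    "<turn>", "<system>", "<user>", "<assistant>",
    "<code_start>", "<code_end>", "<math_start>", "<math_end>",
]

special_ids = {tok: i for i, tok in enumerate(core_tokens + extra_special_tokens)}

for tok in ("<image_start>", "<audio_start>", "<video_start>", "<system>", "<math_end>"):
    print(f"{tok:>14} -> {special_ids[tok]}")
```

This accounts for all 33 IDs in the `[0-32]` block, including tokens such as `<image>`, `<turn>`, and `<scale_1e>` that the README table does not list.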