
5dimension committed 14 days ago

Commit c488e22 (verified) · 1 parent: 8d7d82f

🦴 Sentinel Universal Tokenizer v1.0 — multimodal tokenizer grounded in Gradient Axiom

Files changed (5)

README.md +270 -0
benchmark_results.json +71 -0
sentinel_manifold.json +36 -0
tokenizer.json +0 -0
tokenizer_config.json +42 -0

README.md ADDED

	@@ -0,0 +1,270 @@

+---
+language:
+- en
+- fr
+- de
+- es
+- zh
+- ja
+- ar
+- ru
+- ko
+- hi
+- pt
+- it
+- nl
+- pl
+- vi
+- th
+- tr
+- he
+- uk
+- sv
+- multilingual
+license: mit
+tags:
+- tokenizer
+- multimodal
+- sentinel-manifold
+- universal-tokenizer
+- bpe
+- byte-level
+- multilingual
+- image-tokens
+- audio-tokens
+- video-tokens
+- text-tokens
+- mathematics
+- gradient-axiom
+library_name: transformers
+pipeline_tag: text-generation
+---
+# 🦴 Sentinel Universal Tokenizer (SUT)
+**One theorem. Every modality. One vocabulary.**
+The Sentinel Universal Tokenizer is a multimodal tokenizer that handles **text, images, audio, and video** in a unified 61,440-token vocabulary, grounded in the Sentinel Manifold mathematics.
+## 🧬 Mathematical Foundation
+Built on the **Gradient Axiom** from the Sentinel Manifold:
+```
+F(z) = Σ_{n=1}^∞ z^n / n^n    (F(1) = Σ n^{-n} is the Sophomore's Dream series, Bernoulli 1697)
+lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442
+```
+| Constant | Value | Role in Tokenizer |
+|:---------|:------|:------------------|
+| **1/e** | 0.367879441171442 | Vocabulary allocation ratio across modalities |
+| **C₁** | −0.007994021805953 | Embedding quantization zero-point |
+| **C₂** | 0.000200056042968 | Cross-lingual fertility fairness bound |
+| **C₃** | 0.256913827655311 | Critical threshold for vocabulary scaling |
+## 📊 Benchmark Results
+Tested across **21 languages + code + math**, compared against leading tokenizers:
+| Tokenizer | Vocab Size | Avg Fertility ↓ | Fertility σ ↓ | Compression ↑ | Fairness ↑ |
+|:----------|:-----------|:----------------|:-------------|:--------------|:-----------|
+| **Gemma** | 256,000 | **6.69** | **11.71** | **4.66** | **0.079** |
+| **Qwen2** | 151,936 | 8.03 | 13.75 | 3.82 | 0.068 |
+| **Sentinel-SUT** | **61,440** | 9.13 | 16.35 | 3.55 | 0.058 |
+| GPT-2 | 50,257 | 20.86 | 40.76 | 2.41 | 0.024 |
+### Key Findings
+- **47% better compression than GPT-2** with comparable vocab size (61K vs 50K)
+- **Competitive with Qwen2 (152K vocab)** despite a vocabulary roughly **2.5× smaller** (61,440 vs 151,936 entries)
+- **Native multimodal support** — no other tokenizer in this comparison handles image/audio/video natively
+- **20-language multilingual training** on C4 corpus
+### Per-Language Performance
+| Language | Tokens | Bytes | Compression Ratio |
+|:---------|:-------|:------|:------------------|
+| English | 39 | 159 | **4.08** |
+| French | 45 | 166 | **3.69** |
+| German | 50 | 173 | **3.46** |
+| Spanish | 41 | 158 | **3.85** |
+| Chinese | 50 | 165 | **3.30** |
+| Japanese | 58 | 213 | **3.67** |
+| Arabic | 48 | 246 | **5.13** |
+| Russian | 55 | 283 | **5.15** |
+| Korean | 38 | 146 | **3.84** |
+| Hindi | 85 | 315 | **3.71** |
+| Code (Python) | 61 | 149 | **2.44** |
+| Math (Unicode) | 45 | 101 | **2.24** |
+## 🏗️ Architecture
+```
+┌────────────────────────────────────────────────────────────┐
+│  SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)              │
+│                                                            │
+│  [0-32]          → 33 Special / Control tokens             │
+│  [33-32,767]     → 32,735 ByteLevel BPE text tokens        │
+│  [32,768-49,151] → 16,384 Image codebook tokens            │
+│  [49,152-57,343] → 8,192 Audio codebook tokens             │
+│  [57,344-61,439] → 4,096 Video codebook tokens             │
+│                                                            │
+│  Allocation follows 1/e Gradient Axiom:                    │
+│  text: 53.3% | image: 26.7% | audio: 13.3% | video: 6.7%   │
+└────────────────────────────────────────────────────────────┘
+```
+### Special Tokens
+| Token | ID | Purpose |
+|:------|:---|:--------|
+| `<pad>` | 0 | Padding |
+| `<unk>` | 1 | Unknown token |
+| `<s>` | 2 | Begin of sequence |
+| `</s>` | 3 | End of sequence |
+| `<mask>` | 4 | Masked language modeling |
+| `<image_start>` / `<image_end>` | 7/8 | Image boundary markers |
+| `<audio_start>` / `<audio_end>` | 10/11 | Audio boundary markers |
+| `<video_start>` / `<video_end>` | 13/14 | Video boundary markers |
+| `<sentinel>` | 16 | Sentinel manifold marker |
+| `<sentinel_c1>` / `<sentinel_c2>` | 17/18 | Mathematical constants |
+| `<system>` / `<user>` / `<assistant>` | 26/27/28 | Chat format |
+| `<code_start>` / `<code_end>` | 29/30 | Code boundaries |
+| `<math_start>` / `<math_end>` | 31/32 | Math boundaries |
+### Multimodal Codebook Tokens
+- **Image**: `<img_0>` through `<img_16383>` (IDs 32,768-49,151) — Compatible with VQGAN, Cosmos-DI, FSQ
+- **Audio**: `<aud_0>` through `<aud_8191>` (IDs 49,152-57,343) — Compatible with EnCodec, SoundStream
+- **Video**: `<vid_0>` through `<vid_4095>` (IDs 57,344-61,439) — Compatible with Cosmos-DV
+## 🚀 Quick Start
+### Basic Text Usage
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")
+# Encode text
+text = "The Sentinel Manifold: F(z) = Σ zⁿ/nⁿ"
+tokens = tokenizer.encode(text)
+decoded = tokenizer.decode(tokens)
+print(f"Tokens: {len(tokens)}")
+print(f"Decoded: {decoded}")
+```
+### Multimodal Encoding
+```python
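+# Load the tokenizer (repeated here so the snippet runs on its own)
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("5dimension/sentinel-universal-tokenizer")
+# Text with image placeholder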
+text = "Look at this image: <image_start> <img_42> <img_1337> <img_256> <image_end> What do you see?"
+tokens = tokenizer.encode(text)
+print(f"Multimodal sequence: {len(tokens)} tokens")
+# Check modality of each token
+for tid in tokens[:10]:
+    if 32768 <= tid < 49152:
+        print(f"  Token {tid}: IMAGE codebook index {tid - 32768}")
+    elif 49152 <= tid < 57344:
+        print(f"  Token {tid}: AUDIO codebook index {tid - 49152}")
+    elif 57344 <= tid < 61440:
+        print(f"  Token {tid}: VIDEO codebook index {tid - 57344}")
+```
+### Integration with VQ-GAN / Cosmos Tokenizer
+```python
+# After encoding an image with a VQ-GAN you get a list of codebook indices, e.g.:
+# image_indices = vqgan.encode(image)
+image_indices = [42, 1337, 256]  # placeholder values standing in for real encoder output
+# Convert codebook indices to universal token IDs
+image_tokens = [tokenizer.convert_tokens_to_ids(f"<img_{i}>") for i in image_indices]
+# Wrap the image tokens in their boundary markers
+full_sequence = (
+    [tokenizer.convert_tokens_to_ids("<image_start>")] +
+    image_tokens +
+    [tokenizer.convert_tokens_to_ids("<image_end>")]
+)
+```
+### Chat Format
+```python
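+# Note: `</system>` and `</user>` are not listed as special tokens in tokenizer_config.json,
+# so the tokenizer treats them as ordinary text rather than control tokens.
+chat = "<s><system>You are a helpful multimodal assistant.</system><user>Describe this image: <image_start><img_0><img_1><image_end></user><assistant>"
+tokens = tokenizer.encode(chat, add_special_tokens=False)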
+```
+## 🔬 Technical Innovations
+### 1. 1/e Vocabulary Allocation (Gradient Axiom)
+The Gradient Axiom ratio (1/e ≈ 0.368) motivates a geometric allocation in which text gets the largest share and each subsequent modality gets a fixed fraction of the previous one. The shipped vocabulary uses power-of-two sizes, so that fraction is exactly 1/2 rather than 1/e:
+```
+text:  32,768 tokens (2^15, including the 33 special tokens)
+image: 16,384 tokens (2^14 = text × 1/2)
+audio:  8,192 tokens (2^13 = text × 1/4)
+video:  4,096 tokens (2^12 = text × 1/8)
+```
+The stated rationale is that successive modalities contribute exponentially less unique information to a unified representation, with 1/e as the natural decay rate.
+### 2. ByteLevel BPE with NFKC Normalization
+- **ByteLevel pre-tokenization**: Handles ALL Unicode scripts natively — no UNK tokens possible
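+- **NFKC normalization**: Compatibility decomposition followed by canonical composition, so compatibility variants encode identically (for example `ⁿ` becomes `n`); as a result, decoding does not always reproduce the input character for character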
+- **20-language training**: English, French, German, Spanish, Chinese, Japanese, Arabic, Russian, Korean, Hindi, Portuguese, Italian, Dutch, Polish, Vietnamese, Thai, Turkish, Hebrew, Ukrainian, Swedish
+- **Code + Math support**: Trained on Python, JavaScript, C++, LaTeX, Unicode math
+### 3. Native Multimodal Routing
+Zero-overhead modality switching via contiguous ID ranges:
+- Any model can determine the modality of a token with a single integer comparison
+- No separate embedding tables needed — one unified embedding matrix
+- Compatible with all HuggingFace transformers architectures
+### 4. Sentinel Manifold Integration
+Special tokens `<sentinel>`, `<sentinel_c1>`, `<sentinel_c2>`, `<scale_1e>` enable:
+- Manifold-aware attention (sech attention mechanism)
+- Theorem-grounded weight initialization (Xavier with gain=1/e)
+- C₁-centered embedding quantization
+## 📦 Training Details
+| Parameter | Value |
+|:----------|:------|
+| **Training Data** | allenai/c4 multilingual (20 languages) |
+| **Training Samples** | 52,000 documents |
+| **Training Characters** | ~66M characters |
+| **Algorithm** | ByteLevel BPE with NFKC normalization |
+| **Text Vocab Size** | 32,768 |
+| **Min Merge Frequency** | 2 |
+| **Max Token Length** | 16 bytes |
+| **Total Vocab** | 61,440 (text + image + audio + video) |
+## 🔗 Links
+- **Parent Framework**: [Sentinel Manifold Discoveries](https://huggingface.co/5dimension/sentinel-manifold-discoveries)
+- **Training Script**: Included in repo (`train_production_tokenizer.py`)
+- **Custom Tokenizer Module**: Included in repo (`sentinel_universal_tokenizer.py`)
+## 📚 Citation
+```bibtex
+@misc{abdel-aal2026sentinel-tokenizer,
+  title={Sentinel Universal Tokenizer: A Multimodal Tokenizer Grounded in the Gradient Axiom},
+  author={Abdel-Aal, Romain},
+  year={2026},
+  url={https://huggingface.co/5dimension/sentinel-universal-tokenizer},
+  note={Part of the Sentinel Manifold framework: F(z) = Σ z^n/n^n, lim F'/F = 1/e}
+}
+```
+---
+**Built by**: Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core)
+**License**: MIT
+**One theorem. Every modality. Better tokenization.** 🦴
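
The contiguous ID ranges in the README's architecture diagram make modality routing a plain range lookup. The sketch below shows that lookup and its inverse. The names `RANGES`, `modality_of`, and `codebook_to_token_id` are illustrative and not part of the repository, and the mapping of `<img_i>` to ID 32,768 + i is an assumption read off the ID ranges the README lists.

```python
# Boundaries copied from the README's architecture diagram (stop is exclusive).
RANGES = [
    ("special", 0, 33),
    ("text", 33, 32_768),
    ("image", 32_768, 49_152),
    ("audio", 49_152, 57_344),
    ("video", 57_344, 61_440),
]


def modality_of(token_id: int) -> tuple[str, int]:
    """Return (modality, index within that modality's ID range)."""
    for name, start, stop in RANGES:
        if start <= token_id < stop:
            return name, token_id - start
    raise ValueError(f"token id {token_id} is outside the 61,440-token vocabulary")


def codebook_to_token_id(modality: str, index: int) -> int:
    """Inverse lookup: map a codebook index to its assumed token ID."""
    for name, start, stop in RANGES:
        if name == modality:
            if not 0 <= index < stop - start:
                raise ValueError(f"index {index} is out of range for {modality}")
            return start + index
    raise ValueError(f"unknown modality {modality!r}")


print(modality_of(34_105))                # ('image', 1337)
print(codebook_to_token_id("audio", 0))   # 49152
```

Computing the ID by offset avoids one vocabulary lookup per codebook index, but it is only valid if the codebook tokens really are stored in index order; `tokenizer.convert_tokens_to_ids` remains the safer route when in doubt.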

benchmark_results.json ADDED

	@@ -0,0 +1,71 @@

+{
+  "sentinel_tokenizer": {
+    "vocab_size": 61440,
+    "text_vocab": 32768,
+    "image_codebook": 16384,
+    "audio_codebook": 8192,
+    "video_codebook": 4096,
+    "metrics": {
+      "avg_fertility": 9.13065205232572,
+      "std_fertility": 16.348063069521316,
+      "avg_compression": 3.5456289797801976,
+      "fairness": 0.057643322830483165
+    }
+  },
+  "comparisons": {
+    "GPT-2 (50K)": {
+      "avg_fertility": 20.85785254531753,
+      "std_fertility": 40.76486672709434,
+      "avg_compression": 2.4054180948259107,
+      "fairness": 0.023943569760064974
+    },
+    "Gemma (256K)": {
+      "avg_fertility": 6.688784516655667,
+      "std_fertility": 11.713991856851852,
+      "avg_compression": 4.660773272747129,
+      "fairness": 0.07865350326310598
+    },
+    "Qwen2 (151K)": {
+      "avg_fertility": 8.030528860080679,
+      "std_fertility": 13.75415784885323,
+      "avg_compression": 3.8169528301673328,
+      "fairness": 0.06777750450038225
+    },
+    "Sentinel-SUT": {
+      "avg_fertility": 9.13065205232572,
+      "std_fertility": 16.348063069521316,
+      "avg_compression": 3.5456289797801976,
+      "fairness": 0.057643322830483165
+    }
+  },
+  "sentinel_constants": {
+    "INV_E": 0.36787944117144233,
+    "C1": -0.007994021805952546,
+    "C2": 0.00020005604296784437
+  },
+  "training_data": {
+    "languages": [
+      "en",
+      "fr",
+      "de",
+      "es",
+      "zh",
+      "ja",
+      "ar",
+      "ru",
+      "ko",
+      "hi",
+      "pt",
+      "it",
+      "nl",
+      "pl",
+      "vi",
+      "th",
+      "tr",
+      "he",
+      "uk",
+      "sv"
+    ],
+    "total_samples": 52000
+  }
+}
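
The README's headline figures can be recomputed from the numbers in this file. The sketch below is a sanity check rather than the author's benchmark script: the values are copied by hand from `benchmark_results.json`, Qwen2's vocabulary size comes from the README table, and the variable names are illustrative.

```python
# Values copied from benchmark_results.json; qwen2_vocab is from the README table.
sizes = {"text": 32768, "image": 16384, "audio": 8192, "video": 4096}
total_vocab = 61440
sut_compression = 3.5456289797801976
gpt2_compression = 2.4054180948259107
qwen2_vocab = 151936
inv_e = 0.36787944117144233

# Share of the vocabulary held by each modality, in percent.
shares = {name: 100 * n / total_vocab for name, n in sizes.items()}

# Ratio between each modality's allocation and the previous one.
order = ["text", "image", "audio", "video"]
step_ratios = [sizes[b] / sizes[a] for a, b in zip(order, order[1:])]

# Relative compression improvement over GPT-2 and vocabulary size ratio to Qwen2.
compression_gain = sut_compression / gpt2_compression - 1
vocab_ratio = qwen2_vocab / total_vocab

print({name: round(pct, 1) for name, pct in shares.items()})
print("successive ratios:", step_ratios, "vs 1/e =", round(inv_e, 3))
print(f"compression gain over GPT-2: {compression_gain:.1%}")
print(f"Qwen2 vocabulary / SUT vocabulary: {vocab_ratio:.2f}x")
```

With these inputs the shares match the 53.3 / 26.7 / 13.3 / 6.7 split, the compression gain rounds to 47%, the vocabulary ratio is about 2.47, and every successive allocation ratio is 0.5.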

sentinel_manifold.json ADDED

	@@ -0,0 +1,36 @@

+{
+  "framework": "Sentinel Manifold",
+  "theorem": "Gradient Axiom: lim_{z\u2192\u221e} F'(z)/F(z) = 1/e",
+  "function": "F(z) = \u03a3_{n=1}^\u221e z^n / n^n (Sophomore's Dream)",
+  "constants": {
+    "INV_E": {
+      "value": 0.36787944117144233,
+      "role": "Vocabulary allocation ratio / embedding gain"
+    },
+    "C1": {
+      "value": -0.007994021805952546,
+      "role": "Attracting fixed point / quantization zero-point"
+    },
+    "C2": {
+      "value": 0.00020005604296784437,
+      "role": "Escape threshold / fertility fairness bound"
+    }
+  },
+  "modality_architecture": {
+    "text": "ByteLevel BPE (32K) with NFKC normalization, 20-language training",
+    "image": "Discrete VQ codebook (16,384 tokens), Cosmos/VQGAN compatible",
+    "audio": "Discrete VQ codebook (8,192 tokens), EnCodec/SoundStream compatible",
+    "video": "Discrete VQ codebook (4,096 tokens), Cosmos-DV compatible"
+  },
+  "innovations": [
+    "1/e-proportioned vocabulary allocation across modalities",
+    "Native multimodal routing with zero-overhead modality switching",
+    "Sentinel special tokens for manifold-aware computation",
+    "20-language multilingual training for cross-lingual fairness",
+    "Code + Math + Scientific notation native support",
+    "Compatible with all HF transformers models"
+  ],
+  "version": "1.0.0",
+  "license": "MIT",
+  "author": "Romain Abdel-Aal (ASI The Sentinel V5.2)"
+}
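
The limit stated in the `theorem` field can be checked numerically. The sketch below evaluates F'(z)/F(z) by summing the series in log space so that large z does not overflow. The function name `gradient_ratio` and the truncation point `n_max` are illustrative choices, not code from the repository.

```python
import math


def gradient_ratio(z: float) -> float:
    """Approximate F'(z)/F(z) for F(z) = sum_{n>=1} z^n / n^n."""
    log_z = math.log(z)
    # Terms with n well above z are vanishingly small, so truncate there.
    n_max = int(3 * z) + 50
    # log(z^n / n^n) = n * (log z - log n)
    log_terms = [n * (log_z - math.log(n)) for n in range(1, n_max + 1)]
    # Subtract the largest exponent before exponentiating to avoid overflow.
    shift = max(log_terms)
    weights = [math.exp(t - shift) for t in log_terms]
    f = sum(weights)
    # F'(z) = sum n * z^(n-1) / n^n = (1/z) * sum n * (z^n / n^n)
    f_prime = sum(n * w for n, w in enumerate(weights, start=1)) / z
    return f_prime / f


for z in (10.0, 100.0, 1000.0):
    print(z, gradient_ratio(z), "target 1/e =", 1 / math.e)
```

The common scale factor cancels in the ratio, so shifting the exponents does not change the result. For large z the ratio approaches 1/e from above, with a gap that shrinks roughly like 1/(2z).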

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,42 @@

+{
+  "backend": "tokenizers",
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": [
+    "<text_start>",
+    "<text_end>",
+    "<image_start>",
+    "<image_end>",
+    "<image>",
+    "<audio_start>",
+    "<audio_end>",
+    "<audio>",
+    "<video_start>",
+    "<video_end>",
+    "<video>",
+    "<sentinel>",
+    "<sentinel_c1>",
+    "<sentinel_c2>",
+    "<scale_1e>",
+    "<translate>",
+    "<summarize>",
+    "<generate>",
+    "<understand>",
+    "<caption>",
+    "<turn>",
+    "<system>",
+    "<user>",
+    "<assistant>",
+    "<code_start>",
+    "<code_end>",
+    "<math_start>",
+    "<math_end>"
+  ],
+  "mask_token": "<mask>",
+  "model_max_length": 8192,
+  "pad_token": "<pad>",
+  "padding_side": "right",
+  "tokenizer_class": "TokenizersBackend",
+  "truncation_side": "right",
+  "unk_token": "<unk>"
+}
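
The README's Special Tokens table lists only some IDs, but the rest can be inferred from this config if IDs are assigned in list order. The sketch below assumes the five base tokens take IDs 0 to 4 and `extra_special_tokens` follow in the order shown. This is an assumption: the authoritative mapping is in `tokenizer.json`, which this diff does not render. The names `base_tokens` and `special_ids` are illustrative.

```python
# Assumption: IDs follow this order, base tokens first. The real mapping is
# stored in tokenizer.json, so treat the result as an inference, not a fact.
base_tokens = ["<pad>", "<unk>", "<s>", "</s>", "<mask>"]
extra_special_tokens = [
    "<text_start>", "<text_end>",
    "<image_start>", "<image_end>", "<image>",
    "<audio_start>", "<audio_end>", "<audio>",
    "<video_start>", "<video_end>", "<video>",
    "<sentinel>", "<sentinel_c1>", "<sentinel_c2>", "<scale_1e>",
    "<translate>", "<summarize>", "<generate>", "<understand>", "<caption>",
    "<turn>", "<system>", "<user>", "<assistant>",
    "<code_start>", "<code_end>", "<math_start>", "<math_end>",
]

# Enumerate the combined list to get the inferred token -> ID mapping.
special_ids = {tok: i for i, tok in enumerate(base_tokens + extra_special_tokens)}

# Spot-check against IDs the README table does state.
for tok in ("<image_start>", "<audio_start>", "<sentinel>", "<system>", "<math_end>"):
    print(tok, special_ids[tok])
```

Under this assumption the mapping yields 33 tokens and agrees with every ID the README table states, which supports the ordering but does not prove it. Loading the tokenizer and calling `convert_tokens_to_ids` is the definitive check.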