ggunio committed · Commit 2068c6b · verified · 1 Parent(s): 77a029c

Final update: Complete documentation on semantic boundaries, better examples, clear explanation of the value proposition

Files changed (1)
  1. app.py +105 -35
app.py CHANGED
@@ -164,21 +164,55 @@ with gr.Blocks(title="B2NL v6.2.1", theme=gr.themes.Soft()) as app:
  Unlike traditional tokenizers that use fixed vocabularies, B2NL learns directly from bytes and generates dense embeddings
  that capture semantic meaning while achieving 16:1 compression.

- ### 🎯 Purpose & Applications
-
- This model serves as a **preprocessing layer for inter-modal AI communication**:
- - **LLM Cost Reduction**: 75% fewer tokens = 75% cost savings
- - **Cross-modal Bridge**: Universal embeddings for text↔image↔audio
- - **Multilingual Processing**: 204 languages without language-specific vocabularies
- - **Edge Deployment**: Compressed representations for bandwidth-limited scenarios
-
- ### ⚙️ Technical Details
-
- - **Architecture**: 6-layer encoder + 6-layer decoder (244.7M params)
- - **Compression**: Fixed 16:1 (48 bytes → 3 embedding vectors)
- - **Training**: FLORES-200 dataset (204 languages), 100 epochs
- - **Current Mode**: Autoregressive (teacher forcing) - accurate but slow
- - **Planned Update**: Non-autoregressive training (November 2025) for 10x speedup

  ---
  """)
@@ -241,12 +275,16 @@ with gr.Blocks(title="B2NL v6.2.1", theme=gr.themes.Soft()) as app:
  label="Enter multiple texts (one per line)",
  placeholder="Enter texts in different languages...\nOne text per line",
  lines=10,
- value="""Hello, world!
- 안녕하세요, 반갑습니다.
- 你好世界!
- こんにちは世界!
- Bonjour le monde!
- This is a longer sentence to test how the model handles texts that exceed 48 bytes."""
  )

  batch_btn = gr.Button("🔄 Process Batch", variant="primary")
@@ -259,29 +297,61 @@ This is a longer sentence to test how the model handles texts that exceed 48 byt
  gr.Markdown("""
  ## Understanding B2NL Tokenization

- ### How It Works
-
- 1. **Byte-Level Processing**: Reads text as raw bytes (no vocabulary needed)
- 2. **Chunking**: Divides text into 48-byte chunks
- 3. **Embedding Generation**: Creates 3 dense embedding vectors per chunk
- 4. **Reconstruction**: Decoder reconstructs original text from embeddings
-
- ### Sliding Window for Long Texts
-
- For texts exceeding 48 bytes:
- - First chunk: bytes 0-47
- - Second chunk: bytes 40-87 (8-byte overlap)
- - Third chunk: bytes 80-127 (8-byte overlap)
- - And so on...
-
- This overlap helps maintain context across chunk boundaries.
-
- ### Why Fixed 16:1 Compression?
-
- - **Predictable**: Always 48 bytes → 3 embeddings
- - **Efficient**: Optimal for transformer architecture
- - **Universal**: Works equally well for all languages
- - **Semantic**: Embeddings capture meaning, not just bytes

  ### Current Limitations

  Unlike traditional tokenizers that use fixed vocabularies, B2NL learns directly from bytes and generates dense embeddings
  that capture semantic meaning while achieving 16:1 compression.

+ ### 🔬 How the 16:1 Compression Works
+
+ ```
+ Input: 48 bytes (including padding/special tokens)
+
+ Processing: Byte-level analysis with learned boundaries
+
+ Output: 3 embedding vectors (768-dim each)
+ ```
+
+ **Key Innovation**: The model learns to identify **semantic boundaries** within the 48-byte window.
+ Instead of splitting at arbitrary points, it discovers natural language units (words, morphemes, phrases)
+ and encodes them into meaningful embeddings. This is why "Hello, world!" (13 bytes) still generates
+ 3 embeddings - the model pads to 48 bytes but learns which parts contain actual information.
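+
+ As a minimal illustration of the fixed-window bookkeeping (plain Python, not this app's actual API; the
+ zero-byte padding and the (3, 768) output shape are assumptions based on the description above):
+
+ ```python
+ CHUNK_SIZE = 48           # bytes per window
+ EMBEDDINGS_PER_CHUNK = 3  # vectors produced per window (768-dim each)
+
+ def pad_to_window(text: str) -> bytes:
+     """UTF-8 encode a short text and pad it out to one 48-byte window."""
+     raw = text.encode("utf-8")
+     assert len(raw) <= CHUNK_SIZE, "longer texts use the sliding window described later"
+     return raw + b"\x00" * (CHUNK_SIZE - len(raw))  # padding byte value is an assumption
+
+ window = pad_to_window("Hello, world!")
+ print(len(window))  # 48 - a full window even though the text is only 13 bytes
+ # The encoder maps each 48-byte window to EMBEDDINGS_PER_CHUNK embeddings,
+ # e.g. an array of shape (3, 768).
+ ```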
+
+ ### 🎯 Why This Matters
+
+ 1. **Semantic Preservation**: Unlike byte-pair encoding (BPE), which can split words arbitrarily,
+ B2NL respects semantic boundaries learned from data.
+
+ 2. **Language Agnostic**: No vocabulary needed - works equally well for all 204 languages.
+ Korean "안녕하세요" and English "Hello" are processed the same way.
+
+ 3. **Predictable Costs**: Always 16:1 compression means predictable API costs for LLMs.
+ 48 bytes → 3 embeddings, always (see the worked example after this list).
+
+ 4. **Inter-modal Bridge**: These embeddings can be used as a universal representation
+ for cross-modal tasks (text→image, text→audio, etc.).
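+
+ A worked version of the cost arithmetic behind the "Predictable Costs" point and the 75% figure
+ quoted elsewhere on this page (the ~4 bytes per BPE token average is an assumption used for
+ illustration, not a measurement from this app):
+
+ ```python
+ chunk_bytes = 48
+ b2nl_units = 3                    # embeddings per 48-byte window
+ bpe_tokens = chunk_bytes / 4      # ~12 tokens, assuming a BPE tokenizer averages ~4 bytes/token
+ savings = 1 - b2nl_units / bpe_tokens
+ print(f"{bpe_tokens:.0f} BPE tokens vs {b2nl_units} embeddings -> {savings:.0%} fewer units")
+ # 12 BPE tokens vs 3 embeddings -> 75% fewer units
+ ```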
+
+ ### 🎯 Real-World Applications
+
+ - **LLM Cost Reduction**: 75% fewer tokens = 75% cost savings on API calls
+ - **Multilingual Search**: Single embedding space for 204 languages
+ - **Edge AI**: Compressed representations for bandwidth-limited IoT devices
+ - **Cross-modal AI**: Universal embeddings for multimodal models
+
+ ### ⚙️ Technical Architecture
+
+ - **Encoder**: 6 layers, progressive dimension reduction
+ - **Decoder**: 6 layers with cross-attention, reconstructs from embeddings
+ - **Boundary Learning**: Gumbel-Softmax for differentiable boundary detection (see the sketch after this list)
+ - **Total Parameters**: 244.7M (137.9M encoder + 106.8M decoder)
+ - **Training**: FLORES-200 (204 languages), 100 epochs, teacher forcing
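+
+ The Gumbel-Softmax trick named above can be sketched in a few lines of NumPy. This is a generic
+ illustration, not code from this repository, and the per-byte two-class ("continue" vs. "boundary")
+ framing is an assumption:
+
+ ```python
+ import numpy as np
+
+ def gumbel_softmax(logits: np.ndarray, tau: float = 1.0) -> np.ndarray:
+     """Draw a soft (differentiable) sample that approximates argmax over logits."""
+     uniform = np.random.uniform(1e-9, 1.0, size=logits.shape)
+     gumbel_noise = -np.log(-np.log(uniform))
+     scores = (logits + gumbel_noise) / tau
+     exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
+     return exp / exp.sum(axis=-1, keepdims=True)
+
+ # Toy example: per-byte "continue vs. boundary" logits for an 8-byte window
+ logits = np.random.randn(8, 2)
+ soft = gumbel_softmax(logits, tau=0.5)   # differentiable during training
+ hard = soft.argmax(axis=-1)              # hard boundary decisions at inference
+ print(hard)                              # e.g. [0 0 1 0 0 0 1 0]
+ ```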
+
+ ### ⚠️ Current Limitations
+
+ - **Mode**: Autoregressive (teacher forcing only) - ~500ms per generation
+ - **Long Texts**: Quality decreases for texts > 48 bytes (sliding window limitation)
+ - **Coming Soon**: Non-autoregressive training (November 2025) for 10x speedup

  ---
  """)
 
  label="Enter multiple texts (one per line)",
  placeholder="Enter texts in different languages...\nOne text per line",
  lines=10,
+ value="""The quick brown fox jumps over the lazy dog.
+ 안녕하세요, 반갑습니다. 오늘 날씨가 정말 좋네요.
+ 你好世界!今天天气很好,我们一起去散步吧。
+ こんにちは世界!今日はいい天気ですね。散歩に行きましょう。
+ Bonjour le monde! Comment allez-vous aujourd'hui?
+ مرحبا بالعالم! كيف حالك اليوم؟ الطقس جميل جداً.
+ Привет мир! Как дела? Погода сегодня прекрасная!
+ This text is exactly 48 bytes long for testing!
+ Short text
+ A much longer text that definitely exceeds 48 bytes and will require sliding window processing with 8-byte overlaps between chunks."""
  )

  batch_btn = gr.Button("🔄 Process Batch", variant="primary")
 
  gr.Markdown("""
  ## Understanding B2NL Tokenization

+ ### 🔬 The Core Innovation: Learned Semantic Boundaries
+
+ Traditional tokenizers use fixed rules (BPE, WordPiece) that can split words arbitrarily.
+ B2NL learns to identify **semantic units** within byte sequences:
+
+ ```
+ Traditional BPE: "안녕하세요" → "안", "녕", "하", "세", "요" (5 tokens)
+ B2NL: "안녕하세요" → [emb1, emb2, emb3] (3 embeddings capturing full meaning)
+ ```
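+
+ To make "byte level" concrete, this is what the model actually receives for the Korean example above
+ (plain Python; the per-syllable BPE split shown above is illustrative):
+
+ ```python
+ text = "안녕하세요"
+ raw = text.encode("utf-8")
+ print(len(text), len(raw))  # 5 15 - five characters become 15 bytes (3 bytes per Hangul syllable)
+ print(list(raw[:6]))        # [236, 149, 136, 235, 133, 149] - the raw bytes for "안녕"
+ # B2NL consumes these 15 bytes (padded to 48) and emits 3 embeddings,
+ # rather than looking syllables up in a fixed vocabulary.
+ ```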
+
+ ### 📐 The 48-Byte → 3 Embeddings Architecture
+
+ ```
+ [48 bytes input] → [Encoder] → [3 × 768-dim embeddings] → [Decoder] → [48 bytes output]
+        ↑                                   ↓
+  (with padding)                 (semantic compression)
+ ```
+
+ **Why 48 bytes?**
+ - Optimal for GPU parallelization (divisible by 8, 16, 24)
+ - Captures most words/phrases in any language
+ - Allows consistent 16:1 compression ratio
+
+ **Why 3 embeddings?**
+ - Matches typical semantic units in 48-byte window
+ - Provides redundancy for robust reconstruction
+ - Optimal for transformer cross-attention
+
+ ### 🌐 Language-Agnostic Processing
+
+ The model treats all languages equally at the byte level:
+
+ | Language | Sample Text | Bytes | Embeddings | Compression |
+ |----------|------------|-------|------------|-------------|
+ | English | "Hello" | 5 (+43 pad) | 3 | 16:1 |
+ | Korean | "안녕하세요" | 15 (+33 pad) | 3 | 16:1 |
+ | Chinese | "你好世界" | 12 (+36 pad) | 3 | 16:1 |
+ | Arabic | "مرحبا" | 10 (+38 pad) | 3 | 16:1 |
+
+ All get compressed to 3 embeddings, but the model learns which parts contain information.
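+
+ The byte and padding counts in the table can be verified directly (plain Python; padding simply
+ fills each sample out to the 48-byte window):
+
+ ```python
+ samples = {"English": "Hello", "Korean": "안녕하세요", "Chinese": "你好世界", "Arabic": "مرحبا"}
+ for lang, text in samples.items():
+     n = len(text.encode("utf-8"))
+     print(f"{lang:8} {n:2} bytes (+{48 - n} pad) -> 3 embeddings")
+ # English   5 bytes (+43 pad) -> 3 embeddings
+ # Korean   15 bytes (+33 pad) -> 3 embeddings
+ # Chinese  12 bytes (+36 pad) -> 3 embeddings
+ # Arabic   10 bytes (+38 pad) -> 3 embeddings
+ ```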
+
+ ### 🔄 Sliding Window for Long Texts
+
+ For texts exceeding 48 bytes:
+ ```
+ Text: "This is a very long sentence that exceeds 48 bytes..."
+
+ Chunk 1: [Bytes 0-47]   → 3 embeddings
+            ↓ (8-byte overlap)
+ Chunk 2: [Bytes 40-87]  → 3 embeddings
+            ↓ (8-byte overlap)
+ Chunk 3: [Bytes 80-127] → 3 embeddings
+ ```
+
+ The 8-byte overlap preserves context across boundaries, preventing word splits.
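+
+ A minimal sketch of this sliding-window chunking (48-byte windows with a 40-byte stride, i.e. an
+ 8-byte overlap; it ignores any special tokens the real model may insert):
+
+ ```python
+ CHUNK, STRIDE = 48, 40  # 48-byte windows, 8-byte overlap
+
+ def chunk_bytes(text: str) -> list[bytes]:
+     """Split UTF-8 bytes into overlapping 48-byte windows (0-47, 40-87, 80-127, ...)."""
+     raw = text.encode("utf-8")
+     chunks, start = [], 0
+     while True:
+         chunks.append(raw[start:start + CHUNK])
+         if start + CHUNK >= len(raw):
+             return chunks
+         start += STRIDE
+
+ text = "A much longer text that definitely exceeds 48 bytes and needs sliding-window processing."
+ chunks = chunk_bytes(text)
+ print(len(text.encode("utf-8")), [len(c) for c in chunks])  # 88 [48, 48] -> 2 windows, 6 embeddings
+ ```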
 
 
 

  ### Current Limitations