ggunio committed · Commit 2068c6b · verified · 1 Parent(s): 77a029c

Final update: Complete documentation on semantic boundaries, better examples, clear explanation of the value proposition

Files changed (1)
  1. app.py +105 -35
app.py CHANGED
@@ -164,21 +164,55 @@ with gr.Blocks(title="B2NL v6.2.1", theme=gr.themes.Soft()) as app:
  Unlike traditional tokenizers that use fixed vocabularies, B2NL learns directly from bytes and generates dense embeddings
  that capture semantic meaning while achieving 16:1 compression.

- ### 🎯 Purpose & Applications
-
- This model serves as a **preprocessing layer for inter-modal AI communication**:
- - **LLM Cost Reduction**: 75% fewer tokens = 75% cost savings
- - **Cross-modal Bridge**: Universal embeddings for text↔image↔audio
- - **Multilingual Processing**: 204 languages without language-specific vocabularies
- - **Edge Deployment**: Compressed representations for bandwidth-limited scenarios
-
- ### ⚙️ Technical Details
-
- - **Architecture**: 6-layer encoder + 6-layer decoder (244.7M params)
- - **Compression**: Fixed 16:1 (48 bytes → 3 embedding vectors)
- - **Training**: FLORES-200 dataset (204 languages), 100 epochs
- - **Current Mode**: Autoregressive (teacher forcing) - accurate but slow
- - **Planned Update**: Non-autoregressive training (November 2025) for 10x speedup

  ---
  """)
@@ -241,12 +275,16 @@ with gr.Blocks(title="B2NL v6.2.1", theme=gr.themes.Soft()) as app:
  label="Enter multiple texts (one per line)",
  placeholder="Enter texts in different languages...\nOne text per line",
  lines=10,
- value="""Hello, world!
- 안녕하세요, 반갑습니다.
- 你好世界!
- こんにちは世界!
- Bonjour le monde!
- This is a longer sentence to test how the model handles texts that exceed 48 bytes."""
  )

  batch_btn = gr.Button("🔄 Process Batch", variant="primary")
@@ -259,29 +297,61 @@ This is a longer sentence to test how the model handles texts that exceed 48 byt
  gr.Markdown("""
  ## Understanding B2NL Tokenization

- ### How It Works
-
- 1. **Byte-Level Processing**: Reads text as raw bytes (no vocabulary needed)
- 2. **Chunking**: Divides text into 48-byte chunks
- 3. **Embedding Generation**: Creates 3 dense embedding vectors per chunk
- 4. **Reconstruction**: Decoder reconstructs original text from embeddings
-
- ### Sliding Window for Long Texts
-
- For texts exceeding 48 bytes:
- - First chunk: bytes 0-47
- - Second chunk: bytes 40-87 (8-byte overlap)
- - Third chunk: bytes 80-127 (8-byte overlap)
- - And so on...
-
- This overlap helps maintain context across chunk boundaries.
-
- ### Why Fixed 16:1 Compression?
-
- - **Predictable**: Always 48 bytes → 3 embeddings
- - **Efficient**: Optimal for transformer architecture
- - **Universal**: Works equally well for all languages
- - **Semantic**: Embeddings capture meaning, not just bytes

  ### Current Limitations

  Unlike traditional tokenizers that use fixed vocabularies, B2NL learns directly from bytes and generates dense embeddings
  that capture semantic meaning while achieving 16:1 compression.

+ ### 🔬 How the 16:1 Compression Works
+
+ ```
+ Input: 48 bytes (including padding/special tokens)
+
+ Processing: Byte-level analysis with learned boundaries
+
+ Output: 3 embedding vectors (768-dim each)
+ ```
+
+ **Key Innovation**: The model learns to identify **semantic boundaries** within the 48-byte window.
+ Instead of splitting at arbitrary points, it discovers natural language units (words, morphemes, phrases)
+ and encodes them into meaningful embeddings. This is why "Hello, world!" (13 bytes) still generates
+ 3 embeddings - the model pads to 48 bytes but learns which parts contain actual information.
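+
+ As a minimal illustration of the fixed-window bookkeeping (plain Python, not this app's actual API; the
+ zero-byte padding and the (3, 768) output shape are assumptions based on the description above):
+
+ ```python
+ CHUNK_SIZE = 48           # bytes per window
+ EMBEDDINGS_PER_CHUNK = 3  # vectors produced per window (768-dim each)
+
+ def pad_to_window(text: str) -> bytes:
+     """UTF-8 encode a short text and pad it out to one 48-byte window."""
+     raw = text.encode("utf-8")
+     assert len(raw) <= CHUNK_SIZE, "longer texts use the sliding window described later"
+     return raw + b"\x00" * (CHUNK_SIZE - len(raw))  # padding byte value is an assumption
+
+ window = pad_to_window("Hello, world!")
+ print(len(window))  # 48 - a full window even though the text is only 13 bytes
+ # The encoder maps each 48-byte window to EMBEDDINGS_PER_CHUNK embeddings,
+ # e.g. an array of shape (3, 768).
+ ```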
+
+ ### 🎯 Why This Matters
+
+ 1. **Semantic Preservation**: Unlike byte-pair encoding (BPE), which can split words arbitrarily,
+ B2NL respects semantic boundaries learned from data.
+
+ 2. **Language Agnostic**: No vocabulary needed - works equally well for all 204 languages.
+ Korean "안녕하세요" and English "Hello" are processed the same way.
+
+ 3. **Predictable Costs**: Always 16:1 compression means predictable API costs for LLMs.
+ 48 bytes → 3 embeddings, always (see the worked example after this list).
+
+ 4. **Inter-modal Bridge**: These embeddings can be used as a universal representation
+ for cross-modal tasks (text→image, text→audio, etc.).
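+
+ A worked version of the cost arithmetic behind the "Predictable Costs" point and the 75% figure
+ quoted elsewhere on this page (the ~4 bytes per BPE token average is an assumption used for
+ illustration, not a measurement from this app):
+
+ ```python
+ chunk_bytes = 48
+ b2nl_units = 3                    # embeddings per 48-byte window
+ bpe_tokens = chunk_bytes / 4      # ~12 tokens, assuming a BPE tokenizer averages ~4 bytes/token
+ savings = 1 - b2nl_units / bpe_tokens
+ print(f"{bpe_tokens:.0f} BPE tokens vs {b2nl_units} embeddings -> {savings:.0%} fewer units")
+ # 12 BPE tokens vs 3 embeddings -> 75% fewer units
+ ```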
+
+ ### 🎯 Real-World Applications
+
+ - **LLM Cost Reduction**: 75% fewer tokens = 75% cost savings on API calls
+ - **Multilingual Search**: Single embedding space for 204 languages
+ - **Edge AI**: Compressed representations for bandwidth-limited IoT devices
+ - **Cross-modal AI**: Universal embeddings for multimodal models
+
+ ### ⚙️ Technical Architecture
+
+ - **Encoder**: 6 layers, progressive dimension reduction
+ - **Decoder**: 6 layers with cross-attention, reconstructs from embeddings
+ - **Boundary Learning**: Gumbel-Softmax for differentiable boundary detection (see the sketch after this list)
+ - **Total Parameters**: 244.7M (137.9M encoder + 106.8M decoder)
+ - **Training**: FLORES-200 (204 languages), 100 epochs, teacher forcing
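+
+ The Gumbel-Softmax trick named above can be sketched in a few lines of NumPy. This is a generic
+ illustration, not code from this repository, and the per-byte two-class ("continue" vs. "boundary")
+ framing is an assumption:
+
+ ```python
+ import numpy as np
+
+ def gumbel_softmax(logits: np.ndarray, tau: float = 1.0) -> np.ndarray:
+     """Draw a soft (differentiable) sample that approximates argmax over logits."""
+     uniform = np.random.uniform(1e-9, 1.0, size=logits.shape)
+     gumbel_noise = -np.log(-np.log(uniform))
+     scores = (logits + gumbel_noise) / tau
+     exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
+     return exp / exp.sum(axis=-1, keepdims=True)
+
+ # Toy example: per-byte "continue vs. boundary" logits for an 8-byte window
+ logits = np.random.randn(8, 2)
+ soft = gumbel_softmax(logits, tau=0.5)   # differentiable during training
+ hard = soft.argmax(axis=-1)              # hard boundary decisions at inference
+ print(hard)                              # e.g. [0 0 1 0 0 0 1 0]
+ ```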
+
+ ### ⚠️ Current Limitations
+
+ - **Mode**: Autoregressive (teacher forcing only) - ~500ms per generation
+ - **Long Texts**: Quality decreases for texts > 48 bytes (sliding window limitation)
+ - **Coming Soon**: Non-autoregressive training (November 2025) for 10x speedup

  ---
  """)
 
  label="Enter multiple texts (one per line)",
  placeholder="Enter texts in different languages...\nOne text per line",
  lines=10,
+ value="""The quick brown fox jumps over the lazy dog.
+ 안녕하세요, 반갑습니다. 오늘 날씨가 정말 좋네요.
+ 你好世界!今天天气很好,我们一起去散步吧。
+ こんにちは世界!今日はいい天気ですね。散歩に行きましょう。
+ Bonjour le monde! Comment allez-vous aujourd'hui?
+ مرحبا بالعالم! كيف حالك اليوم؟ الطقس جميل جداً.
+ Привет мир! Как дела? Погода сегодня прекрасная!
+ This text is exactly 48 bytes long for testing!
+ Short text
+ A much longer text that definitely exceeds 48 bytes and will require sliding window processing with 8-byte overlaps between chunks."""
  )

  batch_btn = gr.Button("🔄 Process Batch", variant="primary")
 
  gr.Markdown("""
  ## Understanding B2NL Tokenization

+ ### 🔬 The Core Innovation: Learned Semantic Boundaries
+
+ Traditional tokenizers use fixed rules (BPE, WordPiece) that can split words arbitrarily.
+ B2NL learns to identify **semantic units** within byte sequences:
+
+ ```
+ Traditional BPE: "안녕하세요" → "안", "녕", "하", "세", "요" (5 tokens)
+ B2NL: "안녕하세요" → [emb1, emb2, emb3] (3 embeddings capturing full meaning)
+ ```
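+
+ To make "byte level" concrete, this is what the model actually receives for the Korean example above
+ (plain Python; the per-syllable BPE split shown above is illustrative):
+
+ ```python
+ text = "안녕하세요"
+ raw = text.encode("utf-8")
+ print(len(text), len(raw))  # 5 15 - five characters become 15 bytes (3 bytes per Hangul syllable)
+ print(list(raw[:6]))        # [236, 149, 136, 235, 133, 149] - the raw bytes for "안녕"
+ # B2NL consumes these 15 bytes (padded to 48) and emits 3 embeddings,
+ # rather than looking syllables up in a fixed vocabulary.
+ ```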
+
+ ### 📐 The 48-Byte → 3 Embeddings Architecture
+
+ ```
+ [48 bytes input] → [Encoder] → [3 × 768-dim embeddings] → [Decoder] → [48 bytes output]
+        ↑                                   ↓
+  (with padding)                 (semantic compression)
+ ```
+
+ **Why 48 bytes?**
+ - Optimal for GPU parallelization (divisible by 8, 16, 24)
+ - Captures most words/phrases in any language
+ - Allows consistent 16:1 compression ratio
+
+ **Why 3 embeddings?**
+ - Matches typical semantic units in 48-byte window
+ - Provides redundancy for robust reconstruction
+ - Optimal for transformer cross-attention
+
+ ### 🌐 Language-Agnostic Processing
+
+ The model treats all languages equally at the byte level:
+
+ | Language | Sample Text | Bytes | Embeddings | Compression |
+ |----------|------------|-------|------------|-------------|
+ | English | "Hello" | 5 (+43 pad) | 3 | 16:1 |
+ | Korean | "안녕하세요" | 15 (+33 pad) | 3 | 16:1 |
+ | Chinese | "你好世界" | 12 (+36 pad) | 3 | 16:1 |
+ | Arabic | "مرحبا" | 10 (+38 pad) | 3 | 16:1 |
+
+ All get compressed to 3 embeddings, but the model learns which parts contain information.
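+
+ The byte and padding counts in the table can be verified directly (plain Python; padding simply
+ fills each sample out to the 48-byte window):
+
+ ```python
+ samples = {"English": "Hello", "Korean": "안녕하세요", "Chinese": "你好世界", "Arabic": "مرحبا"}
+ for lang, text in samples.items():
+     n = len(text.encode("utf-8"))
+     print(f"{lang:8} {n:2} bytes (+{48 - n} pad) -> 3 embeddings")
+ # English   5 bytes (+43 pad) -> 3 embeddings
+ # Korean   15 bytes (+33 pad) -> 3 embeddings
+ # Chinese  12 bytes (+36 pad) -> 3 embeddings
+ # Arabic   10 bytes (+38 pad) -> 3 embeddings
+ ```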
+
+ ### 🔄 Sliding Window for Long Texts
+
+ For texts exceeding 48 bytes:
+ ```
+ Text: "This is a very long sentence that exceeds 48 bytes..."
+
+ Chunk 1: [Bytes 0-47]   → 3 embeddings
+            ↓ (8-byte overlap)
+ Chunk 2: [Bytes 40-87]  → 3 embeddings
+            ↓ (8-byte overlap)
+ Chunk 3: [Bytes 80-127] → 3 embeddings
+ ```
+
+ The 8-byte overlap preserves context across boundaries, preventing word splits.
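+
+ A minimal sketch of this sliding-window chunking (48-byte windows with a 40-byte stride, i.e. an
+ 8-byte overlap; it ignores any special tokens the real model may insert):
+
+ ```python
+ CHUNK, STRIDE = 48, 40  # 48-byte windows, 8-byte overlap
+
+ def chunk_bytes(text: str) -> list[bytes]:
+     """Split UTF-8 bytes into overlapping 48-byte windows (0-47, 40-87, 80-127, ...)."""
+     raw = text.encode("utf-8")
+     chunks, start = [], 0
+     while True:
+         chunks.append(raw[start:start + CHUNK])
+         if start + CHUNK >= len(raw):
+             return chunks
+         start += STRIDE
+
+ text = "A much longer text that definitely exceeds 48 bytes and needs sliding-window processing."
+ chunks = chunk_bytes(text)
+ print(len(text.encode("utf-8")), [len(c) for c in chunks])  # 88 [48, 48] -> 2 windows, 6 embeddings
+ ```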
 
 
 

  ### Current Limitations