# B2NL (Byte-to-Natural Language) Tokenizer - Version Evolution

## Executive Summary
B2NL is a byte-level tokenizer that learns segmentation from data rather than relying on a fixed vocabulary. The evolution from v6.1.1 to v6.1.3 has centered on compression: v6.1.2 achieves 18.6:1 average compression (measured with best_model.pt on 6 languages), and v6.1.3 targets higher ratios across 204 languages.
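To make the ratio concrete: 18.6:1 compression means roughly 18.6 input bytes per emitted token. A minimal illustration of how such a ratio is computed (the token count below is a made-up example, not model output):

```python
# Hypothetical illustration: compression ratio = UTF-8 bytes in / tokens out
text = "안녕하세요, 세계!"                 # any UTF-8 text
num_bytes = len(text.encode("utf-8"))    # 24 bytes for this string
num_tokens = 2                           # assumed token count, for illustration only
print(f"compression = {num_bytes / num_tokens:.1f}:1")  # -> compression = 12.0:1
```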
## Version Comparison Matrix
| Feature | v6.1.1 | v6.1.2 | v6.1.3 |
|---|---|---|---|
| Chunk Size | 256 bytes | 64 bytes | 64 bytes |
| Compression | ~3:1 actual | 18.6:1 actual* | 64:1 target |
| Language Support | 6 core | 6 core | 204 languages |
| Boundary Learning | ✅ Basic | ✅ Advanced | ✅ Multi-level |
| Cross-Attention | Basic | Enhanced | Full relational |
| Sliding Window | ❌ None | ✅ 8-byte overlap | ✅ Adaptive overlap |
| Training Mode | Teacher forcing | Mixed (50% autoregressive) | Curriculum learning |
| Streaming Support | ❌ None | ✅ Chunked | ✅ Real-time |
| Model Size | ~150M params | ~150M params | ~150M params |
## Performance Metrics

### Compression Ratios (Bytes → Tokens)
| Language Type | v6.1.1 | v6.1.2 | v6.1.3 (Target) |
|---|---|---|---|
| Isolating (Chinese) | ~3:1 | 39.0:1 | 50:1 |
| Agglutinative (Korean, Japanese) | ~4:1 | 26.5:1 | 40:1 |
| Fusional (English, Spanish) | ~3:1 | 5.4:1 | 30:1 |
| **Average** | ~3.3:1 | 18.6:1* | 40:1 |
*Note: v6.1.2 compression rates measured on 6 languages. Performance may vary when scaled to 204 languages (v6.1.3).
### Reconstruction Accuracy
| Version | Character Level | Word Level | Semantic |
|---|---|---|---|
| v6.1.1 | ~80% | ~70% | N/A |
| v6.1.2 | 100% | ~95% | N/A |
| v6.1.3 | Target: 95%+ | Target: 93%+ | N/A |
## Major Architectural Changes

### v6.1.1 → v6.1.2 Improvements
#### 1. Chunk Size Reduction (256 → 64 bytes)

```python
# v6.1.1
max_seq_len = 256  # large chunks, less granular
# v6.1.2
max_seq_len = 64   # optimal for boundary detection
```

- Impact: 4x more granular processing (see the quick comparison below)
- Benefit: Better boundary detection and compression
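For intuition about the 4x figure, compare chunk counts for the same 1 KB input (overlap ignored for simplicity):

```python
# Granularity comparison; ignores the sliding-window overlap introduced below
data = b"x" * 1024
print(len(data) // 256)  # v6.1.1: 4 chunks
print(len(data) // 64)   # v6.1.2: 16 chunks -> 4x more granular
```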
#### 2. Boundary Learning System

```python
# v6.1.2 introduced three-level boundaries
char_boundaries    # character-level segmentation
eojeol_boundaries  # word/morpheme boundaries (main compression)
phrase_boundaries  # phrase-level grouping
```

- Impact: Hierarchical compression understanding
- Benefit: Language-agnostic pattern learning (a possible head design is sketched below)
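A minimal sketch of what a three-level boundary predictor could look like. The head names mirror the variables above, but the module itself is an assumption for illustration, not the released architecture:

```python
import torch
import torch.nn as nn

class BoundaryHeads(nn.Module):
    """Hypothetical three-level boundary predictor over per-byte encoder states."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # One binary head per boundary level (assumed design)
        self.char_head = nn.Linear(hidden_dim, 1)
        self.eojeol_head = nn.Linear(hidden_dim, 1)  # word/morpheme level
        self.phrase_head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> dict:
        # hidden: (batch, seq_len, hidden_dim) byte-level encoder states
        return {
            "char_boundaries": torch.sigmoid(self.char_head(hidden)).squeeze(-1),
            "eojeol_boundaries": torch.sigmoid(self.eojeol_head(hidden)).squeeze(-1),
            "phrase_boundaries": torch.sigmoid(self.phrase_head(hidden)).squeeze(-1),
        }
```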
#### 3. Enhanced Cross-Attention

```python
# v6.1.1: basic dot-product attention scores
attention = torch.matmul(Q, K.transpose(-2, -1))

# v6.1.2: relational cross-attention
relations = self.learn_relations(encoder_hidden, decoder_hidden)
cross_attention = self.cross_attention(relations)
```

- Impact: Better sequence-to-sequence mapping
- Benefit: Improved reconstruction accuracy (one possible wiring is sketched below)
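One way the relational variant could be wired, assuming the relation is a learned projection of concatenated decoder states and pooled encoder context; the module and its internals are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class RelationalCrossAttention(nn.Module):
    """Hypothetical sketch: condition cross-attention on learned encoder-decoder relations."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.learn_relations = nn.Linear(2 * dim, dim)  # assumed relation projection
        self.cross_attention = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, encoder_hidden: torch.Tensor, decoder_hidden: torch.Tensor):
        # Pool encoder context and fuse it into every decoder position (assumed scheme)
        context = encoder_hidden.mean(dim=1, keepdim=True).expand_as(decoder_hidden)
        relations = self.learn_relations(torch.cat([decoder_hidden, context], dim=-1))
        # Relations act as queries; encoder states provide keys and values
        out, _ = self.cross_attention(relations, encoder_hidden, encoder_hidden)
        return out
```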
#### 4. Sliding Window with Overlap

```python
# v6.1.2 implementation
chunk_size = 62  # max payload bytes per chunk
overlap = 8      # bytes shared with the previous chunk to preserve boundaries
for i in range(0, len(text), chunk_size - overlap):
    process_chunk(text[i:i + chunk_size])
```

- Impact: Seamless boundary handling
- Benefit: No information loss at chunk boundaries (reassembly is sketched below)
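The complementary step on reconstruction is dropping the duplicated bytes when stitching decoded chunks back together. A minimal sketch, assuming each decoded chunk comes back as a byte string:

```python
def stitch_chunks(decoded_chunks: list, overlap: int = 8) -> bytes:
    """Hypothetical reassembly: consecutive chunks share `overlap` bytes, so
    every chunk after the first contributes only its non-overlapping tail."""
    if not decoded_chunks:
        return b""
    out = decoded_chunks[0]
    for chunk in decoded_chunks[1:]:
        out += chunk[overlap:]  # first 8 bytes repeat the previous chunk's tail
    return out
```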
#### 5. Aggressive Compression Training

```python
# v6.1.2 loss weights
loss_weights = {
    'compression': 2.0,        # heavily weighted
    'reconstruction': 1.5,     # balanced with quality
    'boundary_detection': 1.0,
}
```

- Impact: Model prioritizes compression
- Benefit: Achieves higher compression ratios (see the combined-loss sketch below)
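The weights would typically enter the objective as a weighted sum; a minimal sketch, where the dictionary of per-objective loss tensors (`losses`) is assumed:

```python
# Minimal sketch: fold the weighted objectives into one scalar loss
loss_weights = {'compression': 2.0, 'reconstruction': 1.5, 'boundary_detection': 1.0}

def total_loss(losses: dict):
    # `losses` maps each objective name to its scalar loss tensor
    return sum(loss_weights[name] * value for name, value in losses.items())
```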
### v6.1.2 → v6.1.3 Advancements

#### 1. Massive Scale (6 → 204 Languages)

```
# v6.1.3 language groups
Phase 1: 15 isolating languages
Phase 2: +30 agglutinative languages
Phase 3: +50 fusional languages
Phase 4: all 204 Flores-200 languages
```

- Impact: True universal tokenization
- Benefit: Cross-lingual transfer learning
#### 2. Curriculum Learning

```
# 4-phase progressive training
Epochs 1-50:    isolating (easiest to compress)
Epochs 51-100:  +agglutinative (medium difficulty)
Epochs 101-200: +fusional (harder patterns)
Epochs 201+:    all 204 languages (full diversity)
```

- Impact: Stable learning progression
- Benefit: Prevents catastrophic forgetting (a scheduler sketch follows)
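A scheduler implementing that epoch-based expansion might look like the following; the language codes are placeholder subsets, not the actual Flores-200 partition:

```python
# Hypothetical curriculum scheduler matching the 4-phase schedule above
ISOLATING = ["zho", "vie", "tha"]      # placeholder subset of 15
AGGLUTINATIVE = ["kor", "jpn", "tur"]  # placeholder subset of 30
FUSIONAL = ["eng", "spa", "rus"]       # placeholder subset of 50
ALL_204 = ISOLATING + AGGLUTINATIVE + FUSIONAL  # ...plus remaining Flores-200 codes

def active_languages(epoch: int) -> list:
    """Return the language pool to sample from at a given epoch."""
    if epoch <= 50:
        return ISOLATING
    if epoch <= 100:
        return ISOLATING + AGGLUTINATIVE
    if epoch <= 200:
        return ISOLATING + AGGLUTINATIVE + FUSIONAL
    return ALL_204
```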
#### 3. Unsupervised Learning

```python
# v6.1.2: supervised, with labels from boundary_labels.py
labels = generate_boundary_labels(text)
loss = criterion(predictions, labels)

# v6.1.3: self-supervised discovery
loss = model.discover_patterns(text)  # no external labels
```

- Impact: Model learns patterns independently
- Benefit: Discovers language-specific optimizations (one candidate signal is sketched below)
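The document does not spell out the discovery mechanism. One common self-supervised signal for segment boundaries in byte-level models is next-byte prediction entropy: boundaries tend to fall where the model is most uncertain about what comes next. A sketch under that assumption:

```python
import torch

def entropy_boundaries(next_byte_logits: torch.Tensor, threshold: float = 3.0):
    """Assumed mechanism: propose boundaries where next-byte entropy spikes.

    next_byte_logits: (batch, seq_len, 256) logits over the next byte value.
    Returns a boolean (batch, seq_len) mask of candidate boundary positions.
    """
    probs = torch.softmax(next_byte_logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)  # nats
    return entropy > threshold  # high uncertainty -> likely segment boundary
```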
#### 4. Adaptive Compression

```python
# Dynamic compression target based on language type
if is_isolating(lang):
    target_compression = 50   # 50:1
elif is_agglutinative(lang):
    target_compression = 40   # 40:1
else:  # fusional
    target_compression = 30   # 30:1
```

- Impact: Language-aware optimization
- Benefit: Optimal compression per language family
#### 5. Real-time Streaming

```python
# v6.1.3 streaming capability (sketch)
def stream_chunks(byte_stream: bytes, size: int):
    # Yield fixed-size byte chunks from the incoming stream
    for i in range(0, len(byte_stream), size):
        yield byte_stream[i:i + size]

class StreamingB2NL:
    def process_stream(self, byte_stream: bytes):
        for chunk in stream_chunks(byte_stream, 64):
            yield self.compress(chunk)  # compress() is the model's chunk encoder
```

- Impact: Process unbounded streams
- Benefit: Production-ready for real-time applications (usage sketched below)
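Hypothetical usage of that interface over a file; `send_downstream` is a placeholder consumer, not part of the API:

```python
# Illustrative only: stream a file's bytes through the tokenizer
tokenizer = StreamingB2NL()
with open("corpus.txt", "rb") as f:
    for compressed in tokenizer.process_stream(f.read()):
        send_downstream(compressed)  # placeholder for whatever consumes the tokens
```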
## Language Coverage Evolution

### v6.1.1 - Proof of Concept (6 languages)
- Korean, English, Chinese, Japanese, Spanish, Arabic
- Focus: Validating the core language types
### v6.1.2 - Enhanced Version (6 languages)

- Same 6 languages, but with:
  - Boundary detection
  - Sliding window processing
  - ~5.6x better compression (3.3:1 → 18.6:1 average)
### v6.1.3 - Universal Scale (204 languages)

- Currently training on the full Flores-200 dataset
- Covers most of the world's widely written languages
- Includes low-resource languages
- Full Unicode support (emoji, symbols, etc.; byte-level handling is illustrated below)
- Note: Compression performance is still to be validated across all 204 languages
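Byte-level operation is what makes full Unicode support automatic: every code point, including emoji, is just a short UTF-8 byte sequence, so nothing is ever out-of-vocabulary:

```python
# Any Unicode text reduces to UTF-8 bytes, so no symbol is out-of-vocabulary
for s in ["hello", "안녕", "🙂"]:
    print(s, list(s.encode("utf-8")))
# hello [104, 101, 108, 108, 111]
# 안녕 [236, 149, 136, 235, 133, 149]
# 🙂 [240, 159, 153, 130]
```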
## Key Innovations by Version

### v6.1.1 - Foundation

- ✅ Pure byte-level tokenization
- ✅ No vocabulary needed
- ✅ Universal UTF-8 support
- ✅ Basic compression (~3:1)
### v6.1.2 - Breakthrough

- ✅ Boundary learning system
- ✅ Sliding window processing
- ✅ Enhanced cross-attention
- ✅ Significant compression (18.6:1)
- ✅ Streaming support
### v6.1.3 - World-Class

- 🔄 In training: 204 language support
- 🔄 Curriculum learning approach
- 🔄 Unsupervised pattern discovery
- 🔄 Target: 64:1 compression
- 🔄 Cross-lingual transfer
## Training Progress

### v6.1.3 Current Status
- Phase: 1 (Isolating languages)
- Languages: 15/204 active
- Current Compression: ~4:1 (improving)
- Reconstruction: 85%+ (rising fast)
- Expected Completion: Phase 4 by epoch 300
## Use Cases by Version

### v6.1.1
- Research prototype
- Concept validation
- Academic papers
### v6.1.2 (Current POC)
- Research demonstrations
- Working proof of concept
- 18.6:1 average compression (best_model.pt, 6 languages)
- 100% reconstruction accuracy
- Boundary learning successfully implemented
- Note: The high compression may partly reflect the small language set
### v6.1.3 (Future)
- Global-scale applications
- Multi-lingual LLMs
- Universal translation systems
- Cross-lingual search engines
## Why B2NL Matters

### Industry Impact
- Research Value: Exploring byte-level compression limits
- Innovation: Learning-based approach without fixed vocabulary
- Potential: Targeting compression ratios up to 64:1
- Progress: Continuous improvement across versions
### Technical Advantages
- No vocabulary management
- No tokenizer updates needed
- Works with any UTF-8 text
- Future-proof architecture
### Business Value
- For Research: Novel byte-level approach
- For Development: No vocabulary management
- For Future: Scalable to many languages
- For Testing: Working proof of concept
## Recommendation

### For POC/Demo: Use v6.1.2 (best_model.pt)
- Working implementation
- 18.6:1 compression achieved (6 languages)
- 100% reconstruction accuracy
- Successfully demonstrates byte-level compression
- Note: Compression rates may decrease with more languages (204 in v6.1.3)
### For the Future Roadmap: Plan for v6.1.3
- 204 language support
- 64:1 compression target
- Currently in training
- Q1 2025 availability
*B2NL - Transforming bytes into intelligence, one token at a time.*