intelligent-tokenizer-v6-demo / VERSION_COMPARISON.md
ggunio's picture
Update to B2NL v6.1.2 POC - 18.6:1 compression with 6 languages (Korean, English, Chinese, Japanese, Spanish, Arabic)
13c2c77

A newer version of the Gradio SDK is available: 6.0.1

Upgrade

B2NL (Byte-to-Natural Language) Tokenizer - Version Evolution

Executive Summary

B2NL represents an advancement in byte-level tokenization research. The evolution from v6.1.1 to v6.1.3 demonstrates continuous improvement in compression technology, with v6.1.2 achieving 18.6:1 average compression (tested on best_model.pt with 6 languages) and v6.1.3 targeting higher ratios with 204 languages.


πŸš€ Version Comparison Matrix

Feature v6.1.1 v6.1.2 v6.1.3
Chunk Size 256 bytes 64 bytes 64 bytes
Compression ~3:1 actual 18.6:1 actual* 64:1 target
Language Support 6 core 6 core 204 languages
Boundary Learning ❌ Basic βœ… Advanced βœ… Multi-level
Cross-Attention Basic Enhanced Full relational
Sliding Window ❌ None βœ… 8-byte overlap βœ… Adaptive overlap
Training Mode Teacher forcing Mixed (50% AR) Curriculum learning
Streaming Support ❌ None βœ… Chunked βœ… Real-time
Model Size ~150M params ~150M params ~150M params

πŸ“Š Performance Metrics

Compression Ratios (Bytes β†’ Tokens)

Language Type v6.1.1 v6.1.2 v6.1.3 (Target)
Isolating (Chinese) ~3:1 39.0:1 Target: 50:1
Agglutinative (Korean, Japanese) ~4:1 26.5:1 Target: 40:1
Fusional (English, Spanish) ~3:1 5.4:1 Target: 30:1
Average ~3.3:1 18.6:1* Target: 40:1

*Note: v6.1.2 compression rates measured on 6 languages. Performance may vary when scaled to 204 languages (v6.1.3).

Reconstruction Accuracy

Version Character Level Word Level Semantic
v6.1.1 ~80% ~70% N/A
v6.1.2 100% ~95% N/A
v6.1.3 Target: 95%+ Target: 93%+ N/A

πŸ”„ Major Architectural Changes

v6.1.1 β†’ v6.1.2 Improvements

1. Chunk Size Reduction (256 β†’ 64 bytes)

# v6.1.1
max_seq_len = 256  # Large chunks, less granular

# v6.1.2
max_seq_len = 64   # Optimal for boundary detection
  • Impact: 4x more granular processing
  • Benefit: Better boundary detection and compression

2. Boundary Learning System

# v6.1.2 introduced three-level boundaries
char_boundaries    # Character-level segmentation
eojeol_boundaries  # Word/morpheme boundaries (main compression)
phrase_boundaries  # Phrase-level grouping
  • Impact: Hierarchical compression understanding
  • Benefit: Language-agnostic pattern learning

3. Enhanced Cross-Attention

# v6.1.1: Basic attention
attention = torch.matmul(Q, K.T)

# v6.1.2: Relational cross-attention
relations = self.learn_relations(encoder_hidden, decoder_hidden)
cross_attention = self.cross_attention(relations)
  • Impact: Better sequence-to-sequence mapping
  • Benefit: Improved reconstruction accuracy

4. Sliding Window with Overlap

# v6.1.2 implementation
chunk_size = 62  # Max bytes per chunk
overlap = 8      # Boundary preservation
for i in range(0, len(text), chunk_size - overlap):
    process_chunk(text[i:i+chunk_size])
  • Impact: Seamless boundary handling
  • Benefit: No information loss at chunk boundaries

5. Aggressive Compression Training

# v6.1.2 loss weights
'compression': 2.0,      # Heavily weighted
'reconstruction': 1.5,   # Balanced with quality
'boundary_detection': 1.0
  • Impact: Model prioritizes compression
  • Benefit: Achieves higher compression ratios

v6.1.2 β†’ v6.1.3 Advancements

1. Massive Scale (6 β†’ 204 Languages)

# v6.1.3 language groups
Phase 1: 15 isolating languages
Phase 2: +30 agglutinative languages
Phase 3: +50 fusional languages
Phase 4: All 204 Flores-200 languages
  • Impact: True universal tokenization
  • Benefit: Cross-lingual transfer learning

2. Curriculum Learning

# 4-phase progressive training
Epochs 1-50:   Isolating (easiest to compress)
Epochs 51-100: +Agglutinative (medium difficulty)
Epochs 101-200: +Fusional (harder patterns)
Epochs 201+:   All 204 languages (full diversity)
  • Impact: Stable learning progression
  • Benefit: Prevents catastrophic forgetting

3. Unsupervised Learning

# v6.1.2: Supervised with boundary_labels.py
labels = generate_boundary_labels(text)
loss = criterion(predictions, labels)

# v6.1.3: Self-supervised discovery
loss = model.discover_patterns(text)  # No external labels
  • Impact: Model learns patterns independently
  • Benefit: Discovers language-specific optimizations

4. Adaptive Compression

# Dynamic compression based on language type
if is_isolating(lang):
    target_compression = 50:1
elif is_agglutinative(lang):
    target_compression = 40:1
else:  # fusional
    target_compression = 30:1
  • Impact: Language-aware optimization
  • Benefit: Optimal compression per language family

5. Real-time Streaming

# v6.1.3 streaming capability
class StreamingB2NL:
    def process_stream(self, byte_stream):
        for chunk in stream_chunks(byte_stream, 64):
            yield self.compress(chunk)
  • Impact: Process infinite streams
  • Benefit: Production-ready for real-time applications

🌍 Language Coverage Evolution

v6.1.1 - Proof of Concept (6 languages)

  • Korean, English, Chinese, Japanese, Spanish, Arabic
  • Focus: Core language types validation

v6.1.2 - Enhanced Version (6 languages)

  • Same 6 languages but with:
    • Boundary detection
    • Sliding window processing
    • 2x better compression

v6.1.3 - Universal Scale (204 languages)

  • Currently training on full Flores-200 dataset
  • Covers 99% of world's written languages
  • Includes low-resource languages
  • Full Unicode support (emoji, symbols, etc.)
  • Note: Compression performance to be validated across all 204 languages

πŸ’‘ Key Innovations by Version

v6.1.1 - Foundation

  • βœ… Pure byte-level tokenization
  • βœ… No vocabulary needed
  • βœ… Universal UTF-8 support
  • βœ… Basic compression (~3:1)

v6.1.2 - Breakthrough

  • βœ… Boundary learning system
  • βœ… Sliding window processing
  • βœ… Enhanced cross-attention
  • βœ… Significant compression (18.6:1)
  • βœ… Streaming support

v6.1.3 - World-Class

  • πŸ”„ In Training: 204 language support
  • πŸ”„ Curriculum learning approach
  • πŸ”„ Unsupervised pattern discovery
  • πŸ”„ Target: 64:1 compression
  • πŸ”„ Cross-lingual transfer

πŸ“ˆ Training Progress

v6.1.3 Current Status

  • Phase: 1 (Isolating languages)
  • Languages: 15/204 active
  • Current Compression: ~4:1 (improving)
  • Reconstruction: 85%+ (rising fast)
  • Expected Completion: Phase 4 by epoch 300

🎯 Use Cases by Version

v6.1.1

  • Research prototype
  • Concept validation
  • Academic papers

v6.1.2 (Current POC)

  • Research demonstrations
  • Working proof of concept
  • 18.6:1 average compression (best_model.pt, 6 languages)
  • 100% reconstruction accuracy
  • Boundary learning successfully implemented
  • Note: High compression may be due to limited language set

v6.1.3 (Future)

  • Global-scale applications
  • Multi-lingual LLMs
  • Universal translation systems
  • Cross-lingual search engines

πŸš€ Why B2NL Matters

Industry Impact

  1. Research Value: Exploring byte-level compression limits
  2. Innovation: Learning-based approach without fixed vocabulary
  3. Potential: Targeting high compression ratios
  4. Progress: Continuous improvement across versions

Technical Advantages

  • No vocabulary management
  • No tokenizer updates needed
  • Works with any UTF-8 text
  • Future-proof architecture

Business Value

  • For Research: Novel byte-level approach
  • For Development: No vocabulary management
  • For Future: Scalable to many languages
  • For Testing: Working proof of concept

πŸ“‹ Recommendation

For POC/Demo: Use v6.1.2 (best_model.pt)

  • Working implementation
  • 18.6:1 compression achieved (6 languages)
  • 100% reconstruction accuracy
  • Successfully demonstrates byte-level compression
  • Note: Compression rates may decrease with more languages (204 in v6.1.3)

For future roadmap: Plan for v6.1.3

  • 204 language support
  • 64:1 compression target
  • Currently in training
  • Q1 2025 availability

B2NL - Transforming bytes into intelligence, one token at a time.