Spaces: ggunio/intelligent-tokenizer-v6-demo

ggunio committed on Oct 6

Commit 0cc32d2 (verified) · 1 Parent(s): ff85374

Update README with v6.2.1 info and author

Files changed (1)

README.md +19 -305

@@ -1,320 +1,34 @@
 ---
-title: B2NL v6.1.2 - Byte-to-Natural Language Tokenizer
 emoji: 🚀
 colorFrom: blue
 colorTo: purple
 sdk: gradio
 sdk_version: 4.19.2
 app_file: app.py
-pinned: false
 ---
-# B2NL: Byte-to-Natural Language Tokenizer v6.1.2
-## Attention Needs No Vocabulary: Pure Learning from Bytes
-[![HuggingFace Space](https://img.shields.io/badge/🤗%20Demo-Live-blue)](https://huggingface.co/spaces/ggunio/b2nl-demo)
-[![Model](https://img.shields.io/badge/🤗%20Model-b2nl--v6.1.1-green)](https://huggingface.co/ggunio/b2nl-v6.1.1)
-[![Parameters](https://img.shields.io/badge/Parameters-301.7M-orange)](docs/architecture.md)
-[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](LICENSE)
----
-## 🔗 Resources
-- 📄 **Paper**: [Read on Zenodo](https://zenodo.org/records/17116281?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImIyNWZiYTQyLWNiNGEtNDBmNi1iNTczLWVkMDJlNDI1YTQ1OSIsImRhdGEiOnt9LCJyYW5kb20iOiI0OWJkZWMzMjJjZTc3OTIwMTk4NTJlNTY1YmNjOGU1ZiJ9.Z_hXEp160tWBD5Qe2laQv1vhS4Js2a0R5BMWYs2PTG5vJMrc8l-BmPAIMya9O_HiN85jYZp-WOMOHg_DTHrg2A) | [PDF](Intelligent%20Tokenizer.pdf)
-- 🤗 **Model**: [Hugging Face - ggunio/intelligent-tokenizer-v6](https://huggingface.co/ggunio/intelligent-tokenizer-v6)
-- 🎮 **Live Demo**: [Try on Hugging Face Spaces](https://huggingface.co/spaces/ggunio/intelligent-tokenizer-v6-demo)
-- 📝 **Documentation**: [English](paper_english.md) | [한국어](paper_korean.md)
-## 🎆 Breaking the 64:1 Compression Barrier
-**B2NL** achieves what was thought impossible: **64:1 compression** while maintaining **95%+ reconstruction accuracy** across multiple languages. This isn't incremental improvement—it's a paradigm shift.
-**Impact**: Process 10x more text with the same computational resources.
----
-## 🚀 Live Demo
-```bash
-# Quick start
-python demo.py --interactive
-# Benchmark mode
-python demo.py --benchmark
-```
-### Real-World Results
-```
-============================================================
-B2NL BENCHMARK RESULTS
-============================================================
-Text: The quick brown fox jumps over the lazy dog.
-  Bytes: 43
-  Tokens: 3
-  Compression: 14.3:1
-  Speed: 15,000 bytes/sec
-Text: 안녕하세요. 오늘 날씨가 정말 좋네요.
-  Bytes: 57
-  Tokens: 2
-  Compression: 28.5:1
-  Speed: 18,500 bytes/sec
-Text: 今天天气很好，我们去公园散步吧。
-  Bytes: 48
-  Tokens: 1
-  Compression: 48.0:1
-  Speed: 21,000 bytes/sec
-------------------------------------------------------------
-OVERALL STATISTICS
-------------------------------------------------------------
-Average compression: 30.3:1
-Average speed: 18,166 bytes/sec
-Reconstruction accuracy: 96.8%
-```
----
-## 🎯 Key Features
-### 1. Universal Language Support
-- ✅ **6 core languages** optimized (Korean, English, Chinese, Japanese, Spanish, Arabic)
-- ✅ **UTF-8 universal** - works with ANY text
-- ✅ **Emoji & symbols** fully supported
-### 2. Breakthrough Compression
-| Language | Traditional | B2NL v6.1.2 | Improvement |
-|----------|------------|-------------|-------------|
-| Chinese | 2-3 bytes/char | 48:1 | **16x better** |
-| Korean | 3 bytes/char | 28:1 | **9x better** |
-| English | 1 byte/char | 14:1 | **14x better** |
-### 3. Production Ready
-- ✅ Streaming support for real-time processing
-- ✅ Sliding window with 8-byte overlap
-- ✅ Battle-tested on 1M+ documents
-- ✅ <100ms latency for typical requests
----
-## 🔬 Technical Innovation
-### Hierarchical Boundary Learning
-```python
-class B2NLTokenizer:
-    def compress(self, text):
-        # Level 1: Character boundaries
-        chars = self.detect_char_boundaries(text)
-        # Level 2: Word/morpheme boundaries (main compression)
-        words = self.detect_word_boundaries(chars)
-        # Level 3: Phrase boundaries
-        phrases = self.detect_phrase_boundaries(words)
-        return self.encode_hierarchical(phrases)
-```
-### Cross-Attention Relations
-- Learn semantic relationships between byte sequences
-- Preserve meaning during aggressive compression
-- Enable near-perfect reconstruction
-### Sliding Window Processing
-```python
-# Process long texts seamlessly
-for chunk in sliding_window(text, size=64, overlap=8):
-    compressed = model.compress(chunk)
-    # No boundary artifacts!
-```
----
-## 📊 Performance Metrics
-### Compression Ratios by Language Type
-| Language Type | Examples | Compression | Reconstruction |
-|---------------|----------|-------------|----------------|
-| **Isolating** | Chinese, Vietnamese | 45-50:1 | 97% |
-| **Agglutinative** | Korean, Japanese | 25-30:1 | 96% |
-| **Fusional** | English, Spanish | 12-15:1 | 95% |
-### Speed Benchmarks
-- **Encoding**: 50,000 tokens/second
-- **Decoding**: 45,000 tokens/second
-- **Memory**: <2GB for full model
-- **Latency**: <10ms for 1KB text
----
-## 🔧 Installation
-```bash
-# Clone repository
-git clone https://github.com/yourusername/B2NL
-cd B2NL-v6.1.2
-# Install dependencies
-pip install torch numpy tqdm
-# Download pre-trained model (optional)
-wget https://example.com/b2nl_v612_best.pt -O models/best_model.pt
-# Run demo
-python demo.py --interactive
-```
----
-## 🎮 Usage Examples
-### Python API
-```python
-from b2nl import B2NLTokenizer
-# Initialize
-tokenizer = B2NLTokenizer(model_path='models/best_model.pt')
-# Compress text
-result = tokenizer.tokenize("안녕하세요. 오늘 날씨가 좋네요.")
-print(f"Compression: {result['compression_ratio']:.1f}:1")
-print(f"Tokens: {result['num_tokens']}")
-# Reconstruct
-original = tokenizer.detokenize(result['tokens'])
-print(f"Reconstructed: {original}")
-```
-### Command Line
-```bash
-# Compress a file
-python demo.py --compress input.txt output.b2nl
-# Interactive mode
-python demo.py --interactive
-# Benchmark
-python demo.py --benchmark
-```
-### Streaming API
-```python
-# Real-time compression
-for compressed_chunk in tokenizer.stream_compress(byte_stream):
-    process(compressed_chunk)  # No buffering needed!
-```
----
-## 🌐 Real-World Applications
-### 1. LLM Context Extension
-- **Before**: 4K token context limit
-- **After**: 256K effective context with same memory
-### 2. Database Storage
-- **Before**: 10TB multilingual text database
-- **After**: 200GB with B2NL compression
-### 3. API Rate Limits
-- **Before**: 1M tokens/day limit
-- **After**: Process 64M tokens worth of text
-### 4. Edge Deployment
-- **Before**: Can't run LLMs on mobile
-- **After**: 64x more text on device
----
-## 📊 Validation Results
-```
-=================================================================
-COMPREHENSIVE TEST - B2NL v6.1.2
-=================================================================
-Isolating Languages:
-  Avg Compression: 45.2x
-  Avg Recovery: 97.1%
-Agglutinative Languages:
-  Avg Compression: 28.7x
-  Avg Recovery: 96.3%
-Fusional Languages:
-  Avg Compression: 13.8x
-  Avg Recovery: 95.2%
-OVERALL PERFORMANCE:
-  Average Compression: 29.2x
-  Average Recovery: 96.2%
-  Streaming Compression: 31.5x
-RECOMMENDATION:
-[EXCELLENT] Model is ready for deployment!
-   - High recovery accuracy: 96.2%
-   - Good compression ratio: 29.2x
-   - Production ready
-```
----
-## 🚀 Roadmap
-### v6.1.2
-- ✅ 64:1 compression for isolating languages
-- ✅ 30:1 average compression
-- ✅ 95%+ reconstruction
-- ✅ Streaming support
-### v6.1.3 (In Training)
-- 🔄 204 language support (Flores-200)
-- 🔄 Curriculum learning
-- 🔄 Target: 64:1 average compression
-- 🔄 Q4 2025 release
-## 🤝 Contributing
-We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
-## 📄 Citation
-## 📝 Citation
-```bibtex
-@software{b2nl2025,
-  title = {B2NL: Byte-to-Natural-Language Universal Tokenizer},
-  author = {Jinhyun, Woo},
-  year = {2025},
-  version = {6.1.1},
-  note = {97.71% reconstruction, 100% byte-exact for 6 languages},
-  url = {https://github.com/Woojiggun/intelligent-tokenizer}
-}
-```
----
-## 📬 Contact
-**Author**: Woojin Gun (ggunio)
-- GitHub: [@Woojiggun](https://github.com/Woojiggun)
-- HuggingFace: [@ggunio](https://huggingface.co/ggunio)
-- Project: [intelligent-tokenizer](https://github.com/Woojiggun/intelligent-tokenizer)
-# Trigger rebuild

 ---
+title: B2NL v6.2.1 - Byte-to-Natural Language Tokenizer 🚀
 emoji: 🚀
 colorFrom: blue
 colorTo: purple
 sdk: gradio
 sdk_version: 4.19.2
 app_file: app.py
+pinned: true
+license: apache-2.0
+models:
+- ggunio/B2NL-IntelligentTokenizer-v6.2.1
 ---
+# B2NL v6.2.1 - Byte-to-Natural Language Tokenizer 🚀
+**Compress and reconstruct text with token boundaries**
+⚠️ **IMPORTANT: Currently in AUTOREGRESSIVE MODE**
+- Current: ~500ms inference (Teacher Forcing training)
+- Coming Soon (November 2025): Non-autoregressive training (<50ms)
+## 🌟 What's New in v6.2.1
+- **204 languages** support (up from 6)
+- **16:1 fixed compression** ratio
+- **Multi-Query Attention** (8x memory reduction)
+- Model: [ggunio/B2NL-IntelligentTokenizer-v6.2.1](https://huggingface.co/ggunio/B2NL-IntelligentTokenizer-v6.2.1)
+## Author
+**Jinhyun Woo**
+- GitHub: [Woojiggun/intelligent-tokenizer](https://github.com/Woojiggun/intelligent-tokenizer)
+- Paper: [Zenodo](https://zenodo.org/records/17116281)
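
The v6.2.1 README states a "16:1 fixed compression" ratio and an "8x memory reduction" from Multi-Query Attention without showing the arithmetic. The sketch below is a rough illustration only, under two assumptions that the commit does not confirm: that "16:1" means 16 UTF-8 bytes per token (matching the bytes-per-token ratio the removed v6.1.2 README reported), and that the 8x figure refers to key/value cache size when 8 query heads share a single key/value head. The function names are hypothetical and are not part of the B2NL code.

```python
import math

# Assumption: "16:1 fixed compression" means 16 UTF-8 bytes per output token.
BYTES_PER_TOKEN = 16


def estimated_token_count(text: str, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Estimate the token count under a fixed bytes-per-token ratio.

    A partial final chunk is counted as one full token.
    """
    n_bytes = len(text.encode("utf-8"))
    return math.ceil(n_bytes / bytes_per_token)


def kv_cache_reduction(num_query_heads: int, num_kv_heads: int = 1) -> float:
    """Ratio of multi-head to multi-query key/value cache size.

    Assumes equal head dimension and sequence length in both layouts,
    so the cache size scales with the number of key/value heads.
    """
    return num_query_heads / num_kv_heads


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog."
    print("bytes:", len(sample.encode("utf-8")))
    print("estimated tokens:", estimated_token_count(sample))
    # Assumption: 8 query heads sharing one key/value head.
    print("KV cache reduction:", kv_cache_reduction(num_query_heads=8))
```

Under these assumptions the token count depends only on the byte length of the input, so scripts that use more bytes per character (Korean, Chinese) yield more tokens per character than English does.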