ggunio committed on
Commit 0cc32d2 · verified · 1 Parent(s): ff85374

Update README with v6.2.1 info and author

Files changed (1):
  1. README.md +19 -305
README.md CHANGED
@@ -1,320 +1,34 @@
  ---
- title: B2NL v6.1.2 - Byte-to-Natural Language Tokenizer
  emoji: 🚀
  colorFrom: blue
  colorTo: purple
  sdk: gradio
  sdk_version: 4.19.2
  app_file: app.py
- pinned: false
  ---

- # B2NL: Byte-to-Natural Language Tokenizer v6.1.2
-
- ## Attention Needs No Vocabulary: Pure Learning from Bytes
-
- [![HuggingFace Space](https://img.shields.io/badge/🤗%20Demo-Live-blue)](https://huggingface.co/spaces/ggunio/b2nl-demo)
- [![Model](https://img.shields.io/badge/🤗%20Model-b2nl--v6.1.1-green)](https://huggingface.co/ggunio/b2nl-v6.1.1)
- [![Parameters](https://img.shields.io/badge/Parameters-301.7M-orange)](docs/architecture.md)
- [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](LICENSE)
-
- ---
- ## 🔗 Resources
-
- - 📄 **Paper**: [Read on Zenodo](https://zenodo.org/records/17116281?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImIyNWZiYTQyLWNiNGEtNDBmNi1iNTczLWVkMDJlNDI1YTQ1OSIsImRhdGEiOnt9LCJyYW5kb20iOiI0OWJkZWMzMjJjZTc3OTIwMTk4NTJlNTY1YmNjOGU1ZiJ9.Z_hXEp160tWBD5Qe2laQv1vhS4Js2a0R5BMWYs2PTG5vJMrc8l-BmPAIMya9O_HiN85jYZp-WOMOHg_DTHrg2A) | [PDF](Intelligent%20Tokenizer.pdf)
- - 🤗 **Model**: [Hugging Face - ggunio/intelligent-tokenizer-v6](https://huggingface.co/ggunio/intelligent-tokenizer-v6)
- - 🎮 **Live Demo**: [Try on Hugging Face Spaces](https://huggingface.co/spaces/ggunio/intelligent-tokenizer-v6-demo)
- - 📝 **Documentation**: [English](paper_english.md) | [Korean](paper_korean.md)
-
- ## 🎆 Breaking the 64:1 Compression Barrier
-
- **B2NL** achieves what was thought impossible: **64:1 compression** while maintaining **95%+ reconstruction accuracy** across multiple languages. This is not an incremental improvement; it is a paradigm shift.
-
- **Impact**: Process 10x more text with the same computational resources.
-
- ---
-
- ## 🚀 Live Demo
-
- ```bash
- # Quick start
- python demo.py --interactive
-
- # Benchmark mode
- python demo.py --benchmark
- ```
-
- ### Real-World Results
-
- ```
- ============================================================
- B2NL BENCHMARK RESULTS
- ============================================================
-
- Text: The quick brown fox jumps over the lazy dog.
- Bytes: 43
- Tokens: 3
- Compression: 14.3:1
- Speed: 15,000 bytes/sec
-
- Text: 안녕하세요. 오늘 날씨가 정말 좋네요.
- Bytes: 57
- Tokens: 2
- Compression: 28.5:1
- Speed: 18,500 bytes/sec
-
- Text: 今天天气很好，我们去公园散步吧。
- Bytes: 48
- Tokens: 1
- Compression: 48.0:1
- Speed: 21,000 bytes/sec
-
- ------------------------------------------------------------
- OVERALL STATISTICS
- ------------------------------------------------------------
- Average compression: 30.3:1
- Average speed: 18,166 bytes/sec
- Reconstruction accuracy: 96.8%
- ```
-
- ---
-
- ## 🎯 Key Features
-
- ### 1. Universal Language Support
- - ✅ **6 core languages** optimized (Korean, English, Chinese, Japanese, Spanish, Arabic)
- - ✅ **UTF-8 universal** - works with ANY text
- - ✅ **Emoji & symbols** fully supported
-
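The "UTF-8 universal" point can be checked directly: any Unicode string, in any script, reduces to bytes drawn from the same 256-symbol alphabet, so a byte-level model needs no per-language vocabulary. A quick stdlib illustration (not project code):

```python
# Every string, whatever the script, maps to a UTF-8 byte sequence
# over the same 256-symbol byte alphabet.
samples = ["hello", "안녕하세요", "今天天气很好", "🚀"]
byte_lens = [len(s.encode("utf-8")) for s in samples]
# ASCII is 1 byte/char, Hangul and Han are 3, emoji are 4 -- all just bytes.
```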
- ### 2. Breakthrough Compression
- | Language | Traditional | B2NL v6.1.2 | Improvement |
- |----------|------------|-------------|-------------|
- | Chinese | 2-3 bytes/char | 48:1 | **16x better** |
- | Korean | 3 bytes/char | 28:1 | **9x better** |
- | English | 1 byte/char | 14:1 | **14x better** |
-
- ### 3. Production Ready
- - ✅ Streaming support for real-time processing
- - ✅ Sliding window with 8-byte overlap
- - ✅ Battle-tested on 1M+ documents
- - ✅ <100ms latency for typical requests
-
- ---
-
- ## 🔬 Technical Innovation
-
- ### Hierarchical Boundary Learning
- ```python
- class B2NLTokenizer:
-     def compress(self, text):
-         # Level 1: Character boundaries
-         chars = self.detect_char_boundaries(text)
-
-         # Level 2: Word/morpheme boundaries (main compression)
-         words = self.detect_word_boundaries(chars)
-
-         # Level 3: Phrase boundaries
-         phrases = self.detect_phrase_boundaries(words)
-
-         return self.encode_hierarchical(phrases)
- ```
-
- ### Cross-Attention Relations
- - Learn semantic relationships between byte sequences
- - Preserve meaning during aggressive compression
- - Enable near-perfect reconstruction
-
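The cross-attention mechanism can be sketched with a minimal, dependency-free example (the `cross_attention`/`softmax` names and toy vectors are illustrative, not from the B2NL codebase): each compressed-token query takes a softmax-weighted average over byte states, which is how a few token slots can summarize a whole byte sequence.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query (one compressed-token slot) attends over all byte states."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score against every byte position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # one attention weight per byte position
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# A query aligned with the first key pulls out (almost exactly) the first value.
byte_keys = [[10.0, 0.0], [0.0, 10.0]]
byte_values = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention([[10.0, 0.0]], byte_keys, byte_values)
```

Because the output is a convex combination of the values, meaning from the byte sequence is preserved even when many positions collapse into one token.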
- ### Sliding Window Processing
- ```python
- # Process long texts seamlessly
- for chunk in sliding_window(text, size=64, overlap=8):
-     compressed = model.compress(chunk)
-     # No boundary artifacts!
- ```
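A concrete `sliding_window` helper matching the 64-byte window and 8-byte overlap described above could look like this (a sketch; the project's actual helper may differ):

```python
def sliding_window(data: bytes, size: int = 64, overlap: int = 8):
    """Yield `size`-byte chunks; consecutive chunks share `overlap` bytes."""
    step = size - overlap
    for start in range(0, max(len(data) - overlap, 1), step):
        yield data[start:start + size]

text = ("The quick brown fox jumps over the lazy dog. " * 3).encode("utf-8")
chunks = list(sliding_window(text))
```

Dropping the first `overlap` bytes of every chunk after the first stitches the stream back together exactly, which is why chunk boundaries leave no artifacts.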
-
- ---
-
- ## 📊 Performance Metrics
-
- ### Compression Ratios by Language Type
-
- | Language Type | Examples | Compression | Reconstruction |
- |---------------|----------|-------------|----------------|
- | **Isolating** | Chinese, Vietnamese | 45-50:1 | 97% |
- | **Agglutinative** | Korean, Japanese | 25-30:1 | 96% |
- | **Fusional** | English, Spanish | 12-15:1 | 95% |
-
- ### Speed Benchmarks
-
- - **Encoding**: 50,000 tokens/second
- - **Decoding**: 45,000 tokens/second
- - **Memory**: <2GB for full model
- - **Latency**: <10ms for 1KB text
-
- ---
-
- ## 🔧 Installation
-
- ```bash
- # Clone repository
- git clone https://github.com/yourusername/B2NL
- cd B2NL-v6.1.2
-
- # Install dependencies
- pip install torch numpy tqdm
-
- # Download pre-trained model (optional)
- wget https://example.com/b2nl_v612_best.pt -O models/best_model.pt
-
- # Run demo
- python demo.py --interactive
- ```
-
- ---
-
- ## 🎮 Usage Examples
-
- ### Python API
-
- ```python
- from b2nl import B2NLTokenizer
-
- # Initialize
- tokenizer = B2NLTokenizer(model_path='models/best_model.pt')
-
- # Compress text
- result = tokenizer.tokenize("안녕하세요. 오늘 날씨가 좋네요.")
- print(f"Compression: {result['compression_ratio']:.1f}:1")
- print(f"Tokens: {result['num_tokens']}")
-
- # Reconstruct
- original = tokenizer.detokenize(result['tokens'])
- print(f"Reconstructed: {original}")
- ```
-
- ### Command Line
-
- ```bash
- # Compress a file
- python demo.py --compress input.txt output.b2nl
-
- # Interactive mode
- python demo.py --interactive
-
- # Benchmark
- python demo.py --benchmark
- ```
-
- ### Streaming API
-
- ```python
- # Real-time compression
- for compressed_chunk in tokenizer.stream_compress(byte_stream):
-     process(compressed_chunk)  # No buffering needed!
- ```
-
- ---
-
- ## 🌍 Real-World Applications
-
- ### 1. LLM Context Extension
- - **Before**: 4K token context limit
- - **After**: 256K effective context with the same memory
-
- ### 2. Database Storage
- - **Before**: 10TB multilingual text database
- - **After**: 200GB with B2NL compression
-
- ### 3. API Rate Limits
- - **Before**: 1M tokens/day limit
- - **After**: Process 64M tokens' worth of text
-
- ### 4. Edge Deployment
- - **Before**: LLMs cannot run on mobile
- - **After**: 64x more text on device
-
- ---
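The context-extension arithmetic above is straightforward multiplication; a quick sanity check (illustrative only, not project code):

```python
def effective_context_bytes(token_limit: int, compression_ratio: int) -> int:
    # Each compressed token stands in for `compression_ratio` bytes of raw text.
    return token_limit * compression_ratio

# A 4K-token window at 64:1 covers ~256K bytes of raw text.
extended = effective_context_bytes(4096, 64)
```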
-
- ## 📊 Validation Results
-
- ```
- =================================================================
- COMPREHENSIVE TEST - B2NL v6.1.2
- =================================================================
-
- Isolating Languages:
-   Avg Compression: 45.2x
-   Avg Recovery: 97.1%
-
- Agglutinative Languages:
-   Avg Compression: 28.7x
-   Avg Recovery: 96.3%
-
- Fusional Languages:
-   Avg Compression: 13.8x
-   Avg Recovery: 95.2%
-
- OVERALL PERFORMANCE:
-   Average Compression: 29.2x
-   Average Recovery: 96.2%
-   Streaming Compression: 31.5x
-
- RECOMMENDATION:
-   [EXCELLENT] Model is ready for deployment!
-   - High recovery accuracy: 96.2%
-   - Good compression ratio: 29.2x
-   - Production ready
- ```
-
- ---
-
- ## 🚀 Roadmap
-
- ### v6.1.2
- - ✅ 64:1 compression for isolating languages
- - ✅ 30:1 average compression
- - ✅ 95%+ reconstruction
- - ✅ Streaming support
-
- ### v6.1.3 (In Training)
- - 🔄 204 language support (Flores-200)
- - 🔄 Curriculum learning
- - 🔄 Target: 64:1 average compression
- - 🔄 Q4 2025 release
-
- ## 🤝 Contributing
-
- We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
-
- ## 📄 Citation
-
- ```bibtex
- @software{b2nl2025,
-   title = {B2NL: Byte-to-Natural-Language Universal Tokenizer},
-   author = {Woo, Jinhyun},
-   year = {2025},
-   version = {6.1.1},
-   note = {97.71% reconstruction, 100% byte-exact for 6 languages},
-   url = {https://github.com/Woojiggun/intelligent-tokenizer}
- }
- ```
-
- ---
-
- ## 📬 Contact
-
- **Author**: Woojin Gun (ggunio)
- - GitHub: [@Woojiggun](https://github.com/Woojiggun)
- - HuggingFace: [@ggunio](https://huggingface.co/ggunio)
- - Project: [intelligent-tokenizer](https://github.com/Woojiggun/intelligent-tokenizer)
-
- # Trigger rebuild
  ---
+ title: B2NL v6.2.1 - Byte-to-Natural Language Tokenizer 🚀
  emoji: 🚀
  colorFrom: blue
  colorTo: purple
  sdk: gradio
  sdk_version: 4.19.2
  app_file: app.py
+ pinned: true
+ license: apache-2.0
+ models:
+ - ggunio/B2NL-IntelligentTokenizer-v6.2.1
  ---

+ # B2NL v6.2.1 - Byte-to-Natural Language Tokenizer 🚀
+
+ **Compress and reconstruct text with token boundaries**
+
+ ⚠️ **IMPORTANT: Currently in AUTOREGRESSIVE MODE**
+ - Current: ~500ms inference (teacher-forcing training)
+ - Coming soon (November 2025): non-autoregressive training (<50ms)
+
+ ## 🌟 What's New in v6.2.1
+ - **204 languages** supported (up from 6)
+ - **16:1 fixed compression** ratio
+ - **Multi-Query Attention** (8x memory reduction)
+ - Model: [ggunio/B2NL-IntelligentTokenizer-v6.2.1](https://huggingface.co/ggunio/B2NL-IntelligentTokenizer-v6.2.1)
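The "8x memory reduction" from Multi-Query Attention comes from caching one shared K/V head instead of one per query head. A back-of-the-envelope sketch (the layer and head counts here are illustrative assumptions, not the published v6.2.1 configuration):

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   d_head: int, dtype_bytes: int = 2) -> int:
    # K and V caches: two (seq_len, n_kv_heads, d_head) fp16 tensors per layer.
    return 2 * seq_len * n_layers * n_kv_heads * d_head * dtype_bytes

mha_cache = kv_cache_bytes(4096, 12, n_kv_heads=8, d_head=64)  # one KV head per query head
mqa_cache = kv_cache_bytes(4096, 12, n_kv_heads=1, d_head=64)  # single shared KV head
```

With 8 query heads sharing one K/V head, the cache shrinks by exactly the head count, which is where the 8x figure comes from.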
 
+ ## Author
+
+ **Jinhyun Woo**
+ - GitHub: [Woojiggun/intelligent-tokenizer](https://github.com/Woojiggun/intelligent-tokenizer)
+ - Paper: [Zenodo](https://zenodo.org/records/17116281)