apehex committed 3ec4efe (parent: 24a7100)

Update README.md

Files changed (1): README.md (+208 -5)
README.md CHANGED
Removed the placeholder sections "## Model description" and "## Intended uses & limitations" (both stubbed with "More information needed") and filled in "## Training and evaluation data"; the updated README follows.
---
library_name: keras
---

# tokun

> `to-kun` took tokens to t-can

Current tokenizers have notorious issues that are dragging all LLMs down.

`tokun` is a model specialized in text embedding.
It is **lossless** while providing **high input compression**.

`tokun` produces vectors of dimension 256, each equivalent to 64 UTF-32-BE bytes.
I.e. each embedding can be thought of as a *token of length 16 characters*.

But these vectors are more than basic IDs: they keep meaningful information on their constituent parts.
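
To see where these numbers come from, here is the arithmetic in plain Python (illustration only, independent of the model code):

```python
# every Unicode code point takes exactly 4 bytes in UTF-32-BE
__chunk = 'tokenization'.ljust(16)       # any text, padded to a 16-character chunk
__bytes = __chunk.encode('utf-32-be')    # fixed-width encoding

print(len(__bytes))
# 64, i.e. 16 characters * 4 bytes: the span covered by one embedding of dimension 256
```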
18
+
19
+ ## Features
20
+
21
+ The model produces vector embeddings that can be directly ingested by another model.
22
+
23
+ Regular tokens are unrelated IDs, while `tokun` has the following properties:
24
+
25
+ - **international**: `tokun` performs evenly on the whole Unicode space
26
+ - **compression**: the sequence length is divided by 16
27
+ - **embeddings**: the output vectors have only a dimension 256
28
+ - **lossless**: embeddings store all the information up to the byte level
29
+ - **built-ins**: Unicode has built-in special tokens, no need for `<|im_start|>`
30
+ - **meaningful**: embeddings are natively related to each-other based on their parts
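
To make the **built-ins** point concrete: Unicode already reserves control characters (for instance U+0002 START OF TEXT and U+0003 END OF TEXT) that can serve as delimiters. The snippet below is only a hedged sketch of that idea; the chosen delimiters are an assumption for illustration, not a `tokun` API.

```python
# hypothetical chat delimiters built from Unicode control characters, instead of
# custom strings like '<|im_start|>'; the specific code points are an assumption
STX = '\u0002'  # START OF TEXT
ETX = '\u0003'  # END OF TEXT

def wrap(role: str, content: str) -> str:
    # the delimiters are ordinary characters, so they stay plain UTF-32-BE bytes
    return f'{STX}{role}\n{content}{ETX}'

print(wrap('user', 'Translate "token" into French.').encode('utf-32-be')[:8])
```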

## Installation

In all cases, the model requires the code from the package `tokun`:

```shell
pip install tokun
```

### From Hugging Face

Log in to Hugging Face:

```shell
huggingface-cli login
```

Download the repository:

```python
import huggingface_hub as hh

api = hh.HfApi()
api.snapshot_download(repo_id='apehex/tokun', local_dir='tokun/')
```

Import the tokenizer and model:

```python
import tokun.huggingface

tokenizer = tokun.huggingface.ByteTokenizer()
model = hh.from_pretrained_keras('tokun/variants/4x16/')
```

### With Base TensorFlow / Keras

You can directly load the weights [from the repository](../models/).

For the most performant variant of the model, `4x16`:

```python
import tensorflow as tf
import tokun.model
import urllib.request

urllib.request.urlretrieve('https://github.com/apehex/tokun/raw/main/models/4x16/1/6.3.keras', 'model.keras')
model = tf.keras.models.load_model('model.keras')
```

## Usage

Since it is small (between 1 and 2M parameters depending on the variant), the model can also be [trained on Google Colab][notebook-file-tokun-train].
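
To check the size of the variant you loaded, the standard Keras calls are enough (quick sketch):

```python
# number of parameters of the loaded variant (standard Keras API)
print(model.count_params())
# layer-by-layer breakdown
model.summary()
```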

We will be encoding and decoding the following sample:

```python
__s = """Une unité lexicale ou token lexical ou plus simplement token est un couple composé d'un nom et d'une valeur optionnelle (e.g. 135677)."""
```

### With Hugging Face

The sequence dimension is fixed to 512 because exporting the Keras model requires specifying the input shape.
So the sample is padded to `16 * 512` characters, i.e. `64 * 512` bytes.

```python
import tensorflow as tf
import tokun.pipeline

# encode with UTF-32-BE
__x = tokenizer.batch_encode_plus(batch_text_or_text_pairs=[__s], padding='max_length', max_length=64 * 512, add_special_tokens=False)
__x = tf.convert_to_tensor(__x['input_ids'])
# tokenize
__e = model.layers[1](__x) # encoder
# these embeddings would be the input of an LLM
__o = llm(__e) # replace with your LLM
# detokenize
__p = model.layers[2](__o) # decoder
# interpret probabilities as byte indexes
__y = tokun.pipeline.postprocess(__p)
```

```python
print(len(__s))
# 252
print(__x.shape) # 16 * 512 characters = 64 * 512 bytes
# (1, 32768)
print(__e.shape) # 512 embeddings
# (1, 512, 256)
print(__p.shape) # back to x shape
# (1, 32768, 256)
```

> Note: the base TensorFlow implementation operates on any sequence dimension (see below)

### With Base TensorFlow / Keras

```python
import tokun.pipeline

# encode the text as UTF-32-BE bytes, padded to a multiple of 4 * 16 = 64
__x = tokun.pipeline.preprocess(text=__s, groups=[4, 16], expand=[1], flatten=True)
# tokenize
__e = model._encoder(__x) # final embedding = input for another model
# these embeddings would be the input of an LLM
__o = llm(__e) # replace with your LLM
# detokenize
__p = model._decoder(__o)
# interpret probabilities as byte indexes
__y = tokun.pipeline.postprocess(__p)
```

The original version doesn't fix the sequence dimension:

```python
print(len(__s))
# 252
print(__x.shape) # 4 * 252 = 1008 bytes, padded to 1024
# (1, 1024)
print(__e.shape) # 1024 / 64 = 16 embeddings
# (1, 16, 256)
print(__p.shape) # back to x shape
# (1, 1024, 256)
```
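
Both paths end with `__y`, the byte values recovered from the output probabilities. Assuming `postprocess` returns a flat sequence of byte values (an assumption here; check `tokun.pipeline` for the exact output type), the text can be restored with standard UTF-32-BE decoding:

```python
# hypothetical final step, not part of the documented API: under the assumption
# that __y is a flat sequence of integers in [0, 255]
__raw = bytes(int(__b) for __b in __y)
print(__raw.decode('utf-32-be', errors='ignore'))
```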

## Training and evaluation data

`tokun` was **trained on random sequences** of UTF-32-BE bytes, so that it covers the first 4 planes of Unicode.
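
As a rough sketch of what such random samples can look like (illustrative only; the actual data generators live in the `tokun` repository):

```python
import random

def random_sample(size: int=16) -> bytes:
    # draw code points from the first 4 Unicode planes (U+0000 to U+3FFFF),
    # skipping the surrogate range which cannot be encoded
    __codes = []
    while len(__codes) < size:
        __c = random.randrange(0, 0x40000)
        if not (0xD800 <= __c <= 0xDFFF):
            __codes.append(__c)
    return ''.join(chr(__c) for __c in __codes).encode('utf-32-be')

print(len(random_sample(16)))
# 64, i.e. one embedding worth of bytes
```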

Validation was also performed on the 7 languages of [MLQA][github-mlqa] to make sure the model keeps its accuracy on regular text.

## Resources

### Notebooks

Final model:

- train: [File][notebook-file-tokun-train] / [Colab][notebook-colab-tokun-train]
- demo: [File][notebook-file-tokun-demo] / [Colab][notebook-colab-tokun-demo]

Older / simpler model iterations:

- `tokun-1`: [File][notebook-file-tokun-1] / [Colab][notebook-colab-tokun-1]
- `tokun-4`: [File][notebook-file-tokun-4] / [Colab][notebook-colab-tokun-4]
- `tokun-16`: [File][notebook-file-tokun-16] / [Colab][notebook-colab-tokun-16]

### Articles

Main article:

- on [Github][article-file-tokun]
- on [Hugging Face][article-hugging-face]

Notes on each iteration:

- `tokun-1`: [Github][article-file-tokun-1]
- `tokun-4`: [Github][article-file-tokun-4]
- `tokun-16`: [Github][article-file-tokun-16]

## TODO

See [TODO](TODO.md).

## Credits

This project was inspired by a video from Andrej Karpathy, ["Let's build the GPT tokenizer"][youtube-karpathy-tokenizer].

## License

Licensed under the [AGPLv3](LICENSE.md).

[article-file-tokun]: https://github.com/apehex/tokun/blob/main/articles/tokun.md
[article-file-tokun-1]: https://github.com/apehex/tokun/blob/main/articles/tokun.1.md
[article-file-tokun-4]: https://github.com/apehex/tokun/blob/main/articles/tokun.4.md
[article-file-tokun-16]: https://github.com/apehex/tokun/blob/main/articles/tokun.16.md
[article-hugging-face]: https://huggingface.co/blog/apehex/tokenization-is-a-dead-weight
[article-notion-tokun-1]: https://apehex.notion.site/Tokun-1-e03c438a39fe49fcb2ce303eb63b2e73
[article-notion-tokun-4]: https://apehex.notion.site/Tokun-4-c8b4a3bd1270485a908287869553e9f2
[article-notion-tokun-16]: https://apehex.notion.site/Tokun-16-ecf35d5207ab401d85d3aa21d0b09538

[github-mlqa]: https://github.com/facebookresearch/MLQA

[notebook-colab-tokun-1]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.1.ipynb
[notebook-colab-tokun-4]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.4.ipynb
[notebook-colab-tokun-16]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.16.ipynb
[notebook-colab-tokun-demo]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.demo.ipynb
[notebook-colab-tokun-train]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.train.ipynb
[notebook-file-tokun-1]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.1.ipynb
[notebook-file-tokun-4]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.4.ipynb
[notebook-file-tokun-16]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.16.ipynb
[notebook-file-tokun-demo]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.demo.ipynb
[notebook-file-tokun-train]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.train.ipynb
[notebook-hf-tokun-demo]: ../notebooks/tokun.demo.ipynb
[notebook-hf-tokun-train]: ../notebooks/tokun.train.ipynb
[notebook-kaggle-tokun-demo]: ../notebooks/tokun.demo.ipynb
[notebook-kaggle-tokun-train]: ../notebooks/tokun.train.ipynb

[youtube-karpathy-tokenizer]: https://www.youtube.com/watch?v=zduSFxRajkE