apehex committed 3ec4efe (parent: 24a7100)

Update README.md

Files changed (1): README.md (+208 -5)
README.md CHANGED
Removed the placeholder sections "## Model description" and "## Intended uses & limitations" (both stubbed with "More information needed") and filled in "## Training and evaluation data"; the updated README follows.
---
library_name: keras
---

# tokun

> `to-kun` took tokens to t-can

Current tokenizers have notorious issues that are dragging all LLMs down.

`tokun` is a model specialized in text embedding.
It is **lossless** while providing **high input compression**.

`tokun` produces vectors of dimension 256, each equivalent to 64 UTF-32-BE bytes.
I.e. each embedding can be thought of as a *token of length 16 characters*.

But these vectors are more than basic IDs: they keep meaningful information on their constituent parts.
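
To see where these numbers come from, here is the arithmetic in plain Python (illustration only, independent of the model code):

```python
# every Unicode code point takes exactly 4 bytes in UTF-32-BE
__chunk = 'tokenization'.ljust(16)       # any text, padded to a 16-character chunk
__bytes = __chunk.encode('utf-32-be')    # fixed-width encoding

print(len(__bytes))
# 64, i.e. 16 characters * 4 bytes: the span covered by one embedding of dimension 256
```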
18
+
19
+ ## Features
20
+
21
+ The model produces vector embeddings that can be directly ingested by another model.
22
+
23
+ Regular tokens are unrelated IDs, while `tokun` has the following properties:
24
+
25
+ - **international**: `tokun` performs evenly on the whole Unicode space
26
+ - **compression**: the sequence length is divided by 16
27
+ - **embeddings**: the output vectors have only a dimension 256
28
+ - **lossless**: embeddings store all the information up to the byte level
29
+ - **built-ins**: Unicode has built-in special tokens, no need for `<|im_start|>`
30
+ - **meaningful**: embeddings are natively related to each-other based on their parts
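
To make the **built-ins** point concrete: Unicode already reserves control characters (for instance U+0002 START OF TEXT and U+0003 END OF TEXT) that can serve as delimiters. The snippet below is only a hedged sketch of that idea; the chosen delimiters are an assumption for illustration, not a `tokun` API.

```python
# hypothetical chat delimiters built from Unicode control characters, instead of
# custom strings like '<|im_start|>'; the specific code points are an assumption
STX = '\u0002'  # START OF TEXT
ETX = '\u0003'  # END OF TEXT

def wrap(role: str, content: str) -> str:
    # the delimiters are ordinary characters, so they stay plain UTF-32-BE bytes
    return f'{STX}{role}\n{content}{ETX}'

print(wrap('user', 'Translate "token" into French.').encode('utf-32-be')[:8])
```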

## Installation

In all cases, the model requires the code from the package `tokun`:

```shell
pip install tokun
```

### From Hugging Face

Log in to Hugging Face:

```shell
huggingface-cli login
```

Download the repository:

```python
import huggingface_hub as hh

api = hh.HfApi()
api.snapshot_download(repo_id='apehex/tokun', local_dir='tokun/')
```

Import the tokenizer and model:

```python
import tokun.huggingface

tokenizer = tokun.huggingface.ByteTokenizer()
model = hh.from_pretrained_keras('tokun/variants/4x16/')
```

### With Base TensorFlow / Keras

You can directly load the weights [from the repository](../models/).

For the most performant variant of the model, `4x16`:

```python
import tensorflow as tf
import tokun.model
import urllib.request

urllib.request.urlretrieve('https://github.com/apehex/tokun/raw/main/models/4x16/1/6.3.keras', 'model.keras')
model = tf.keras.models.load_model('model.keras')
```

## Usage

Since it is small (between 1 and 2M parameters depending on the variant), the model can also be [trained on Google Colab][notebook-file-tokun-train].
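
To check the size of the variant you loaded, the standard Keras calls are enough (quick sketch):

```python
# number of parameters of the loaded variant (standard Keras API)
print(model.count_params())
# layer-by-layer breakdown
model.summary()
```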

We will be encoding and decoding the following sample:

```python
__s = """Une unité lexicale ou token lexical ou plus simplement token est un couple composé d'un nom et d'une valeur optionnelle (e.g. 135677)."""
```

### With Hugging Face

The sequence dimension is fixed to 512 because exporting the Keras model requires specifying the input shape.
So the sample is padded to `16 * 512` characters, i.e. `64 * 512` bytes.

```python
import tensorflow as tf
import tokun.pipeline

# encode with UTF-32-BE
__x = tokenizer.batch_encode_plus(batch_text_or_text_pairs=[__s], padding='max_length', max_length=64 * 512, add_special_tokens=False)
__x = tf.convert_to_tensor(__x['input_ids'])
# tokenize
__e = model.layers[1](__x) # encoder
# these embeddings would be the input of an LLM
__o = llm(__e) # replace with your LLM
# detokenize
__p = model.layers[2](__o) # decoder
# interpret probabilities as byte indexes
__y = tokun.pipeline.postprocess(__p)
```

```python
print(len(__s))
# 252
print(__x.shape) # 16 * 512 characters = 64 * 512 bytes
# (1, 32768)
print(__e.shape) # 512 embeddings
# (1, 512, 256)
print(__p.shape) # back to x shape
# (1, 32768, 256)
```

> Note: the base TensorFlow implementation operates on any sequence dimension (see below)

### With Base TensorFlow / Keras

```python
import tokun.pipeline

# encode the text as UTF-32-BE bytes, padded to a multiple of 4 * 16 = 64
__x = tokun.pipeline.preprocess(text=__s, groups=[4, 16], expand=[1], flatten=True)
# tokenize
__e = model._encoder(__x) # final embedding = input for another model
# these embeddings would be the input of an LLM
__o = llm(__e) # replace with your LLM
# detokenize
__p = model._decoder(__o)
# interpret probabilities as byte indexes
__y = tokun.pipeline.postprocess(__p)
```

The original version doesn't fix the sequence dimension:

```python
print(len(__s))
# 252
print(__x.shape) # 4 * 252 = 1008 bytes, padded to 1024
# (1, 1024)
print(__e.shape) # 1024 / 64 = 16 embeddings
# (1, 16, 256)
print(__p.shape) # back to x shape
# (1, 1024, 256)
```
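
Both paths end with `__y`, the byte values recovered from the output probabilities. Assuming `postprocess` returns a flat sequence of byte values (an assumption here; check `tokun.pipeline` for the exact output type), the text can be restored with standard UTF-32-BE decoding:

```python
# hypothetical final step, not part of the documented API: under the assumption
# that __y is a flat sequence of integers in [0, 255]
__raw = bytes(int(__b) for __b in __y)
print(__raw.decode('utf-32-be', errors='ignore'))
```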

## Training and evaluation data

`tokun` was **trained on random sequences** of UTF-32-BE bytes, so that it covers the first 4 planes of Unicode.
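
As a rough sketch of what such random samples can look like (illustrative only; the actual data generators live in the `tokun` repository):

```python
import random

def random_sample(size: int=16) -> bytes:
    # draw code points from the first 4 Unicode planes (U+0000 to U+3FFFF),
    # skipping the surrogate range which cannot be encoded
    __codes = []
    while len(__codes) < size:
        __c = random.randrange(0, 0x40000)
        if not (0xD800 <= __c <= 0xDFFF):
            __codes.append(__c)
    return ''.join(chr(__c) for __c in __codes).encode('utf-32-be')

print(len(random_sample(16)))
# 64, i.e. one embedding worth of bytes
```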

Validation was also performed on the 7 languages of [MLQA][github-mlqa] to make sure the model keeps its accuracy on regular text.

## Resources

### Notebooks

Final model:

- train: [File][notebook-file-tokun-train] / [Colab][notebook-colab-tokun-train]
- demo: [File][notebook-file-tokun-demo] / [Colab][notebook-colab-tokun-demo]

Older / simpler model iterations:

- `tokun-1`: [File][notebook-file-tokun-1] / [Colab][notebook-colab-tokun-1]
- `tokun-4`: [File][notebook-file-tokun-4] / [Colab][notebook-colab-tokun-4]
- `tokun-16`: [File][notebook-file-tokun-16] / [Colab][notebook-colab-tokun-16]

### Articles

Main article:

- on [Github][article-file-tokun]
- on [Hugging Face][article-hugging-face]

Notes on each iteration:

- `tokun-1`: [Github][article-file-tokun-1]
- `tokun-4`: [Github][article-file-tokun-4]
- `tokun-16`: [Github][article-file-tokun-16]

## TODO

See [TODO](TODO.md).

## Credits

This project was inspired by a video from Andrej Karpathy, ["Let's build the GPT tokenizer"][youtube-karpathy-tokenizer].

## License

Licensed under the [AGPLv3](LICENSE.md).

[article-file-tokun]: https://github.com/apehex/tokun/blob/main/articles/tokun.md
[article-file-tokun-1]: https://github.com/apehex/tokun/blob/main/articles/tokun.1.md
[article-file-tokun-4]: https://github.com/apehex/tokun/blob/main/articles/tokun.4.md
[article-file-tokun-16]: https://github.com/apehex/tokun/blob/main/articles/tokun.16.md
[article-hugging-face]: https://huggingface.co/blog/apehex/tokenization-is-a-dead-weight
[article-notion-tokun-1]: https://apehex.notion.site/Tokun-1-e03c438a39fe49fcb2ce303eb63b2e73
[article-notion-tokun-4]: https://apehex.notion.site/Tokun-4-c8b4a3bd1270485a908287869553e9f2
[article-notion-tokun-16]: https://apehex.notion.site/Tokun-16-ecf35d5207ab401d85d3aa21d0b09538

[github-mlqa]: https://github.com/facebookresearch/MLQA

[notebook-colab-tokun-1]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.1.ipynb
[notebook-colab-tokun-4]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.4.ipynb
[notebook-colab-tokun-16]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.16.ipynb
[notebook-colab-tokun-demo]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.demo.ipynb
[notebook-colab-tokun-train]: https://colab.research.google.com/github/apehex/tokun/blob/main/notebooks/tokun.train.ipynb
[notebook-file-tokun-1]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.1.ipynb
[notebook-file-tokun-4]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.4.ipynb
[notebook-file-tokun-16]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.16.ipynb
[notebook-file-tokun-demo]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.demo.ipynb
[notebook-file-tokun-train]: https://github.com/apehex/tokun/blob/main/notebooks/tokun.train.ipynb
[notebook-hf-tokun-demo]: ../notebooks/tokun.demo.ipynb
[notebook-hf-tokun-train]: ../notebooks/tokun.train.ipynb
[notebook-kaggle-tokun-demo]: ../notebooks/tokun.demo.ipynb
[notebook-kaggle-tokun-train]: ../notebooks/tokun.train.ipynb

[youtube-karpathy-tokenizer]: https://www.youtube.com/watch?v=zduSFxRajkE