---
{}
---
This is a very small uncased tokenizer for the [non-ascii version of TinyStories](https://huggingface.co/datasets/tdooms/TinyStories), based on the [original TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). It is a WordPiece tokenizer with a vocabulary of 2048 tokens.
The tokenizer is fitted strictly to the dataset above and will probably not work well in any context outside of children's stories.
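For context, WordPiece encodes each word by greedy longest-match-first lookup against the vocabulary, prefixing word-internal pieces with `##` and falling back to an unknown token when no piece matches. A minimal sketch of that lookup, using a tiny toy vocabulary rather than the actual 2048-entry one:

```python
def wordpiece_encode(word, vocab):
    """Encode one lowercased word with greedy longest-match-first WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # word-internal pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched; the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for illustration only (not the real 2048-token vocabulary).
vocab = {"story", "tell", "##ing", "##er", "[UNK]"}
print(wordpiece_encode("telling", vocab))  # -> ['tell', '##ing']
print(wordpiece_encode("dragon", vocab))   # -> ['[UNK]']
```

With only 2048 entries, most rarer words are split into several pieces or mapped to the unknown token, which is why the tokenizer degrades quickly outside its training domain.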