---
{}
---

This is a very small uncased tokenizer for the [non-ASCII version of TinyStories](https://huggingface.co/datasets/tdooms/TinyStories), based on the [original TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). I use a WordPiece tokenizer with a vocabulary size of 2048. The tokenizer is fitted strictly to that dataset and will probably not work well in any context outside of children's stories.
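
As a rough illustration, the sketch below shows how a comparable uncased WordPiece tokenizer with a 2048-token vocabulary could be trained on the linked dataset using the `tokenizers` library. The special tokens, the `text` column name, and the exact normalization and pre-tokenization steps are assumptions for the example, not a record of how this tokenizer was actually built.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

# Load the dataset the tokenizer is fitted to
# (assumes a "train" split with a "text" column).
stories = load_dataset("tdooms/TinyStories", split="train")

# Uncased WordPiece model: the normalizer lowercases all input,
# and a simple whitespace pre-tokenizer splits the text into words.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Vocabulary of 2048 tokens; the choice of special tokens is an assumption.
trainer = WordPieceTrainer(vocab_size=2048, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator((row["text"] for row in stories), trainer=trainer)

# Quick sanity check on an in-domain sentence.
print(tokenizer.encode("once upon a time there was a tiny robot").tokens)
```

Because the vocabulary is learned only from these stories, the merges reflect the dataset's simple, repetitive vocabulary; text from other domains will fragment into many subword pieces or `[UNK]` tokens.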