---
{}
---
This is a very small uncased tokenizer for the [non-ascii version of TinyStories](https://huggingface.co/datasets/tdooms/TinyStories), based on the [original TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). It is a WordPiece tokenizer with a vocabulary size of 2048.
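
For reference, a tokenizer like this can be trained with the Hugging Face `tokenizers` library along the following lines. This is only a sketch: the special tokens, the normalization and pre-tokenization steps, and the `text` column name are assumptions, not the exact settings used here.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Dataset this tokenizer is fitted to (the "text" column name is an assumption).
dataset = load_dataset("tdooms/TinyStories", split="train")

# Uncased WordPiece model with a 2048-token vocabulary.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=2048,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],  # assumed special tokens
)

def text_batches(batch_size=1000):
    # Stream the training text in batches instead of materializing one big list.
    for start in range(0, len(dataset), batch_size):
        yield dataset[start : start + batch_size]["text"]

tokenizer.train_from_iterator(text_batches(), trainer=trainer, length=len(dataset))
tokenizer.save("tokenizer.json")
```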

The tokenizer is fitted strictly to the dataset above and will probably not work well outside the domain of children's stories.
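
A minimal loading example is shown below; the repository id is a placeholder, so substitute this repo's actual id.

```python
from transformers import AutoTokenizer

# Placeholder repository id; replace with the id of this repo.
tokenizer = AutoTokenizer.from_pretrained("username/tinystories-tokenizer")
print(tokenizer.tokenize("once upon a time there was a tiny robot"))
```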