---
{}
---
This is a very small uncased tokenizer for the [non-ASCII version of TinyStories](https://huggingface.co/datasets/tdooms/TinyStories), which is based on the [original TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). I use a WordPiece tokenizer with a vocabulary of 2048 tokens.
The tokenizer is fitted only to this dataset and probably won't work well on anything other than children's stories.
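
To illustrate why a small, uncased WordPiece vocabulary struggles outside its training domain, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece encoding. It is not the code used to build this tokenizer: the function name `wordpiece_encode`, the `TOY_VOCAB` entries, and the whitespace-only pre-tokenization are all assumptions made for the example, and the real vocabulary has 2048 entries rather than a handful.

```python
def wordpiece_encode(text, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece encoding (illustrative sketch).

    Assumes whitespace-only pre-tokenization; real tokenizers usually also
    split on punctuation.
    """
    tokens = []
    for word in text.lower().split():  # uncased: lowercase before matching
        pieces, start = [], 0
        while start < len(word):
            end, match = len(word), None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                piece = word[start:end]
                if start > 0:
                    piece = prefix + piece  # continuation pieces carry "##"
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:
                # No piece fits: the whole word becomes the unknown token.
                pieces = [unk]
                break
            pieces.append(match)
            start = end
        tokens.extend(pieces)
    return tokens


# Hypothetical toy vocabulary, NOT the actual 2048-entry vocabulary.
TOY_VOCAB = {
    "once", "upon", "a", "time", "the", "cat", "play", "##ed", "##s",
    "t", "##o", "##k", "##e", "##n", "##i", "##z", "##r",
}

in_domain = wordpiece_encode("Once upon a time the cat played", TOY_VOCAB)
out_domain = wordpiece_encode("Tokenizer", TOY_VOCAB)
print(in_domain)
print(out_domain)
```

With this toy vocabulary, the story-like sentence maps to mostly whole-word tokens, while a word the vocabulary was not fitted to is split into single-character pieces or replaced by `[UNK]`.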