TinyStories-2048 / README.md

This is a very small uncased tokenizer for the non-ASCII version of TinyStories, which is based on the original TinyStories dataset. I use a WordPiece tokenizer with a vocabulary size of 2048 tokens.

The tokenizer is fitted only to this dataset and probably won't work well on anything other than children's stories.
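
To give a rough idea of what an uncased WordPiece tokenizer does at encoding time, here is a minimal pure-Python sketch. It is not this tokenizer's actual implementation: `TOY_VOCAB`, `wordpiece_encode_word`, and `encode` are hypothetical names, the vocabulary is invented for illustration (the real one has 2048 entries), and splitting on whitespace only is a simplification.

```python
# Toy vocabulary, invented for illustration; the real tokenizer has 2048 entries.
TOY_VOCAB = {"[UNK]", "once", "upon", "a", "time", "play", "##ing", "##ed", "##s", "dog"}


def wordpiece_encode_word(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase the text (uncased), split on whitespace, then apply WordPiece per word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_encode_word(word, vocab))
    return tokens
```

With this toy vocabulary, `encode("Playing dogs", TOY_VOCAB)` gives `["play", "##ing", "dog", "##s"]`, while a word from outside the domain such as `"Tokenizer"` collapses to `["[UNK]"]`. With a small vocabulary fitted to one dataset, text from other domains tends to end up heavily fragmented or unknown in the same way.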