Tokenisers
A collection of 2 tokenisers I have trained (so you don't have to).
ULM (unigram language model) tokeniser with a vocabulary size of 32768, trained on the first 3 million examples of SlimPajama-627B.
ULM trainer implementation: SentencePieceTrainer.KudoPieceTrainer
Preprocessor: SentencePiecePreprocessor with ModernEnglishPreprocessor (word-boundary marker: Ġ)
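The Ġ (U+0120) marker mentioned above is the byte-level stand-in for the space character popularised by GPT-2's tokeniser. A minimal sketch of that convention using the Hugging Face tokenizers library (assuming its ByteLevel pre-tokeniser matches the behaviour the card's preprocessor refers to; this is an illustration, not the card's actual preprocessing code):

```python
from tokenizers.pre_tokenizers import ByteLevel

# ByteLevel pretokenisation replaces the space character with the
# visible marker "Ġ" (U+0120), so word boundaries survive inside
# the tokeniser's vocabulary.
pre = ByteLevel(add_prefix_space=True)
words = [w for w, _ in pre.pre_tokenize_str("hello world")]
print(words)  # each word carries a leading "Ġ" in place of the space
```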
Training time: 3h40m
Memory: 257 GiB peak usage (i.e. about 85 GiB RAM per million examples).
Data sizes: