---
library_name: transformers
datasets:
- HuggingFaceTB/smollm-corpus
---
# Doge-tokenizer
Tokenizer for training models on [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus). It also supports reasoning fine-tuning in the style of R1.
This tokenizer was trained on 2M samples from:
- FineWeb-Edu 70%
- Cosmopedia v2 20%
- Python-Edu 5%
- FineMath 5%
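
As a rough illustration of what the mixture above implies, the sketch below splits the sample budget across the four sources. It assumes that "2M" means exactly 2,000,000 and that the percentages apply to sample counts rather than tokens; the card states neither. The helper `samples_per_source` is a hypothetical name used only for this example and is not part of any library.

```python
# Sketch only: the per-source counts are derived from the percentages listed
# above, under the assumption that they apply to sample counts (not tokens).
TOTAL_SAMPLES = 2_000_000  # assumption: "2M" taken as exactly 2,000,000

MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 20,
    "Python-Edu": 5,
    "FineMath": 5,
}


def samples_per_source(total, mixture_percent):
    """Hypothetical helper: split a sample budget by integer percentages."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {name: total * pct // 100 for name, pct in mixture_percent.items()}


counts = samples_per_source(TOTAL_SAMPLES, MIXTURE_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n:,} samples")
```

Under these assumptions, FineWeb-Edu contributes 1,400,000 samples, Cosmopedia v2 400,000, and Python-Edu and FineMath 100,000 each.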