Model
The model is Pythia 160M (Biderman et al., 2023), an autoregressive language model based on the GPT-NeoX architecture, with 12 layers and 12 attention heads.
Training
The model was trained for 2 epochs with a maximum sequence length of 2048 tokens, a learning rate of 1e-4, and an effective batch size of 4.
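As a rough sanity check on these hyperparameters, the number of optimizer steps can be estimated from the figures in this card. This is only a sketch under assumptions the card does not confirm: that the whole 12.5M-token corpus described under Training data is used for training, that text is packed into full 2048-token sequences, and that "effective batch size" counts sequences per optimizer step. The function name `estimate_optimizer_steps` is illustrative and not part of any training script.

```python
import math


def estimate_optimizer_steps(total_tokens, seq_len, effective_batch, epochs):
    """Upper-bound estimate of optimizer steps, assuming fully packed sequences."""
    # Number of sequences needed to hold every token (the last one may be partial).
    sequences = math.ceil(total_tokens / seq_len)
    # Number of batches per pass over the data (the last one may be partial).
    steps_per_epoch = math.ceil(sequences / effective_batch)
    return steps_per_epoch * epochs


# Assumed inputs: 12.5M tokens, 2048-token sequences, effective batch 4, 2 epochs.
steps = estimate_optimizer_steps(12_500_000, 2048, 4, 2)
print(f"Estimated optimizer steps: {steps}")
```

If part of the corpus was held out for validation and testing, or if sequences were not packed, the real step count would differ from this estimate.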
Training loss: 3.3809
Validation loss: 3.5118
Test loss: 3.5431
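The losses above can be converted to perplexities, which are often easier to compare across models. The sketch below assumes the reported values are mean per-token cross-entropy in nats (natural logarithm), which is the usual convention for causal language model training but is not stated in this card. The helper name `perplexity` is illustrative.

```python
import math

# Losses reported in this card; assumed to be mean per-token cross-entropy in nats.
losses = {"train": 3.3809, "validation": 3.5118, "test": 3.5431}


def perplexity(loss_nats):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss_nats)


perplexities = {split: perplexity(loss) for split, loss in losses.items()}

for split, ppl in perplexities.items():
    print(f"{split}: loss={losses[split]:.4f}, perplexity={ppl:.2f}")
```

If the losses were instead measured in bits, the base of the exponent would be 2 rather than e.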
Training data
Text-only English dataset of non-conversational text, comprising the following corpora:
FineWeb-Edu (5M tokens) by Penedo et al. (2024)
Simple Wikipedia (5M tokens) from a Wikipedia dump
KidLM (2.5M tokens) by Nayeem and Rafiei (2024)
Total: 12.5M tokens (approximately 10M words)
All corpora are publicly available for research purposes. The raw training data is not redistributed with this model; users wishing to reproduce the training setup should obtain each corpus through the official sources linked above.
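The corpus mix listed above can be checked with a few lines of arithmetic. The sketch below uses the rounded token counts from this card, so the resulting shares and the tokens-per-word ratio are approximate; the variable names are illustrative.

```python
# Rounded token counts as listed in this card.
corpus_tokens = {
    "FineWeb-Edu": 5_000_000,
    "Simple Wikipedia": 5_000_000,
    "KidLM": 2_500_000,
}

total_tokens = sum(corpus_tokens.values())

# Fraction of the training mix contributed by each corpus.
shares = {name: count / total_tokens for name, count in corpus_tokens.items()}

# The card gives roughly 10M words for the same data.
approx_words = 10_000_000
tokens_per_word = total_tokens / approx_words

for name, share in shares.items():
    print(f"{name}: {share:.0%} of tokens")
print(f"Total tokens: {total_tokens:,}")
print(f"Approximate tokens per word: {tokens_per_word:.2f}")
```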
References
Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., ... & Van Der Wal, O. (2023, July). Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (pp. 2397-2430). PMLR.
Nayeem, M. T., & Rafiei, D. (2024, November). KidLM: Advancing language models for children–early insights and future directions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 4813-4836).
Penedo, G., Kydlíček, H., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., & Wolf, T. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37, 30811-30849.