Model

The model uses the Pythia 160M architecture (Biderman et al., 2023): an autoregressive, GPT-NeoX-based transformer with 12 layers and 12 attention heads.
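As a rough sanity check on the model size, the parameter count of a GPT-NeoX-style decoder can be estimated from its shape. The sketch below takes the 12 layers from this card; the hidden size of 768, the padded vocabulary of 50304, and the untied input/output embeddings are assumptions taken from the published Pythia 160M configuration, not from this card. The number of attention heads does not affect the count.

```python
# Rough parameter count for a GPT-NeoX-style decoder (a sketch, not an official tool).
# From the card: 12 layers.
# Assumed (not stated in the card): hidden size 768, padded vocabulary 50304,
# untied input and output embeddings, 4x MLP expansion, biases on all linear layers.

def neox_param_count(n_layers: int, d_model: int, vocab_size: int) -> int:
    # Attention: fused QKV projection plus output projection (weights + biases).
    attn = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: d_model -> 4*d_model -> d_model (weights + biases).
    mlp = (4 * d_model * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a bias vector.
    norms = 2 * (2 * d_model)
    per_layer = attn + mlp + norms
    final_norm = 2 * d_model
    # Input embedding matrix plus a separate (untied) output head.
    embeddings = 2 * vocab_size * d_model
    return n_layers * per_layer + final_norm + embeddings


total = neox_param_count(n_layers=12, d_model=768, vocab_size=50304)
print(f"{total:,} parameters (~{total / 1e9:.1f}B)")
```

Under these assumptions roughly half of the parameters sit in the two embedding matrices, which is typical for models of this size.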

Training

The model was trained for 2 epochs with a maximum sequence length of 2048 tokens, a learning rate of 1e-4, and an effective batch size of 4.
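To make these hyperparameters concrete, the sketch below records them in a small dataclass and derives two quantities: the maximum number of tokens seen per optimizer step, and a lower bound on the number of optimizer steps. `TrainingSetup` and its methods are hypothetical names for illustration. The step estimate assumes every sequence is packed to the full 2048 tokens and that the whole 12.5M-token corpus is used for training; the card states neither, so the real step count is likely higher.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hyperparameters as reported in the card (hypothetical container)."""
    epochs: int = 2
    max_seq_len: int = 2048
    learning_rate: float = 1e-4
    effective_batch_size: int = 4

    @property
    def max_tokens_per_step(self) -> int:
        # Upper bound: every sequence in the batch is exactly max_seq_len long.
        return self.effective_batch_size * self.max_seq_len

    def min_optimizer_steps(self, corpus_tokens: int) -> int:
        # Lower bound on steps: assumes fully packed sequences (an assumption).
        steps_per_epoch = math.ceil(corpus_tokens / self.max_tokens_per_step)
        return steps_per_epoch * self.epochs


setup = TrainingSetup()
print("max tokens per step:", setup.max_tokens_per_step)
print("min optimizer steps:", setup.min_optimizer_steps(12_500_000))
```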

Training loss: 3.3809

Validation loss: 3.5118

Test loss: 3.5431
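The losses above can be converted to perplexities for easier comparison with other models. This sketch assumes the reported values are mean per-token cross-entropy in nats, which the card does not state explicitly; if they were measured in a different unit or averaged differently, the conversion would not apply.

```python
import math

# Losses as reported in the card.
losses = {"train": 3.3809, "validation": 3.5118, "test": 3.5431}

# Perplexity = exp(mean cross-entropy), valid only if the loss is in nats.
perplexities = {split: math.exp(loss) for split, loss in losses.items()}

for split, ppl in perplexities.items():
    print(f"{split}: loss={losses[split]:.4f}, perplexity={ppl:.1f}")

# Gap between validation and training loss, a rough indicator of overfitting.
generalization_gap = losses["validation"] - losses["train"]
print(f"validation - train gap: {generalization_gap:.4f}")
```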

Training data

The training data is a text-only English dataset of non-conversational text, drawn from the following corpora:

FineWeb-Edu (5M tokens) by Penedo et al. (2024)

Simple Wikipedia (5M tokens) from Wikipedia Dump

KidLM (2.5M tokens) by Nayeem and Rafiei (2024)

Total: 12.5M tokens (approximately 10M words)
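The mixture arithmetic can be checked with a few lines of Python. The token counts are the rounded figures from the list above; the tokens-per-word ratio is derived from the card's approximate word count and is therefore only indicative.

```python
# Rounded per-corpus token counts as listed in the card.
corpus_tokens = {
    "FineWeb-Edu": 5_000_000,
    "Simple Wikipedia": 5_000_000,
    "KidLM": 2_500_000,
}

total_tokens = sum(corpus_tokens.values())

# Share of each corpus in the training mixture.
shares = {name: count / total_tokens for name, count in corpus_tokens.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")

# The card gives roughly 10M words for the same data.
approx_words = 10_000_000
tokens_per_word = total_tokens / approx_words
print(f"total tokens: {total_tokens:,}, tokens per word: {tokens_per_word:.2f}")
```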

All corpora are publicly available for research purposes. The raw training data is not redistributed with this model; users wishing to reproduce the training setup should obtain each corpus through the official sources linked above.

References

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., ... & Van Der Wal, O. (2023, July). Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (pp. 2397-2430). PMLR.

Nayeem, M. T., & Rafiei, D. (2024, November). KidLM: Advancing language models for children–early insights and future directions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 4813-4836).

Penedo, G., Kydlíček, H., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., & Wolf, T. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37, 30811-30849.
