
Model

Pythia 160M (Biderman et al., 2023), an autoregressive language model based on the GPT-NeoX architecture, with 12 layers and 12 attention heads.
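A minimal sketch of loading the checkpoint for generation with the Hugging Face `transformers` library. It assumes the repository id is `anonymous-sub1/no-dialog-model` and that the repository ships a tokenizer alongside the weights; neither is confirmed by this card. The helper names `load_model` and `generate` are introduced here for illustration only.

```python
# Assumed repository id; replace it if the model is published under another name.
MODEL_ID = "anonymous-sub1/no-dialog-model"


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and causal LM. Requires `transformers` and `torch`."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def generate(prompt: str, max_new_tokens: int = 40) -> str:
    """Continue `prompt` with the model's default decoding settings."""
    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("The water cycle is")` would download the weights on first use. Because the model was trained on non-conversational text, plain continuation prompts are likely a better fit than chat-style prompts.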

Training

The model was trained for 2 epochs with a maximum sequence length of 2048 tokens, a learning rate of 1e-4, and an effective batch size of 4.
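As a rough illustration of what these settings imply, the sketch below estimates the number of optimizer steps. It assumes every sequence is packed to the full 2048 tokens and that the whole 12.5M-token corpus listed under Training data is used for training. Both are assumptions: the card does not describe packing, and part of the data is presumably held out for validation and testing, so the result is an upper bound.

```python
import math

EPOCHS = 2
MAX_SEQ_LEN = 2048            # tokens per sequence
EFFECTIVE_BATCH_SIZE = 4      # sequences per optimizer step
TOTAL_TOKENS = 12_500_000     # assumed: full corpus used for training


def optimizer_steps(total_tokens: int, seq_len: int, batch_size: int, epochs: int) -> int:
    """Upper-bound estimate of optimizer steps, assuming fully packed sequences."""
    tokens_per_step = seq_len * batch_size
    steps_per_epoch = math.ceil(total_tokens / tokens_per_step)
    return steps_per_epoch * epochs


tokens_per_step = MAX_SEQ_LEN * EFFECTIVE_BATCH_SIZE
total_steps = optimizer_steps(TOTAL_TOKENS, MAX_SEQ_LEN, EFFECTIVE_BATCH_SIZE, EPOCHS)
print(f"tokens per step: {tokens_per_step}, estimated steps: {total_steps}")
```

Under these assumptions each step processes 8192 tokens, which gives about 1526 steps per epoch.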

Training loss: 3.3809

Validation loss: 3.5118

Test loss: 3.5431
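The losses above can be converted to perplexities for easier comparison. The sketch assumes they are mean per-token cross-entropy values in nats, which is the usual convention for causal language model training but is not stated on this card.

```python
import math

# Losses as reported above; assumed to be mean per-token cross-entropy in nats.
losses = {"train": 3.3809, "validation": 3.5118, "test": 3.5431}

# Perplexity is the exponential of the mean cross-entropy.
perplexities = {split: math.exp(loss) for split, loss in losses.items()}

for split, ppl in perplexities.items():
    print(f"{split}: loss={losses[split]:.4f}, perplexity={ppl:.2f}")
```

Under that assumption the three perplexities fall roughly between 29 and 35.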

Training data

Text-only English dataset of non-conversational text, comprising the following corpora:

FineWeb-Edu (5M tokens) by Penedo et al. (2024)

Simple Wikipedia (5M tokens) from Wikipedia Dump

KidLM (2.5M tokens) by Nayeem and Rafiei (2024)

Total: 12.5M tokens (approximately 10M words)
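The corpus sizes listed above can be checked with a few lines of arithmetic. The sketch recomputes the total, each corpus's share of the mix, and the words-per-token ratio implied by the approximate 10M-word figure. The dictionary keys are labels chosen here for illustration.

```python
# Token counts as listed on this card.
corpus_tokens = {
    "FineWeb-Edu": 5_000_000,
    "Simple Wikipedia": 5_000_000,
    "KidLM": 2_500_000,
}
APPROX_WORDS = 10_000_000  # approximate word count reported on the card

total_tokens = sum(corpus_tokens.values())
shares = {name: count / total_tokens for name, count in corpus_tokens.items()}
words_per_token = APPROX_WORDS / total_tokens

for name, share in shares.items():
    print(f"{name}: {share:.0%} of tokens")
print(f"total tokens: {total_tokens:,}, words per token: {words_per_token:.2f}")
```

The mix works out to 40% FineWeb-Edu, 40% Simple Wikipedia, and 20% KidLM, with about 0.8 words per token.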

All corpora are publicly available for research purposes. The raw training data is not redistributed with this model; users who wish to reproduce the training setup should obtain each corpus from its official source, as cited above.

References

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., ... & Van Der Wal, O. (2023, July). Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (pp. 2397-2430). PMLR.

Nayeem, M. T., & Rafiei, D. (2024, November). KidLM: Advancing language models for children – early insights and future directions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 4813-4836).

Penedo, G., Kydlíček, H., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., & Wolf, T. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37, 30811-30849.
