Model
Pythia 160M (Biderman et al., 2023), an autoregressive language model based on the GPT-NeoX architecture, with 12 layers and 12 attention heads per layer.
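As a rough illustration of where the "160M" figure comes from, the sketch below counts parameters for a GPT-NeoX-style stack with 12 layers. The hidden size (768) and padded vocabulary size (50304) are assumptions taken from the published Pythia 160M configuration, not from this card, and the layout (fused QKV projection with biases, a 4x feed-forward expansion, two layer norms per block, untied input and output embeddings) is the standard one rather than something verified against this checkpoint.

```python
# Assumed values: not stated in this card, taken from the published
# Pythia 160M configuration.
HIDDEN = 768    # model (hidden) dimension
LAYERS = 12     # number of transformer blocks, as stated in the card
HEADS = 12      # attention heads per block, as stated in the card
VOCAB = 50304   # padded vocabulary size


def layer_params(d: int) -> int:
    """Parameter count of one GPT-NeoX-style block of width d."""
    attn_qkv = d * 3 * d + 3 * d   # fused query/key/value projection + bias
    attn_out = d * d + d           # attention output projection + bias
    mlp_in = d * 4 * d + 4 * d     # feed-forward expansion + bias
    mlp_out = 4 * d * d + d        # feed-forward contraction + bias
    layer_norms = 2 * (2 * d)      # two layer norms, weight and bias each
    return attn_qkv + attn_out + mlp_in + mlp_out + layer_norms


# The head count splits the hidden dimension but does not change the total.
head_dim = HIDDEN // HEADS

non_embedding = LAYERS * layer_params(HIDDEN) + 2 * HIDDEN  # + final layer norm
embedding = 2 * VOCAB * HIDDEN  # untied input and output embeddings
total = non_embedding + embedding

print(f"head dimension:       {head_dim}")
print(f"non-embedding params: {non_embedding:,}")
print(f"embedding params:     {embedding:,}")
print(f"total params:         {total:,}")
```

Under these assumptions the total comes to roughly 162M parameters, of which about 85M sit outside the embedding matrices.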
Training
The model was trained for 2 epochs with a maximum sequence length of 2048 tokens, a learning rate of 1e-4, and an effective batch size of 4.
Training loss: 2.688
Validation loss: 2.54
Test loss: 2.497
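The losses above can be read as perplexities, which are often easier to compare across models. The sketch below assumes the reported values are mean per-token cross-entropy in nats (natural logarithm); the card does not state the unit, so treat the conversion as illustrative only.

```python
import math

# Loss values as reported in this card.
losses = {"train": 2.688, "validation": 2.54, "test": 2.497}

# Perplexity is exp(loss) when the loss is mean per-token cross-entropy in nats.
perplexities = {split: math.exp(loss) for split, loss in losses.items()}

for split, ppl in perplexities.items():
    print(f"{split:<10} loss={losses[split]:.3f}  perplexity={ppl:.2f}")
```

Under that assumption the test perplexity is a little above 12, and the training perplexity is a little below 15.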
Training data
Text-only English dataset of transcribed spontaneous speech, drawn from portions of the following corpora:
CHILDES (8M tokens) by MacWhinney (2000)
BNC Spoken (2.2M tokens)
Switchboard (1.9M tokens) by Godfrey et al. (1992)
CallHome (0.32M tokens) by Canavan et al. (1997)
CallFriend (0.07M tokens) by Canavan et al. (1996a and 1996b)
Total 12.5M tokens (approximately 10M words)
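As a quick consistency check on the figures listed above, the following sketch sums the per-corpus token counts and prints each corpus's share of the total. The counts are copied from the list (in millions of tokens); the variable names are illustrative.

```python
# Token counts in millions, as listed in this card.
tokens_m = {
    "CHILDES": 8.0,
    "BNC Spoken": 2.2,
    "Switchboard": 1.9,
    "CallHome": 0.32,
    "CallFriend": 0.07,
}

total_m = sum(tokens_m.values())
shares = {name: count / total_m for name, count in tokens_m.items()}

print(f"total: {total_m:.2f}M tokens")
for name, share in shares.items():
    print(f"{name:<12} {tokens_m[name]:>5.2f}M  {share:6.1%}")
```

The sum is 12.49M tokens, which matches the stated total of 12.5M after rounding, and CHILDES accounts for close to two thirds of the data.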
The portions from CHILDES, BNC Spoken and Switchboard were distributed by the BabyLM committee as part of the training data of the challenge (Warstadt et al., 2023), while CallHome and CallFriend were taken from TalkBank. The raw training data is not redistributed with this model; users wishing to reproduce the training setup should obtain each corpus through the official sources that have been linked.
References
Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., ... & Van Der Wal, O. (2023, July). Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (pp. 2397-2430). PMLR.
Canavan, A., Graff, D., & Zipperlen, G. (1997). CALLHOME American English Speech. Linguistic Data Consortium.
Canavan, A., & Zipperlen, G. (1996a). CALLFRIEND American English-Non-Southern Dialect LDC96S46. Philadelphia: Linguistic Data Consortium.
Canavan, A., & Zipperlen, G. (1996b). CALLFRIEND American English-Southern Dialect LDC96S47. Philadelphia: Linguistic Data Consortium.
Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992, March). SWITCHBOARD: Telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 517-520). IEEE.
MacWhinney, B. (2000). The CHILDES project: The database (Vol. 2). Psychology Press.
Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., ... & Cotterell, R. (2023, December). Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning (pp. 1-34).