Model

The model is Pythia 160M (Biderman et al., 2023), an autoregressive language model based on GPT-NeoX, with 12 layers and 12 attention heads.
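As a rough sanity check on the architecture, the sketch below counts the parameters of a 12-layer GPT-NeoX-style decoder. The hidden size (768), feed-forward size (3072), and vocabulary size (50304) are assumptions taken from the public Pythia 160M configuration, not from this card; a different tokenizer would change the embedding terms. The number of attention heads does not affect the count.

```python
def gpt_neox_param_count(n_layers, d_model, d_ff, vocab_size):
    """Count learnable parameters of a GPT-NeoX-style decoder with untied embeddings."""
    embed_in = vocab_size * d_model             # input token embeddings
    embed_out = vocab_size * d_model            # output projection (untied, no bias)
    qkv = d_model * 3 * d_model + 3 * d_model   # fused query/key/value projection + bias
    attn_out = d_model * d_model + d_model      # attention output projection + bias
    mlp_up = d_model * d_ff + d_ff              # feed-forward expansion + bias
    mlp_down = d_ff * d_model + d_model         # feed-forward contraction + bias
    layer_norms = 2 * 2 * d_model               # two LayerNorms per layer, weight + bias each
    per_layer = qkv + attn_out + mlp_up + mlp_down + layer_norms
    final_norm = 2 * d_model                    # final LayerNorm, weight + bias
    return embed_in + embed_out + n_layers * per_layer + final_norm


# Assumed values (public Pythia 160M configuration); only n_layers comes from this card.
total = gpt_neox_param_count(n_layers=12, d_model=768, d_ff=3072, vocab_size=50304)
print(f"{total:,} parameters")
```

Under these assumptions the total is about 162M parameters, of which roughly 85M are outside the embedding matrices.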

Training

The model was trained for 2 epochs with a maximum sequence length of 2048 tokens, a learning rate of 1e-4, and an effective batch size of 4.
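To make the batch settings concrete, here is a minimal sketch of how an effective batch size of 4 bounds the number of tokens per optimizer step. The decomposition into per-device batch size, gradient accumulation steps, and device count is hypothetical (the card only reports the product), and the step estimate assumes that all 12.5M tokens are training data and that every sequence is packed to the full 2048 tokens, so it is a lower bound at best.

```python
import math

MAX_SEQ_LEN = 2048           # from the card
EFFECTIVE_BATCH_SIZE = 4     # from the card
EPOCHS = 2                   # from the card
CORPUS_TOKENS = 12_500_000   # total reported on the card; assumed to be all training data


def effective_batch_size(per_device, grad_accum_steps, n_devices):
    """Sequences contributing to one optimizer step."""
    return per_device * grad_accum_steps * n_devices


# Hypothetical decomposition: 1 sequence per device, 4 accumulation steps, 1 device.
assert effective_batch_size(1, 4, 1) == EFFECTIVE_BATCH_SIZE

# Upper bound on tokens per optimizer step (reached only if every sequence is full).
max_tokens_per_step = EFFECTIVE_BATCH_SIZE * MAX_SEQ_LEN

# Lower bound on optimizer steps under the packing assumption above.
min_steps_per_epoch = math.ceil(CORPUS_TOKENS / max_tokens_per_step)
min_total_steps = EPOCHS * min_steps_per_epoch

print(max_tokens_per_step, min_steps_per_epoch, min_total_steps)
```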

Training loss: 2.688

Validation loss: 2.54

Test loss: 2.497
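The losses can be read as perplexities if they are mean per-token cross-entropy values in nats, which is the usual convention for causal language model training but is an assumption here, since the card does not state the unit. A minimal sketch of the conversion:

```python
import math

# Losses as reported on the card; assumed to be mean cross-entropy per token, in nats.
losses = {"train": 2.688, "validation": 2.54, "test": 2.497}

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexities = {split: math.exp(loss) for split, loss in losses.items()}

for split, ppl in perplexities.items():
    print(f"{split}: loss {losses[split]:.3f} -> perplexity {ppl:.2f}")
```

Under that assumption the perplexities come out at roughly 14.7 (training), 12.7 (validation), and 12.1 (test).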

Training data

A text-only English dataset of transcribed spontaneous speech, made up of portions of the following corpora:

CHILDES (8M tokens) by MacWhinney (2000)

BNC Spoken (2.2M tokens)

Switchboard (1.9M tokens) by Godfrey et al. (1992)

CallHome (0.32M tokens) by Canavan et al. (1997)

CallFriend (0.07M tokens) by Canavan and Zipperlen (1996a, 1996b)

Total 12.5M tokens (approximately 10M words)
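The composition above can be checked with a few lines of arithmetic. The sketch below sums the per-corpus token counts as listed, computes each corpus's share, and derives the tokens-per-word ratio implied by the reported totals; the variable names are illustrative only.

```python
# Token counts in millions, as listed on the card.
corpus_tokens_m = {
    "CHILDES": 8.0,
    "BNC Spoken": 2.2,
    "Switchboard": 1.9,
    "CallHome": 0.32,
    "CallFriend": 0.07,
}

# Sum of the listed portions; the card rounds this to 12.5M.
total_m = sum(corpus_tokens_m.values())

# Fraction of the dataset contributed by each corpus.
shares = {name: count / total_m for name, count in corpus_tokens_m.items()}

# Ratio implied by the reported totals (12.5M tokens, approximately 10M words).
tokens_per_word = 12.5 / 10.0

print(f"total: {total_m:.2f}M tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"tokens per word: {tokens_per_word}")
```

The listed portions sum to 12.49M tokens, consistent with the rounded total, and CHILDES accounts for close to two thirds of the data.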

The CHILDES, BNC Spoken, and Switchboard portions were distributed by the BabyLM committee as part of the challenge's training data (Warstadt et al., 2023), while the CallHome and CallFriend portions were taken from TalkBank. The raw training data is not redistributed with this model; users who wish to reproduce the training setup should obtain each corpus from its official source.

References

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., ... & Van Der Wal, O. (2023, July). Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (pp. 2397-2430). PMLR.

Canavan, A., Graff, D., & Zipperlen, G. (1997). CALLHOME American English Speech. Philadelphia: Linguistic Data Consortium.

Canavan, A., & Zipperlen, G. (1996a). CALLFRIEND American English-Non-Southern Dialect LDC96S46. Philadelphia: Linguistic Data Consortium.

Canavan, A., & Zipperlen, G. (1996b). CALLFRIEND American English-Southern Dialect LDC96S47. Philadelphia: Linguistic Data Consortium.

Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992, March). SWITCHBOARD: Telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 517-520). IEEE.

MacWhinney, B. (2000). The CHILDES project: The database (Vol. 2). Psychology Press.

Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., ... & Cotterell, R. (2023, December). Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning (pp. 1-34).

Model size: 0.2B params (Safetensors, tensor type F32)
