RoBERTa Pretrained on Smaller Datasets

We pretrain RoBERTa on smaller datasets (1M, 10M, 100M, 1B tokens). We release 3 models with lowest perplexities for each pretraining data size out of 25 runs (or 10 in the case of 1B tokens). The pretraining data reproduces that of BERT: We combine English Wikipedia and a reproduction of BookCorpus using texts from smashwords in a ratio of approximately 3:1.

Hyperparameters and Validation Perplexity

The hyperparameters and validation perplexities corresponding to each model are as follows:

Model Name	Training Size	Model Size	Max Steps	Batch Size	Validation Perplexity
roberta-base-1B-1	1B	BASE	100K	512	3.93
roberta-base-1B-2	1B	BASE	31K	1024	4.25
roberta-base-1B-3	1B	BASE	31K	4096	3.84
roberta-base-100M-1	100M	BASE	100K	512	4.99
roberta-base-100M-2	100M	BASE	31K	1024	4.61
roberta-base-100M-3	100M	BASE	31K	512	5.02
roberta-base-10M-1	10M	BASE	10K	1024	11.31
roberta-base-10M-2	10M	BASE	10K	512	10.78
roberta-base-10M-3	10M	BASE	31K	512	11.58
roberta-med-small-1M-1	1M	MED-SMALL	100K	512	153.38
roberta-med-small-1M-2	1M	MED-SMALL	10K	512	134.18
roberta-med-small-1M-3	1M	MED-SMALL	31K	512	139.39

The hyperparameters corresponding to model sizes mentioned above are as follows:

Model Size	L	AH	HS	FFN	P
BASE	12	12	768	3072	125M
MED-SMALL	6	8	512	2048	45M

(AH = number of attention heads; HS = hidden size; FFN = feedforward network dimension; P = number of parameters.)

For other hyperparameters, we select:

Peak Learning rate: 5e-4
Warmup Steps: 6% of max steps
Dropout: 0.1