RoBERTa Pretrained on Smaller Datasets

We pretrain RoBERTa on smaller datasets (1M, 10M, 100M, and 1B tokens). For each pretraining data size, we release the 3 models with the lowest perplexities out of 25 runs (10 runs in the case of 1B tokens). The pretraining data reproduces that of BERT: we combine English Wikipedia and a reproduction of BookCorpus built from Smashwords texts, in a ratio of approximately 3:1.

Hyperparameters and Validation Perplexity

The hyperparameters and validation perplexities corresponding to each model are as follows:

| Model Name | Training Size | Model Size | Max Steps | Batch Size | Validation Perplexity |
|---|---|---|---|---|---|
| roberta-base-1B-1 | 1B | BASE | 100K | 512 | 3.93 |
| roberta-base-1B-2 | 1B | BASE | 31K | 1024 | 4.25 |
| roberta-base-1B-3 | 1B | BASE | 31K | 4096 | 3.84 |
| roberta-base-100M-1 | 100M | BASE | 100K | 512 | 4.99 |
| roberta-base-100M-2 | 100M | BASE | 31K | 1024 | 4.61 |
| roberta-base-100M-3 | 100M | BASE | 31K | 512 | 5.02 |
| roberta-base-10M-1 | 10M | BASE | 10K | 1024 | 11.31 |
| roberta-base-10M-2 | 10M | BASE | 10K | 512 | 10.78 |
| roberta-base-10M-3 | 10M | BASE | 31K | 512 | 11.58 |
| roberta-med-small-1M-1 | 1M | MED-SMALL | 100K | 512 | 153.38 |
| roberta-med-small-1M-2 | 1M | MED-SMALL | 10K | 512 | 134.18 |
| roberta-med-small-1M-3 | 1M | MED-SMALL | 31K | 512 | 139.39 |
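
To compare the perplexities above on a loss scale, the following minimal sketch converts between perplexity and mean cross-entropy. It assumes the common convention that perplexity is the exponential of the mean per-token cross-entropy in nats; the card does not state how the values were computed, and the function names are illustrative.

```python
import math


def perplexity_to_loss(perplexity: float) -> float:
    """Mean cross-entropy (nats per token) implied by a perplexity value.

    Assumes perplexity = exp(mean cross-entropy), which is a convention
    and not something stated in the model card.
    """
    return math.log(perplexity)


def loss_to_perplexity(loss: float) -> float:
    """Inverse of perplexity_to_loss."""
    return math.exp(loss)


if __name__ == "__main__":
    # Lowest validation perplexity listed for each training size.
    best = {"1B": 3.84, "100M": 4.61, "10M": 10.78, "1M": 134.18}
    for size, ppl in best.items():
        print(f"{size}: perplexity {ppl} -> loss {perplexity_to_loss(ppl):.3f} nats")
```

On this scale, the difference between two runs is the difference in average loss per predicted token, which is often easier to compare across training sizes than raw perplexity.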

The hyperparameters corresponding to the model sizes mentioned above are as follows:

| Model Size | L | AH | HS | FFN | P |
|---|---|---|---|---|---|
| BASE | 12 | 12 | 768 | 3072 | 125M |
| MED-SMALL | 6 | 8 | 512 | 2048 | 45M |

(L = number of layers; AH = number of attention heads; HS = hidden size; FFN = feedforward network dimension; P = number of parameters.)
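
The parameter counts can be roughly reproduced from L, HS, and FFN. The sketch below is an estimate, not the authors' accounting: it assumes the standard RoBERTa vocabulary size (50,265), 514 position embeddings, one token-type embedding, and a masked-LM head whose output projection is tied to the token embeddings. None of these values is stated in the card, and the function name is hypothetical.

```python
def estimate_parameters(layers: int, hidden: int, ffn: int,
                        vocab: int = 50265, max_positions: int = 514) -> int:
    """Approximate parameter count of a RoBERTa-style encoder with an MLM head.

    vocab and max_positions are assumed defaults (standard RoBERTa values),
    not figures taken from the model card.
    """
    # Token, position, and token-type embeddings, plus embedding LayerNorm.
    embeddings = vocab * hidden + max_positions * hidden + hidden + 2 * hidden

    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)
    # Feedforward block: two linear layers with biases.
    feedforward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer (weight and bias each).
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feedforward + layer_norms

    # MLM head: dense layer, LayerNorm, and output bias (weights tied).
    lm_head = (hidden * hidden + hidden) + 2 * hidden + vocab

    return embeddings + layers * per_layer + lm_head


if __name__ == "__main__":
    for name, (n_layers, hs, ffn_dim) in {
        "BASE": (12, 768, 3072),
        "MED-SMALL": (6, 512, 2048),
    }.items():
        total = estimate_parameters(n_layers, hs, ffn_dim)
        print(f"{name}: about {total / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land within about 1M of the 125M and 45M figures in the table. Note that both sizes use the same per-head dimension (768 / 12 = 512 / 8 = 64).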

For other hyperparameters, we select:

  • Peak learning rate: 5e-4
  • Warmup steps: 6% of max steps
  • Dropout: 0.1
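
As a worked example of these settings, the sketch below computes the warmup length for each max-steps value used above and evaluates a learning-rate schedule. The linear warmup to the peak follows from the listed values; the linear decay to zero after warmup is an assumption, since the card does not name the decay schedule, and the function names are illustrative.

```python
PEAK_LR = 5e-4          # peak learning rate from the list above
WARMUP_PERCENT = 6      # warmup is 6% of max steps


def warmup_steps(max_steps: int) -> int:
    """Number of warmup steps, using integer arithmetic to avoid rounding drift."""
    return max_steps * WARMUP_PERCENT // 100


def learning_rate(step: int, max_steps: int, peak: float = PEAK_LR) -> float:
    """Learning rate at a given step.

    Linear warmup from 0 to the peak, then linear decay to 0 at max_steps.
    The decay shape is an assumption; only the peak and warmup fraction
    come from the model card.
    """
    warmup = warmup_steps(max_steps)
    if step < warmup:
        return peak * step / warmup
    remaining = max(0, max_steps - step)
    return peak * remaining / (max_steps - warmup)


if __name__ == "__main__":
    for max_steps in (10_000, 31_000, 100_000):
        print(f"max steps {max_steps}: warmup {warmup_steps(max_steps)} steps")
```

For instance, a 100K-step run warms up over the first 6,000 steps and reaches the peak rate of 5e-4 at step 6,000.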