
This is our reproduction of RoBERTa pre-training at a medium model size, using the official Hugging Face `roberta` architecture. Architecturally, RoBERTa is identical to BERT except for its larger vocabulary size.
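As a quick usage sketch (assuming the checkpoint is published under this repository id, `JackBAI/roberta-medium`, and keeps the standard masked-language-modeling head from pre-training):

```python
from transformers import pipeline

# Minimal sketch: load this checkpoint for masked-token prediction.
# RoBERTa tokenizers use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model="JackBAI/roberta-medium")
print(fill_mask("The capital of France is <mask>."))
```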

According to Google's BERT releases, BERT-Medium uses a configuration of Layers=8, Hidden=512, #AttnHeads=8, and IntermediateSize=2048. We follow this configuration to pre-train a RoBERTa-medium model for this reproduction.
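A minimal sketch of this configuration with `RobertaConfig` from `transformers` (the vocabulary size below is RoBERTa's default and is an assumption, not a value stated in this card):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Medium-size shape following BERT-Medium: 8 layers, hidden size 512,
# 8 attention heads, intermediate (FFN) size 2048.
config = RobertaConfig(
    num_hidden_layers=8,
    hidden_size=512,
    num_attention_heads=8,
    intermediate_size=2048,
    vocab_size=50265,  # assumed RoBERTa BPE vocabulary size
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```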

We use the same datasets as BERT (English Wikipedia and BookCorpus) and pre-train for 30k steps with a batch size of 8,192. We have also released our reproduction of this dataset on Hugging Face.
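A rough sketch of pulling the two corpora from the Hugging Face Hub (the dataset ids and Wikipedia snapshot below are assumptions; the released reproduction dataset mentioned above may differ):

```python
from datasets import load_dataset

# Assumed dataset ids and snapshot; the actual pre-training corpus may
# have been built from a different dump or released under another name.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
books = load_dataset("bookcorpus", split="train")
print(len(wiki), len(books))
```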

We use DeepSpeed ZeRO-2 to optimize training efficiency.
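A minimal sketch of what a ZeRO-2 DeepSpeed configuration can look like (the micro-batch size and mixed-precision settings here are assumptions, not the configuration actually used):

```python
import json

# Hypothetical DeepSpeed ZeRO stage-2 config; only the global batch size
# (8,192) comes from this card, the rest are placeholder assumptions.
ds_config = {
    "train_batch_size": 8192,              # global batch size
    "train_micro_batch_size_per_gpu": 32,  # assumption; depends on GPU memory
    "zero_optimization": {
        "stage": 2,            # shard optimizer states and gradients
        "overlap_comm": True,  # overlap gradient reduction with the backward pass
    },
    "fp16": {"enabled": True},  # assumption
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```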

Other training configuration:

| Parameter | Value |
|---|---|
| WARMUP_STEPS | 1800 |
| LR_DECAY | linear |
| ADAM_EPS | 1e-6 |
| ADAM_BETA1 | 0.9 |
| ADAM_BETA2 | 0.98 |
| ADAM_WEIGHT_DECAY | 0.01 |
| PEAK_LR | 1e-3 |
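These settings map onto a standard AdamW optimizer plus a linear warmup-decay schedule, e.g. (a sketch; `model` stands for the RoBERTa-medium model, and the 30k total steps come from the description above):

```python
import torch
from transformers import get_linear_schedule_with_warmup

# AdamW with the hyperparameters from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # PEAK_LR
    betas=(0.9, 0.98),  # ADAM_BETA1, ADAM_BETA2
    eps=1e-6,           # ADAM_EPS
    weight_decay=0.01,  # ADAM_WEIGHT_DECAY
)

# Linear warmup for 1,800 steps, then linear decay over the 30k total steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1800, num_training_steps=30_000
)
```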

We achieve performance very similar to the official BERT-Medium release on GLUE:

| Model | MRPC-F1 | STS-B-Pearson | SST-2-Acc | QQP-F1 | MNLI-m | MNLI-mm | QNLI-Acc | WNLI-Acc | RTE-Acc |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa-medium (ours) | 83.6 | 82.7 | 89.7 | 89.0 | 79.7 | 80.1 | 89.3 | 31.0 | 57.4 |
| BERT-medium | 86.3 | 87.7 | 88.9 | 89.4 | 80.6 | 81.0 | 89.2 | 29.6 | 63.9 |
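For reference, a minimal sketch of fine-tuning this checkpoint on one GLUE task (MRPC) with the standard `transformers`/`datasets` tooling; the fine-tuning hyperparameters below are placeholders, not the settings used to produce the table above:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("JackBAI/roberta-medium")
model = AutoModelForSequenceClassification.from_pretrained(
    "JackBAI/roberta-medium", num_labels=2
)

raw = load_dataset("glue", "mrpc")
encoded = raw.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True),
    batched=True,
)

# Placeholder fine-tuning hyperparameters; metrics such as F1 would need a
# compute_metrics function, omitted here for brevity.
args = TrainingArguments(
    output_dir="mrpc-finetune",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())
```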

Evaluation score curve (average of GLUE scores) during pre-training:

*(figure: average evaluation score vs. pre-training step)*

For both sets of results above we do not report CoLA, as its score is quite unstable during pre-training. The raw CoLA scores are:

| Step | 1500 | 3000 | 6000 | 9000 | 13500 | 18000 | 24000 | 30000 |
|---|---|---|---|---|---|---|---|---|
| CoLA | 1.7 | 13.5 | 29.2 | 31.4 | 31.1 | 24.1 | 29.0 | 20.0 |
