---
license: mit
datasets:
  - wikipedia
  - bookcorpus
language:
  - en
metrics:
  - glue
library_name: transformers
---

This is our reproduction of a medium-sized RoBERTa model using the official HuggingFace `roberta` architecture. Architecturally, RoBERTa is identical to BERT except for its larger vocabulary size.
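
As a quick sanity check, the checkpoint can be loaded for masked-token prediction with the `transformers` pipeline. The repo id below is an assumption (`JackBAI/roberta-medium`); substitute the actual Hub id if it differs.

```python
from transformers import pipeline

# Assumed repo id; replace with the actual Hub id of this checkpoint.
fill_mask = pipeline("fill-mask", model="JackBAI/roberta-medium")

# RoBERTa uses "<mask>" as its mask token.
print(fill_mask("The capital of France is <mask>."))
```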

According to Google's BERT releases, BERT-Medium uses a configuration of Layers=8, Hidden=512, #AttnHeads=8, and IntermediateSize=2048. We follow this configuration to pre-train a RoBERTa-medium model for our reproduction.
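
A minimal sketch of this configuration using `RobertaConfig`; the vocabulary size (standard RoBERTa BPE) and any fields not listed above are assumptions rather than the exact values used for this checkpoint.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Medium-sized configuration following BERT-Medium's dimensions.
config = RobertaConfig(
    vocab_size=50265,        # standard RoBERTa BPE vocab (assumption)
    num_hidden_layers=8,     # Layer = 8
    hidden_size=512,         # Hidden = 512
    num_attention_heads=8,   # #AttnHeads = 8
    intermediate_size=2048,  # IntermediateSize = 2048
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```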

We use the same datasets as BERT (English Wikipedia and BookCorpus) and pre-train for 30k steps with a batch size of 8,192. We have also released our reproduction of this pretraining corpus on HuggingFace.
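
For reference, the two corpora can be pulled from the Hub roughly as follows; the exact Wikipedia dump/config names are assumptions and may differ from the snapshot we used.

```python
from datasets import load_dataset, concatenate_datasets

# Dump/config names are placeholders; newer `datasets` versions may
# require trust_remote_code=True for these loaders.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

# Keep only the raw text column so the two corpora can be concatenated.
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])
corpus = concatenate_datasets([wiki, books])
print(corpus)
```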

We use DeepSpeed ZeRO-2 for training efficiency.
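
An illustrative ZeRO stage-2 configuration (not our actual file); the batch-size fields are placeholders and should be chosen so that micro-batch size × gradient-accumulation steps × number of GPUs equals 8,192.

```python
# Illustrative DeepSpeed ZeRO-2 config, e.g. passed to the HF Trainer
# via TrainingArguments(deepspeed=ds_config).
ds_config = {
    "train_micro_batch_size_per_gpu": 32,   # placeholder
    "gradient_accumulation_steps": 8,       # placeholder
    "zero_optimization": {
        "stage": 2,                  # shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": True},
}
```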

Other training configuration:

| Parameter | Value |
|---|---|
| WARMUP_STEPS | 1800 |
| LR_DECAY | linear |
| ADAM_EPS | 1e-6 |
| ADAM_BETA1 | 0.9 |
| ADAM_BETA2 | 0.98 |
| ADAM_WEIGHT_DECAY | 0.01 |
| PEAK_LR | 1e-3 |
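
Continuing from the config sketch above, the optimizer and schedule implied by this table can be written as follows; the use of `torch.optim.AdamW` and `get_linear_schedule_with_warmup` is an assumption about the exact classes used.

```python
import torch
from transformers import get_linear_schedule_with_warmup

TOTAL_STEPS = 30_000   # pretraining steps
WARMUP_STEPS = 1_800   # WARMUP_STEPS

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,             # PEAK_LR
    betas=(0.9, 0.98),   # ADAM_BETA1, ADAM_BETA2
    eps=1e-6,            # ADAM_EPS
    weight_decay=0.01,   # ADAM_WEIGHT_DECAY
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=WARMUP_STEPS,
    num_training_steps=TOTAL_STEPS,  # linear decay to zero (LR_DECAY)
)
```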

We achieve performance very similar to the official BERT-Medium release on GLUE:

| Model | MRPC-F1 | STS-B-Pearson | SST-2-Acc | QQP-F1 | MNLI-m | MNLI-mm | QNLI-Acc | WNLI-Acc | RTE-Acc |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa-medium (ours) | 83.6 | 82.7 | 89.7 | 89.0 | 79.7 | 80.1 | 89.3 | 31.0 | 57.4 |
| BERT-medium | 86.3 | 87.7 | 88.9 | 89.4 | 80.6 | 81.0 | 89.2 | 29.6 | 63.9 |
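
To reproduce a single GLUE number, a standard fine-tuning recipe with the HF `Trainer` looks roughly like this (MRPC shown); the repo id and the fine-tuning hyperparameters are assumptions, not the settings used for the table above.

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "JackBAI/roberta-medium"  # assumed repo id

raw = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def preprocess(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

encoded = raw.map(preprocess, batched=True)
metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return metric.compute(predictions=np.argmax(logits, axis=-1), references=labels)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
args = TrainingArguments(
    output_dir="mrpc-finetune",
    learning_rate=2e-5,                 # assumed fine-tuning LR
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports F1 and accuracy for MRPC
```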

Evaluation score curve (average of the GLUE scores) during pretraining:

*(figure: average GLUE score over pretraining steps)*

For both of the results above we do not report CoLA scores, as CoLA is quite unstable during pretraining. The raw CoLA scores are:

| Step | 1500 | 3000 | 6000 | 9000 | 13500 | 18000 | 24000 | 30000 |
|---|---|---|---|---|---|---|---|---|
| CoLA | 1.7 | 13.5 | 29.2 | 31.4 | 31.1 | 24.1 | 29.0 | 20.0 |