Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

https://arxiv.org/abs/2104.06644

Introduction

In this work, we pre-train RoBERTa (base) on several word-shuffled variants of the BookWiki corpus (16GB). We observe that a model pre-trained on word-shuffled text achieves surprisingly good scores on GLUE, PAWS, and several parametric probing tasks. Please read our paper for more details on the experiments.

Pre-trained models

| Model | Description | Download |
|---|---|---|
| roberta.base.orig | RoBERTa (base) trained on natural corpus | roberta.base.orig.tar.gz |
| roberta.base.shuffle.n1 | RoBERTa (base) trained on n=1 gram sentence word shuffled data | roberta.base.shuffle.n1.tar.gz |
| roberta.base.shuffle.n2 | RoBERTa (base) trained on n=2 gram sentence word shuffled data | roberta.base.shuffle.n2.tar.gz |
| roberta.base.shuffle.n3 | RoBERTa (base) trained on n=3 gram sentence word shuffled data | roberta.base.shuffle.n3.tar.gz |
| roberta.base.shuffle.n4 | RoBERTa (base) trained on n=4 gram sentence word shuffled data | roberta.base.shuffle.n4.tar.gz |
| roberta.base.shuffle.512 | RoBERTa (base) trained on unigram 512 word block shuffled data | roberta.base.shuffle.512.tar.gz |
| roberta.base.shuffle.corpus | RoBERTa (base) trained on unigram corpus word shuffled data | roberta.base.shuffle.corpus.tar.gz |
| roberta.base.shuffle.corpus_uniform | RoBERTa (base) trained on unigram corpus word shuffled data, where all words are uniformly sampled | roberta.base.shuffle.corpus_uniform.tar.gz |
| roberta.base.nopos | RoBERTa (base) without positional embeddings, trained on natural corpus | roberta.base.nopos.tar.gz |
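To make the "n-gram sentence word shuffled" descriptions concrete, here is a minimal sketch of shuffling a sentence at n-gram granularity. It is an illustration only, not the preprocessing code used for the paper: the function name `shuffle_ngrams`, the whitespace tokenization, and the choice to cut the sentence into fixed consecutive chunks of `n` words are all assumptions made for this example. Please refer to the paper for the exact procedure.

```python
import random


def shuffle_ngrams(sentence, n, seed=None):
    """Shuffle a sentence at n-gram granularity (illustrative sketch).

    The sentence is split on whitespace, grouped into consecutive chunks of
    n words (the last chunk may be shorter), and the chunks are permuted.
    Words inside a chunk keep their original relative order.
    """
    rng = random.Random(seed)
    words = sentence.split()
    chunks = [words[i:i + n] for i in range(0, len(words), n)]
    rng.shuffle(chunks)
    return " ".join(word for chunk in chunks for word in chunk)


original = "the quick brown fox jumps over the lazy dog"
shuffled_n1 = shuffle_ngrams(original, n=1, seed=0)  # every word moves independently
shuffled_n3 = shuffle_ngrams(original, n=3, seed=0)  # 3-word chunks stay intact
print(shuffled_n1)
print(shuffled_n3)
```

With `n=1` only the bag of words in the sentence is preserved, while larger `n` keeps progressively more local word order, which mirrors the n1 to n4 progression in the table.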

Results

GLUE (Wang et al., 2019) and PAWS (Zhang et al., 2019) results (dev set, single model, single-task fine-tuning, median of 5 seeds):

| name | CoLA | MNLI | MRPC | PAWS | QNLI | QQP | RTE | SST-2 |
|---|---|---|---|---|---|---|---|---|
| roberta.base.orig | 61.4 | 86.11 | 89.19 | 94.46 | 92.53 | 91.26 | 74.64 | 93.92 |
| roberta.base.shuffle.n1 | 35.15 | 82.64 | 86 | 89.97 | 89.02 | 91.01 | 69.02 | 90.47 |
| roberta.base.shuffle.n2 | 54.37 | 83.43 | 86.24 | 93.46 | 90.44 | 91.36 | 70.83 | 91.79 |
| roberta.base.shuffle.n3 | 48.72 | 83.85 | 86.36 | 94.05 | 91.69 | 91.24 | 70.65 | 92.02 |
| roberta.base.shuffle.n4 | 58.64 | 83.77 | 86.98 | 94.32 | 91.69 | 91.4 | 70.83 | 92.48 |
| roberta.base.shuffle.512 | 12.76 | 77.52 | 79.61 | 84.77 | 85.19 | 90.2 | 56.52 | 86.34 |
| roberta.base.shuffle.corpus | 0 | 71.9 | 70.52 | 58.52 | 71.11 | 85.52 | 53.99 | 83.35 |
| roberta.base.shuffle.corpus_uniform | 9.19 | 72.33 | 70.76 | 58.42 | 77.76 | 85.93 | 53.99 | 84.04 |
| roberta.base.nopos | 0 | 63.5 | 72.73 | 57.08 | 77.72 | 87.87 | 54.35 | 83.24 |
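As a small worked example of how to read these numbers, the sketch below computes the per-task gap between a shuffled model and the natural-order baseline. The scores are copied from the table above; the helper name `gap_to_baseline` and the data layout are choices made for this illustration only.

```python
# Dev-set scores copied from the results table above.
TASKS = ["CoLA", "MNLI", "MRPC", "PAWS", "QNLI", "QQP", "RTE", "SST-2"]
SCORES = {
    "roberta.base.orig": [61.4, 86.11, 89.19, 94.46, 92.53, 91.26, 74.64, 93.92],
    "roberta.base.shuffle.n1": [35.15, 82.64, 86, 89.97, 89.02, 91.01, 69.02, 90.47],
    "roberta.base.shuffle.n4": [58.64, 83.77, 86.98, 94.32, 91.69, 91.4, 70.83, 92.48],
}


def gap_to_baseline(model, baseline="roberta.base.orig"):
    """Return {task: baseline score minus model score}, rounded to 2 decimals."""
    return {
        task: round(base - score, 2)
        for task, base, score in zip(TASKS, SCORES[baseline], SCORES[model])
    }


n1_gap = gap_to_baseline("roberta.base.shuffle.n1")
largest = max(n1_gap, key=n1_gap.get)   # task hurt most by unigram shuffling
smallest = min(n1_gap, key=n1_gap.get)  # task hurt least by unigram shuffling
print(n1_gap)
print(largest, smallest)
```

For roberta.base.shuffle.n1, the largest gap is on CoLA (26.25 points) and the smallest is on QQP (0.25 points); the MNLI gap is 3.47 points.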

For more results on probing tasks, please refer to our paper.

Example Usage

Follow the same usage as in RoBERTa to load and test your models:

# Download roberta.base.shuffle.n1 model
wget https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n1.tar.gz
tar -xzvf roberta.base.shuffle.n1.tar.gz

# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.base.shuffle.n1', checkpoint_file='model.pt')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
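To fetch a different model from the table, the download step can be parameterized by model name. The helper below is a sketch that assumes every archive lives under the same base URL as the roberta.base.shuffle.n1 example and is named `<model>.tar.gz`; only the n1 URL is given explicitly in this README, so treat the other URLs as unverified. The names `BASE_URL`, `MODEL_NAMES`, and `archive_url` are introduced here for illustration.

```python
# Base URL taken from the roberta.base.shuffle.n1 download command above.
# ASSUMPTION: the other archives follow the same naming pattern.
BASE_URL = "https://dl.fbaipublicfiles.com/unnatural_pretraining"

# Model names as listed in the pre-trained models table.
MODEL_NAMES = [
    "roberta.base.orig",
    "roberta.base.shuffle.n1",
    "roberta.base.shuffle.n2",
    "roberta.base.shuffle.n3",
    "roberta.base.shuffle.n4",
    "roberta.base.shuffle.512",
    "roberta.base.shuffle.corpus",
    "roberta.base.shuffle.corpus_uniform",
    "roberta.base.nopos",
]


def archive_url(model_name):
    """Build the (assumed) download URL for a model from the table."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model: {model_name}")
    return f"{BASE_URL}/{model_name}.tar.gz"


print(archive_url("roberta.base.shuffle.n1"))
```

The resulting URL can be passed to `wget` and the archive unpacked with `tar -xzvf`, as in the commands above.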

Note: The model trained without positional embeddings (roberta.base.nopos) is a modified RoBERTa model in which the positional embeddings are not used. As a result, the standard from_pretrained method of the fairseq RobertaModel cannot load its weights. To load them, construct a new RobertaModel object with the flag use_positional_embeddings set to False (or, in the latest code, no_token_positional_embeddings set to True), and then load the individual weights.

Fine-tuning Evaluation

For quick evaluation, we provide MNLI fine-tuned checkpoints for the models listed below (1 seed per model). Please refer to the finetuning details for the hyperparameters of these models, and follow the RoBERTa instructions to evaluate them.

| Model | MNLI-m Dev Accuracy | Link |
|---|---|---|
| roberta.base.orig.mnli | 86.14 | Download |
| roberta.base.shuffle.n1.mnli | 82.55 | Download |
| roberta.base.shuffle.n2.mnli | 83.21 | Download |
| roberta.base.shuffle.n3.mnli | 83.89 | Download |
| roberta.base.shuffle.n4.mnli | 84.00 | Download |
| roberta.base.shuffle.512.mnli | 77.22 | Download |
| roberta.base.shuffle.corpus.mnli | 71.88 | Download |
| roberta.base.shuffle.corpus_uniform.mnli | 72.46 | Download |

Citation

@misc{sinha2021masked,
      title={Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little},
      author={Koustuv Sinha and Robin Jia and Dieuwke Hupkes and Joelle Pineau and Adina Williams and Douwe Kiela},
      year={2021},
      eprint={2104.06644},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}