Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Introduction

In this work, we pre-train RoBERTa base on various word-shuffled variants of the BookWiki corpus (16GB). We observe that a model pre-trained on word-shuffled text achieves surprisingly good scores on GLUE, PAWS, and several parametric probing tasks. Please read our paper for more details on the experiments.

Pre-trained models

Model	Description	Download
`roberta.base.orig`	RoBERTa (base) trained on natural corpus	roberta.base.orig.tar.gz
`roberta.base.shuffle.n1`	RoBERTa (base) trained on n=1 gram sentence word shuffled data	roberta.base.shuffle.n1.tar.gz
`roberta.base.shuffle.n2`	RoBERTa (base) trained on n=2 gram sentence word shuffled data	roberta.base.shuffle.n2.tar.gz
`roberta.base.shuffle.n3`	RoBERTa (base) trained on n=3 gram sentence word shuffled data	roberta.base.shuffle.n3.tar.gz
`roberta.base.shuffle.n4`	RoBERTa (base) trained on n=4 gram sentence word shuffled data	roberta.base.shuffle.n4.tar.gz
`roberta.base.shuffle.512`	RoBERTa (base) trained on unigram 512 word block shuffled data	roberta.base.shuffle.512.tar.gz
`roberta.base.shuffle.corpus`	RoBERTa (base) trained on unigram corpus word shuffled data	roberta.base.shuffle.corpus.tar.gz
`roberta.base.shuffle.corpus_uniform`	RoBERTa (base) trained on unigram corpus word shuffled data, where all words are uniformly sampled	roberta.base.shuffle.corpus_uniform.tar.gz
`roberta.base.nopos`	RoBERTa (base) without positional embeddings, trained on natural corpus	roberta.base.nopos.tar.gz
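
To make the "n-gram sentence word shuffled" descriptions above more concrete, here is a minimal sketch of one way such a shuffle could be implemented. It is a simplified illustration under our own assumptions, not the preprocessing script used to build the released corpora: the function name `shuffle_ngrams`, whitespace tokenization, and the plain uniform permutation of chunks are all hypothetical choices. Refer to the paper for the exact sampling procedure.

```python
import random


def shuffle_ngrams(sentence, n=1, seed=None):
    """Shuffle a sentence at the n-gram level (illustrative sketch only).

    The sentence is split on whitespace, cut into consecutive chunks of
    n words, and the chunks are permuted. Word order inside each chunk
    is kept, so larger n preserves more local word order.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    words = sentence.split()
    chunks = [words[i:i + n] for i in range(0, len(words), n)]
    rng.shuffle(chunks)
    return " ".join(word for chunk in chunks for word in chunk)


if __name__ == "__main__":
    text = "the quick brown fox jumps over lazy dogs"
    for n in (1, 2, 4):
        print(n, "->", shuffle_ngrams(text, n=n, seed=0))
```

With n=1 every word can move independently; with n=4 only blocks of four consecutive words are rearranged.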

Results

GLUE (Wang et al., 2019) & PAWS (Zhang et al., 2019) (dev set, single model, single-task fine-tuning, median of 5 seeds)

name	CoLA	MNLI	MRPC	PAWS	QNLI	QQP	RTE	SST-2
`roberta.base.orig`	61.4	86.11	89.19	94.46	92.53	91.26	74.64	93.92
`roberta.base.shuffle.n1`	35.15	82.64	86	89.97	89.02	91.01	69.02	90.47
`roberta.base.shuffle.n2`	54.37	83.43	86.24	93.46	90.44	91.36	70.83	91.79
`roberta.base.shuffle.n3`	48.72	83.85	86.36	94.05	91.69	91.24	70.65	92.02
`roberta.base.shuffle.n4`	58.64	83.77	86.98	94.32	91.69	91.4	70.83	92.48
`roberta.base.shuffle.512`	12.76	77.52	79.61	84.77	85.19	90.2	56.52	86.34
`roberta.base.shuffle.corpus`	0	71.9	70.52	58.52	71.11	85.52	53.99	83.35
`roberta.base.shuffle.corpus_uniform`	9.19	72.33	70.76	58.42	77.76	85.93	53.99	84.04
`roberta.base.nopos`	0	63.5	72.73	57.08	77.72	87.87	54.35	83.24
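
As a quick way to read the table, the sketch below computes the per-task gap between `roberta.base.orig` and `roberta.base.shuffle.n1`. The scores are copied from the two table rows above; the helper name `score_gaps` is our own and is not part of any released code.

```python
# Dev-set scores copied from the results table (same column order).
TASKS = ["CoLA", "MNLI", "MRPC", "PAWS", "QNLI", "QQP", "RTE", "SST-2"]
ORIG = [61.4, 86.11, 89.19, 94.46, 92.53, 91.26, 74.64, 93.92]
SHUFFLE_N1 = [35.15, 82.64, 86.0, 89.97, 89.02, 91.01, 69.02, 90.47]


def score_gaps(baseline, variant, names):
    """Return {task: baseline - variant}, rounded to two decimals."""
    return {
        name: round(b - v, 2)
        for name, b, v in zip(names, baseline, variant)
    }


gaps = score_gaps(ORIG, SHUFFLE_N1, TASKS)

# Print tasks from the largest to the smallest drop.
for task, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:6s} {gap:6.2f}")
```

On these numbers, the largest gap is on CoLA and the smallest is on QQP.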

For more results on probing tasks, please refer to our paper.

Example Usage

Follow the same usage as in RoBERTa to load and test your models:

# Download roberta.base.shuffle.n1 model
wget https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n1.tar.gz
tar -xzvf roberta.base.shuffle.n1.tar.gz

# Load the model in fairseq
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained('/path/to/roberta.base.shuffle.n1', checkpoint_file='model.pt')
roberta.eval()  # disable dropout (or leave in train mode to finetune)

Note: The model trained without positional embeddings (roberta.base.nopos) is a modified RoBERTa model that does not use positional embeddings. The standard from_pretrained method in the fairseq version of RoBERTa therefore cannot load its weights. To load it, construct a new RobertaModel object with the flag use_positional_embeddings set to False (or, in the latest code, no_token_positional_embeddings set to True), and then load the individual weights.

Fine-tuning Evaluation

We provide models fine-tuned on MNLI for the pre-trained models above for quick evaluation (1 seed for each model). Please refer to finetuning details for the hyperparameters of these models. Follow the RoBERTa instructions to evaluate them.

Model	MNLI-m Dev Accuracy	Link
`roberta.base.orig.mnli`	86.14	Download
`roberta.base.shuffle.n1.mnli`	82.55	Download
`roberta.base.shuffle.n2.mnli`	83.21	Download
`roberta.base.shuffle.n3.mnli`	83.89	Download
`roberta.base.shuffle.n4.mnli`	84.00	Download
`roberta.base.shuffle.512.mnli`	77.22	Download
`roberta.base.shuffle.corpus.mnli`	71.88	Download
`roberta.base.shuffle.corpus_uniform.mnli`	72.46	Download

Citation

@misc{sinha2021masked,
      title={Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little},
      author={Koustuv Sinha and Robin Jia and Dieuwke Hupkes and Joelle Pineau and Adina Williams and Douwe Kiela},
      year={2021},
      eprint={2104.06644},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}