|
# Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little |
|
|
|
[https://arxiv.org/abs/2104.06644](https://arxiv.org/abs/2104.06644) |
|
|
|
## Introduction |
|
|
|
In this work, we pre-train [RoBERTa](../roberta) base on several word-shuffled variants of the BookWiki corpus (16GB). We observe that models pre-trained on word-shuffled data achieve surprisingly good scores on GLUE, PAWS, and several parametric probing tasks. Please refer to [our paper](https://arxiv.org/abs/2104.06644) for details on the experiments.
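
To make the corruption concrete, here is a minimal sketch of sentence-level n-gram shuffling (our own illustration, not the paper's exact preprocessing code): the sentence is split into non-overlapping n-grams left to right, and the n-grams are then permuted, so `n=1` reduces to a full word shuffle.

```python
import random

def shuffle_ngrams(sentence: str, n: int = 1, seed: int = 0) -> str:
    """Shuffle a sentence at the n-gram level (n=1 shuffles individual words)."""
    words = sentence.split()
    # Group consecutive words into non-overlapping n-grams, then permute the groups.
    ngrams = [words[i:i + n] for i in range(0, len(words), n)]
    random.Random(seed).shuffle(ngrams)
    return " ".join(word for gram in ngrams for word in gram)

print(shuffle_ngrams("the quick brown fox jumps over the lazy dog", n=2))
# e.g. "jumps over brown fox the lazy the quick dog"
```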
|
|
|
## Pre-trained models |
|
|
|
| Model                                 | Description                                                                                 | Download |
| ------------------------------------- | ------------------------------------------------------------------------------------------- | -------- |
| `roberta.base.orig`                   | RoBERTa (base) trained on the natural corpus                                                 | [roberta.base.orig.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.orig.tar.gz) |
| `roberta.base.shuffle.n1`             | RoBERTa (base) trained on data shuffled within sentences as n=1 grams (a full word shuffle)  | [roberta.base.shuffle.n1.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n1.tar.gz) |
| `roberta.base.shuffle.n2`             | RoBERTa (base) trained on data shuffled within sentences as n=2 grams                        | [roberta.base.shuffle.n2.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n2.tar.gz) |
| `roberta.base.shuffle.n3`             | RoBERTa (base) trained on data shuffled within sentences as n=3 grams                        | [roberta.base.shuffle.n3.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n3.tar.gz) |
| `roberta.base.shuffle.n4`             | RoBERTa (base) trained on data shuffled within sentences as n=4 grams                        | [roberta.base.shuffle.n4.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n4.tar.gz) |
| `roberta.base.shuffle.512`            | RoBERTa (base) trained on data unigram-shuffled within 512-word blocks                       | [roberta.base.shuffle.512.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.512.tar.gz) |
| `roberta.base.shuffle.corpus`         | RoBERTa (base) trained on data unigram-shuffled across the entire corpus                     | [roberta.base.shuffle.corpus.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.corpus.tar.gz) |
| `roberta.base.shuffle.corpus_uniform` | RoBERTa (base) trained on corpus-level word-shuffled data, with all words sampled uniformly  | [roberta.base.shuffle.corpus_uniform.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.corpus_uniform.tar.gz) |
| `roberta.base.nopos`                  | RoBERTa (base) without positional embeddings, trained on the natural corpus                  | [roberta.base.nopos.tar.gz](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.nopos.tar.gz) |
|
|
|
## Results |
|
|
|
[GLUE (Wang et al., 2019)](https://gluebenchmark.com/) & [PAWS (Zhang et al., 2019)](https://github.com/google-research-datasets/paws) _(dev set, single model, single-task fine-tuning, median of 5 seeds)_
|
|
|
| Model                                 |  CoLA |  MNLI |  MRPC |  PAWS |  QNLI |   QQP |   RTE | SST-2 |
| :------------------------------------ | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| `roberta.base.orig`                   | 61.40 | 86.11 | 89.19 | 94.46 | 92.53 | 91.26 | 74.64 | 93.92 |
| `roberta.base.shuffle.n1`             | 35.15 | 82.64 | 86.00 | 89.97 | 89.02 | 91.01 | 69.02 | 90.47 |
| `roberta.base.shuffle.n2`             | 54.37 | 83.43 | 86.24 | 93.46 | 90.44 | 91.36 | 70.83 | 91.79 |
| `roberta.base.shuffle.n3`             | 48.72 | 83.85 | 86.36 | 94.05 | 91.69 | 91.24 | 70.65 | 92.02 |
| `roberta.base.shuffle.n4`             | 58.64 | 83.77 | 86.98 | 94.32 | 91.69 | 91.40 | 70.83 | 92.48 |
| `roberta.base.shuffle.512`            | 12.76 | 77.52 | 79.61 | 84.77 | 85.19 | 90.20 | 56.52 | 86.34 |
| `roberta.base.shuffle.corpus`         |  0.00 | 71.90 | 70.52 | 58.52 | 71.11 | 85.52 | 53.99 | 83.35 |
| `roberta.base.shuffle.corpus_uniform` |  9.19 | 72.33 | 70.76 | 58.42 | 77.76 | 85.93 | 53.99 | 84.04 |
| `roberta.base.nopos`                  |  0.00 | 63.50 | 72.73 | 57.08 | 77.72 | 87.87 | 54.35 | 83.24 |
|
|
|
For more results on probing tasks, please refer to [our paper](https://arxiv.org/abs/2104.06644). |
|
|
|
## Example Usage |
|
|
|
Follow the same usage as in [RoBERTa](https://github.com/pytorch/fairseq/tree/main/examples/roberta) to load and test your models: |
|
|
|
```bash
# Download the roberta.base.shuffle.n1 model
wget https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n1.tar.gz
tar -xzvf roberta.base.shuffle.n1.tar.gz

# Copy the GPT-2 BPE dictionary files into the extracted model directory
cd roberta.base.shuffle.n1
wget -O dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
wget -O encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
cd ..
```

```python
# Load the model in fairseq
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained('/path/to/roberta.base.shuffle.n1', checkpoint_file='model.pt')
roberta.eval()  # disable dropout (or leave in train mode to finetune)
```
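
Once loaded, the checkpoint exposes the standard fairseq RoBERTa interface. As a quick sanity check (following the upstream RoBERTa README; the sample sentence is our own):

```python
tokens = roberta.encode('Hello world!')  # apply GPT-2 BPE and add special tokens
assert tokens.tolist() == [0, 31414, 232, 328, 2]

features = roberta.extract_features(tokens)  # last-layer hidden states
print(features.shape)  # torch.Size([1, 5, 768])
```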
|
|
|
We also provide a [Google Colab](https://colab.research.google.com/drive/1IJDVfNVWdvRfLjphQKBGzmob84t-OXpm) notebook demonstrating how to load these models. The models were trained with fairseq at commit [62cff008ebeeed855093837507d5e6bf52065ee6](https://github.com/pytorch/fairseq/commit/62cff008ebeeed855093837507d5e6bf52065ee6).
|
|
|
**Note**: The model trained without positional embeddings (`roberta.base.nopos`) is a modified RoBERTa model in which the positional embeddings are not used. As a result, the typical `from_pretrained` method on the fairseq version of RoBERTa will not load the above model weights. To load them, construct a new `RobertaModel` object with the flag `use_positional_embeddings` set to `False` (or, [in the latest code](https://github.com/pytorch/fairseq/blob/main/fairseq/models/roberta/model.py#L543), with `no_token_positional_embeddings` set to `True`), and then load the individual weights.
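
As a hedged sketch of one way to do this (an assumption on our part, not the authors' loading code): upstream fairseq forwards extra keyword arguments of `from_pretrained` as overrides for the args stored in the checkpoint, so the flag can be set before the weights are loaded.

```python
from fairseq.models.roberta import RobertaModel

# Sketch: assumes extra kwargs are applied as checkpoint argument overrides,
# so fairseq builds the model without positional embeddings and the weights
# then load cleanly.
roberta = RobertaModel.from_pretrained(
    '/path/to/roberta.base.nopos',
    checkpoint_file='model.pt',
    no_token_positional_embeddings=True,
)
roberta.eval()
```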
|
|
|
## Fine-tuning Evaluation |
|
|
|
For quick evaluation, we provide MNLI fine-tuned checkpoints for each of the models above (one seed per model). Please refer to the [fine-tuning details](README.finetuning.md) for the hyperparameters of these models. Follow the [RoBERTa](https://github.com/pytorch/fairseq/tree/main/examples/roberta) instructions to evaluate them; a sketch of the evaluation loop follows the table below.
|
|
|
| Model                                      | MNLI-m Dev Accuracy | Download |
| :----------------------------------------- | :------------------ | :------- |
| `roberta.base.orig.mnli`                   | 86.14               | [Download](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.orig.mnli.tar.gz) |
| `roberta.base.shuffle.n1.mnli`             | 82.55               | [Download](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n1.mnli.tar.gz) |
| `roberta.base.shuffle.n2.mnli`             | 83.21               | [Download](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n2.mnli.tar.gz) |
| `roberta.base.shuffle.n3.mnli`             | 83.89               | [Download](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n3.mnli.tar.gz) |
| `roberta.base.shuffle.n4.mnli`             | 84.00               | [Download](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.n4.mnli.tar.gz) |
| `roberta.base.shuffle.512.mnli`            | 77.22               | [Download](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.512.mnli.tar.gz) |
| `roberta.base.shuffle.corpus.mnli`         | 71.88               | [Download](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.corpus.mnli.tar.gz) |
| `roberta.base.shuffle.corpus_uniform.mnli` | 72.46               | [Download](https://dl.fbaipublicfiles.com/unnatural_pretraining/roberta.base.shuffle.corpus_uniform.mnli.tar.gz) |
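
The sketch below adapts the GLUE evaluation loop from the upstream RoBERTa README to MNLI. The `MNLI-bin` data path, the `model.pt` checkpoint name, and the `dev_matched.tsv` column indices are assumptions; adjust them to your preprocessing.

```python
from fairseq.models.roberta import RobertaModel

# Assumes MNLI was preprocessed into 'MNLI-bin' per the fairseq GLUE instructions.
roberta = RobertaModel.from_pretrained(
    'roberta.base.orig.mnli',
    checkpoint_file='model.pt',
    data_name_or_path='MNLI-bin',
)

label_fn = lambda label: roberta.task.label_dictionary.string(
    [label + roberta.task.label_dictionary.nspecial]
)
ncorrect, nsamples = 0, 0
roberta.cuda()
roberta.eval()
with open('glue_data/MNLI/dev_matched.tsv') as fin:
    fin.readline()  # skip the header row
    for line in fin:
        cols = line.strip().split('\t')
        # Assumed columns: sentence1 (8), sentence2 (9), gold_label (last).
        sent1, sent2, target = cols[8], cols[9], cols[-1]
        tokens = roberta.encode(sent1, sent2)
        pred = roberta.predict('sentence_classification_head', tokens).argmax().item()
        ncorrect += int(label_fn(pred) == target)
        nsamples += 1
print('| Accuracy:', float(ncorrect) / float(nsamples))
```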
|
|
|
## Citation |
|
|
|
```bibtex
@misc{sinha2021masked,
    title={Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little},
    author={Koustuv Sinha and Robin Jia and Dieuwke Hupkes and Joelle Pineau and Adina Williams and Douwe Kiela},
    year={2021},
    eprint={2104.06644},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
|
|
|
## Contact |
|
|
|
For questions and comments, please reach out to Koustuv Sinha (koustuv.sinha@mail.mcgill.ca). |
|
|