language: en
license: bsd
datasets:
- bookcorpus
- wikipedia

SqueezeBERT pretrained model

This model, squeezebert-mnli-headless, has been pretrained for the English language using masked language modeling (MLM) and Sentence Order Prediction (SOP) objectives, and then finetuned on the Multi-Genre Natural Language Inference (MNLI) dataset. This is a "headless" model: the final classification layer has been removed, so Transformers will automatically reinitialize that layer before you begin finetuning on your data. SqueezeBERT was introduced in the paper "SqueezeBERT: What can computer vision teach NLP about efficient neural networks?" (arXiv:2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with grouped convolutions. The authors found that SqueezeBERT is 4.3x faster than bert-base-uncased on a Google Pixel 3 smartphone.
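
To make the architectural change concrete, the sketch below counts the weights in a pointwise fully-connected layer and in a grouped convolution of kernel size 1 with the same input and output widths. The hidden size of 768 matches BERT-base; the group count of 4 and the function names are assumptions chosen for illustration, not values stated in this card.

```python
def pointwise_fc_weights(c_in: int, c_out: int) -> int:
    """Weights in a position-wise fully-connected layer (bias excluded)."""
    return c_in * c_out


def grouped_conv1x1_weights(c_in: int, c_out: int, groups: int) -> int:
    """Weights in a kernel-size-1 convolution split into `groups` groups.

    Each group maps c_in/groups input channels to c_out/groups output channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return groups * (c_in // groups) * (c_out // groups)


hidden = 768  # BERT-base hidden size
groups = 4    # illustrative assumption, not taken from this card

dense = pointwise_fc_weights(hidden, hidden)
grouped = grouped_conv1x1_weights(hidden, hidden, groups)
print(dense, grouped, dense // grouped)  # 589824 147456 4
```

With g groups, the weight count of such a layer drops by a factor of g, which is where the savings over a dense pointwise layer come from.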

Pretraining

Pretraining data

BookCorpus, a dataset consisting of thousands of unpublished books
English Wikipedia

Pretraining procedure

The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks. (Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)

From the SqueezeBERT paper:

We pretrain SqueezeBERT from scratch (without distillation) using the LAMB optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.
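
The quoted hyperparameters translate into concrete step counts. The sketch below computes the number of warmup steps for the first phase and a linear ramp of the learning rate. Applying the warmup proportion to the 56k-step phase alone and the linear shape of the ramp are both assumptions made for illustration, since the quote specifies neither. The schedule after warmup is not modeled; the function simply clamps at the peak value.

```python
PEAK_LR = 2.5e-3          # learning rate from the quote
WARMUP_PROPORTION = 0.28  # warmup proportion from the quote
PHASE1_STEPS = 56_000     # steps at maximum sequence length 128


def warmup_steps(total_steps: int, proportion: float) -> int:
    """Number of warmup steps, rounded to the nearest integer."""
    return round(total_steps * proportion)


def warmup_lr(step: int, n_warmup: int, peak_lr: float) -> float:
    """Linear ramp from 0 to peak_lr (assumed shape); clamped at peak_lr afterwards."""
    return peak_lr * (min(step, n_warmup) / n_warmup)


n_warmup = warmup_steps(PHASE1_STEPS, WARMUP_PROPORTION)
print(n_warmup)  # 15680
print(warmup_lr(n_warmup // 2, n_warmup, PEAK_LR))  # half of the peak rate
```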

Finetuning

The SqueezeBERT paper presents 2 approaches to finetuning the model:

"finetuning without bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on each GLUE task
"finetuning with bells and whistles" -- after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.
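
Because distillation is the distinguishing ingredient of the second approach, here is a minimal sketch of one common soft-label distillation loss: a weighted sum of cross-entropy against the ground-truth label and cross-entropy against the teacher's softened output distribution. This is a generic formulation and not necessarily the exact loss used in the SqueezeBERT paper. The `alpha` weight, the `temperature`, the logits, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Generic distillation loss (assumed form, see the lead-in)."""
    # Hard-label term: cross-entropy against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: cross-entropy against the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    return alpha * hard + (1.0 - alpha) * soft


teacher = [2.0, 0.5, -1.0]  # made-up logits for a 3-class task such as MNLI
agree = distillation_loss([1.8, 0.4, -0.9], teacher, label=0)
disagree = distillation_loss([-1.0, 0.5, 2.0], teacher, label=0)
print(agree < disagree)  # True: a student that matches the teacher scores lower
```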

A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the SqueezeBERT paper. Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola - forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.

This model, squeezebert/squeezebert-mnli-headless, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model. In this particular model, we have removed the final classification layer -- in other words, it is "headless." We recommend using this model if you intend to finetune the model on your own data. Using this model means that your final layer will automatically be reinitialized when you start finetuning on your data.
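
In Python, the automatic reinitialization described above can be sketched as follows. This assumes the `transformers` library (with a PyTorch backend) is installed and that the checkpoint can be downloaded from the model hub; the helper name `load_for_finetuning` is hypothetical. Because the checkpoint has no classification head, `from_pretrained` creates a freshly initialized one sized by `num_labels`.

```python
MODEL_ID = "squeezebert/squeezebert-mnli-headless"


def load_for_finetuning(num_labels: int, model_id: str = MODEL_ID):
    """Load the tokenizer and a sequence classifier with a new, untrained head.

    Hypothetical helper: the import is kept inside the function so that
    defining it does not require `transformers` to be installed.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The encoder weights come from the checkpoint; the classification
    # layer is newly initialized with `num_labels` outputs.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model


# Example call (downloads the checkpoint); MRPC is a 2-class task:
# tokenizer, model = load_for_finetuning(num_labels=2)
```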

How to finetune

To try finetuning SqueezeBERT on the MRPC text classification task, you can run the following commands:

./utils/download_glue_data.py

python examples/text-classification/run_glue.py \
    --model_name_or_path squeezebert/squeezebert-mnli-headless \
    --task_name mrpc \
    --data_dir ./glue_data/MRPC \
    --output_dir ./models/squeezebert_mrpc \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --num_train_epochs 10 \
    --learning_rate 3e-05 \
    --per_device_train_batch_size 16 \
    --save_steps 20000

BibTeX entry and citation info

@article{2020_SqueezeBERT,
     author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
     title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
     journal = {arXiv:2006.11316},
     year = {2020}
}