Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/squeezebert/squeezebert-mnli-headless/README.md
README.md
ADDED
@@ -0,0 +1,67 @@

language: en
license: bsd
datasets:
- bookcorpus
- wikipedia


# SqueezeBERT pretrained model

This model, `squeezebert-mnli-headless`, has been pretrained for the English language using masked language modeling (MLM) and Sentence Order Prediction (SOP) objectives and finetuned on the [Multi-Genre Natural Language Inference (MNLI)](https://cims.nyu.edu/~sbowman/multinli/) dataset. This is a "headless" model with the final classification layer removed, which allows Transformers to automatically reinitialize the final classification layer before you begin finetuning on your data.
SqueezeBERT was introduced in [this paper](https://arxiv.org/abs/2006.11316). This model is case-insensitive. The model architecture is similar to BERT-base, but with the pointwise fully-connected layers replaced with [grouped convolutions](https://blog.yani.io/filter-group-tutorial/).
The authors found that SqueezeBERT is 4.3x faster than `bert-base-uncased` on a Google Pixel 3 smartphone.

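To make the grouped-convolution substitution concrete, the sketch below counts the parameters of one pointwise layer when it is written as a 1x1 convolution split into groups. The function name, the channel width of 768, and the group count of 4 are illustrative assumptions, not values taken from this model card.

```python
def pointwise_layer_params(c_in: int, c_out: int, groups: int = 1) -> int:
    """Count weights and biases of a 1x1 (pointwise) convolution with `groups` groups.

    With groups=1 the count matches a fully-connected layer applied
    independently at every sequence position.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group maps c_in/groups input channels to c_out/groups output channels.
    weights = (c_in // groups) * (c_out // groups) * groups
    biases = c_out
    return weights + biases


# Illustrative sizes only (assumed): 768 channels in and out, 4 groups.
dense = pointwise_layer_params(768, 768, groups=1)
grouped = pointwise_layer_params(768, 768, groups=4)
print(dense, grouped)
```

Each output channel only sees the input channels of its own group, so the weight count shrinks by roughly the number of groups.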

## Pretraining

### Pretraining data
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of thousands of unpublished books
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)


### Pretraining procedure
The model is pretrained using the Masked Language Model (MLM) and Sentence Order Prediction (SOP) tasks.
(Author's note: If you decide to pretrain your own model, and you prefer to train with MLM only, that should work too.)

From the SqueezeBERT paper:
> We pretrain SqueezeBERT from scratch (without distillation) using the [LAMB](https://arxiv.org/abs/1904.00962) optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of 2.5e-3, and a warmup proportion of 0.28. Following the LAMB paper's recommendations, we pretrain for 56k steps with a maximum sequence length of 128 and then for 6k steps with a maximum sequence length of 512.

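The quoted settings can be turned into rough numbers. The sketch below derives a warmup length from the warmup proportion and evaluates a learning-rate schedule at a few steps. The linear warmup followed by linear decay is an assumption made for illustration: the quote does not state the schedule shape, nor whether the warmup proportion applies to the 56k-step phase alone. All names are hypothetical.

```python
PEAK_LR = 2.5e-3          # learning rate from the quoted passage
GLOBAL_BATCH_SIZE = 8192  # global batch size from the quoted passage
PHASE1_STEPS = 56_000     # steps at maximum sequence length 128
PHASE2_STEPS = 6_000      # steps at maximum sequence length 512
WARMUP_PROPORTION = 0.28  # warmup proportion from the quoted passage


def warmup_steps(total_steps: int, proportion: float) -> int:
    """Number of warmup steps implied by a warmup proportion."""
    return round(total_steps * proportion)


def learning_rate(step, total_steps, peak_lr=PEAK_LR, proportion=WARMUP_PROPORTION):
    """Assumed schedule: linear warmup to peak_lr, then linear decay to zero."""
    warmup = warmup_steps(total_steps, proportion)
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (total_steps - step) / (total_steps - warmup)


# Total training sequences processed across both phases.
sequences_seen = GLOBAL_BATCH_SIZE * (PHASE1_STEPS + PHASE2_STEPS)

print(warmup_steps(PHASE1_STEPS, WARMUP_PROPORTION))
print(learning_rate(10_000, PHASE1_STEPS))
print(sequences_seen)
```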

## Finetuning

The SqueezeBERT paper presents two approaches to finetuning the model:
- "finetuning without bells and whistles": after pretraining the SqueezeBERT model, finetune it on each GLUE task
- "finetuning with bells and whistles": after pretraining the SqueezeBERT model, finetune it on MNLI with distillation from a teacher model. Then, use the MNLI-finetuned SqueezeBERT model as a student model to finetune on each of the other GLUE tasks (e.g. RTE, MRPC, …) with distillation from a task-specific teacher model.

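As a rough illustration of what "distillation from a teacher model" means in the second approach, the sketch below computes a soft-target loss between teacher and student logits for a single example. This is a generic temperature-scaled KL-divergence formulation, assumed here for illustration; the loss actually used in the SqueezeBERT paper may differ, and all names and logit values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over temperature-softened class distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


# Hypothetical logits for a 3-class task such as MNLI.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, -0.5]
print(distillation_loss(student, teacher, temperature=2.0))
```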

A detailed discussion of the hyperparameters used for finetuning is provided in the appendix of the [SqueezeBERT paper](https://arxiv.org/abs/2006.11316).
Note that finetuning SqueezeBERT with distillation is not yet implemented in this repo. If the author (Forrest Iandola, forrest.dnn@gmail.com) gets enough encouragement from the user community, he will add example code to Transformers for finetuning SqueezeBERT with distillation.

This model, `squeezebert/squeezebert-mnli-headless`, is the "finetuned with bells and whistles" MNLI-finetuned SqueezeBERT model. In this particular model, we have removed the final classification layer; in other words, it is "headless." We recommend using this model if you intend to finetune the model on your own data. Using this model means that your final layer will automatically be reinitialized when you start finetuning on your data.

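A minimal sketch of loading this checkpoint from Python for finetuning, assuming the `transformers` library and PyTorch are installed and the model ID named above is available on the Hugging Face Hub. The helper name and the default of two labels are illustrative assumptions.

```python
MODEL_ID = "squeezebert/squeezebert-mnli-headless"


def load_for_finetuning(num_labels: int = 2):
    """Return (tokenizer, model) with a new classification head of size num_labels."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # The checkpoint has no classification layer, so Transformers creates one
    # with random weights; it must be trained before predictions are meaningful.
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_for_finetuning(num_labels=2)` downloads the weights on first use. Transformers typically logs a warning that some weights were newly initialized, which is expected for a headless checkpoint.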

### How to finetune
To try finetuning SqueezeBERT on the [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) text classification task, you can run the following commands:
```
./utils/download_glue_data.py

python examples/text-classification/run_glue.py \
    --model_name_or_path squeezebert/squeezebert-mnli-headless \
    --task_name mrpc \
    --data_dir ./glue_data/MRPC \
    --output_dir ./models/squeezebert_mrpc \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --num_train_epochs 10 \
    --learning_rate 3e-05 \
    --per_device_train_batch_size 16 \
    --save_steps 20000

```


## BibTeX entry and citation info
```
@article{2020_SqueezeBERT,
  author = {Forrest N. Iandola and Albert E. Shaw and Ravi Krishna and Kurt W. Keutzer},
  title = {{SqueezeBERT}: What can computer vision teach NLP about efficient neural networks?},
  journal = {arXiv:2006.11316},
  year = {2020}
}
```
