Model Details: QuaLA-MiniLM

Transformer-based models are often too large and computationally expensive for practical deployment. QuaLA-MiniLM addresses this by combining knowledge distillation, the length-adaptive transformer (LAT) technique, and low-bit quantization, expanding on the Dynamic-TinyBERT approach. A single trained model can adapt to any inference scenario with a given computational budget, and it achieves a superior accuracy-efficiency trade-off on the SQuAD1.1 dataset. Compared with other efficient methods, it reaches up to an 8.8x speedup with less than 1% accuracy loss. The authors' code is publicly available on GitHub. The paper also reviews related work, including dynamic transformers and other knowledge distillation approaches.
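As a rough illustration of what low-bit quantization does to a tensor, the sketch below applies symmetric per-tensor 8-bit quantization to a few made-up weights. The scheme, the function names, and the sample values are all assumptions for illustration; the card does not specify the exact quantization recipe used for QuaLA-MiniLM.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; use 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from int8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored as one byte plus a shared scale, and the rounding error per value is bounded by half the scale. That bound is the source of the small accuracy loss that quantized models typically trade for smaller size and faster inference.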

This model card was written by Intel.

QuaLA-MiniLM training process

[Figure: QuaLA-MiniLM training process.]

To run the model with the best accuracy-efficiency tradeoff for a specific computational budget, we set the length configuration to the best setting found by an evolutionary search to match our computational constraint.
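A length configuration can be read as the number of tokens kept after each layer. The sketch below shows that idea on a toy 8-token input with three layers. The importance scores are invented and held fixed across layers, and `apply_length_config` is a hypothetical helper, so this is only a simplified picture of token dropping, not the LAT implementation or its evolutionary search.

```python
from typing import List, Sequence, Tuple


def apply_length_config(
    scores: Sequence[float], length_config: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Drop low-scoring token positions layer by layer.

    scores: one (assumed, fixed) importance score per input position.
    length_config: number of tokens to keep after each layer.
    Returns the surviving positions and the token count after each layer.
    """
    positions = list(range(len(scores)))
    tokens_per_layer = []
    for keep in length_config:
        # Rank the tokens that are still alive and keep the top `keep`.
        ranked = sorted(positions, key=lambda p: scores[p], reverse=True)
        positions = sorted(ranked[:keep])  # restore original token order
        tokens_per_layer.append(len(positions))
    return positions, tokens_per_layer


# Toy example: 8 tokens, 3 layers, made-up scores.
scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6]
length_config = (6, 4, 2)
survivors, counts = apply_length_config(scores, length_config)
print(survivors, counts)
```

Later layers process fewer tokens, which is where the compute savings come from. A real length-adaptive model would recompute token importance at every layer, and a question-answering model still needs outputs for all input positions to predict an answer span. This sketch leaves both out.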

Model license

Licensed under MIT license.

| Model Detail | Description |
| --- | --- |
| Language | en |
| Model Authors Company | Intel |
| Date | May 4, 2023 |
| Version | 1 |
| Type | NLP - Tiny language model |
| Architecture | "In this work we expand Dynamic-TinyBERT to generate a much more highly efficient model. First, we use a much smaller MiniLM model which was distilled from a RoBERTa-Large teacher rather than BERT-base. Second, we apply the LAT method to make the model length-adaptive, and finally we further enhance the model's efficiency by applying 8-bit quantization. The resultant QuaLA-MiniLM (Quantized Length-Adaptive MiniLM) model outperforms BERT-base with only 30% of parameters, and demonstrates an accuracy-speedup tradeoff that is superior to any other efficiency approach (up to x8.8 speedup with <1% accuracy loss) on the challenging SQuAD1.1 benchmark. Following the concept presented by LAT, it provides a wide range of accuracy-efficiency tradeoff points while alleviating the need to retrain it for each point along the accuracy-efficiency curve." |
| Paper or Other Resources | https://arxiv.org/pdf/2210.17114.pdf |
| License | TBD |
| Questions or Comments | Community Tab and Intel Developers Discord |

| Intended Use | Description |
| --- | --- |
| Primary intended uses | TBD |
| Primary intended users | Anyone who needs an efficient tiny language model for other downstream tasks. |
| Out-of-scope uses | The model should not be used to intentionally create hostile or alienating environments for people. |

How to use

Code examples coming soon!

import ...
 

Metrics (Model Performance):

Inference performance on the SQuAD1.1 evaluation dataset. For each length-adaptive (LA) model, we report performance both without token dropping and with token dropping, using the optimal length configuration found to meet our accuracy constraint.

| Model | Model size (MB) | Tokens per layer | Accuracy (F1) | Latency (ms) | FLOPs | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| BERT-base | 415.4723 | (384, 384, 384, 384, 384, 384) | 88.5831 | 56.5679 | 3.53E+10 | 1x |
| TinyBERT-ours | 253.2077 | (384, 384, 384, 384, 384, 384) | 88.3959 | 32.4038 | 1.77E+10 | 1.74x |
| QuaTinyBERT-ours | 132.0665 | (384, 384, 384, 384, 384, 384) | 87.6755 | 15.5850 | 1.77E+10 | 3.63x |
| MiniLMv2-ours | 115.0473 | (384, 384, 384, 384, 384, 384) | 88.7016 | 18.2312 | 4.76E+09 | 3.10x |
| QuaMiniLMv2-ours | 84.8602 | (384, 384, 384, 384, 384, 384) | 88.5463 | 9.1466 | 4.76E+09 | 6.18x |
| LA-MiniLM | 115.0473 | (384, 384, 384, 384, 384, 384) | 89.2811 | 16.9900 | 4.76E+09 | 3.33x |
| LA-MiniLM | 115.0473 | (269, 253, 252, 202, 104, 34) | 87.7637 | 11.4428 | 2.49E+09 | 4.94x |
| QuaLA-MiniLM | 84.8596 | (384, 384, 384, 384, 384, 384) | 88.8593 | 7.4443 | 4.76E+09 | 7.6x |
| QuaLA-MiniLM | 84.8596 | (315, 251, 242, 159, 142, 33) | 87.6828 | 6.4146 | 2.547E+09 | 8.8x |
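The Speedup column is consistent with the ratio of BERT-base latency to each model's latency. The short sketch below recomputes the headline figures for the token-dropping QuaLA-MiniLM row from the values reported above. The function and variable names are illustrative, and the F1 difference is taken in absolute points, which is an assumption about how the "<1% accuracy loss" claim is measured.

```python
def speedup(baseline_latency_ms: float, latency_ms: float) -> float:
    """Latency ratio relative to the baseline (higher is faster)."""
    return baseline_latency_ms / latency_ms


def f1_drop(baseline_f1: float, f1: float) -> float:
    """Absolute F1 points lost relative to the baseline."""
    return baseline_f1 - f1


# Values copied from the metrics table above.
bert_base = {"f1": 88.5831, "latency_ms": 56.5679}
quala_dropped = {"f1": 87.6828, "latency_ms": 6.4146}

quala_speedup = speedup(bert_base["latency_ms"], quala_dropped["latency_ms"])
quala_f1_drop = f1_drop(bert_base["f1"], quala_dropped["f1"])

print(f"speedup: {quala_speedup:.1f}x, F1 drop: {quala_f1_drop:.2f} points")
```

The same two functions can be applied to any other row to compare its latency gain against its F1 cost.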

Training and Evaluation Data

| Training and Evaluation Data | Description |
| --- | --- |
| Datasets | SQuAD1.1 dataset |
| Motivation | To build an efficient and accurate base model for several downstream language tasks. |

Ethical Considerations

| Ethical Considerations | Description |
| --- | --- |
| Data | SQuAD1.1 dataset |
| Human life | The model is not intended to inform decisions central to human life or flourishing. It was trained on an aggregated set of labelled Wikipedia articles. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al., 2021, and Bender et al., 2021). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. Beyond this, the extent of the risks involved in using the model remains unknown. |

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. There are no additional caveats or recommendations for this model.

BibTeX entry and citation info

| comments | description |
| --- | --- |
| Comments | In this version we added reference to the source code in the abstract. arXiv admin note: text overlap with arXiv:2111.09645 |
| Subjects | Computation and Language (cs.CL) |
| Cite as | arXiv:2210.17114 [cs.CL] (or arXiv:2210.17114v2 [cs.CL] for this version) |
| DOI | https://doi.org/10.48550/arXiv.2210.17114 |

Collection including Intel/dynamic-minilmv2-L6-H384-squad1.1-int8-static