# Model Details: QuaLA-MiniLM
Making transformer-based models efficient enough for practical use is challenging because of their size and computational requirements. **QuaLA-MiniLM** addresses this by combining knowledge distillation, the length-adaptive transformer (LAT) technique, and low-bit quantization, expanding on the Dynamic-TinyBERT approach. It trains a single model that can adapt to any inference scenario with a given computational budget, achieving a superior accuracy-efficiency trade-off on the SQuAD1.1 dataset. Compared with other efficient methods, it achieves up to an **8.8x speedup with less than 1% accuracy loss**. The code is publicly available on GitHub. The accompanying paper also discusses related work, including dynamic transformers and other knowledge distillation approaches.

This model card was written by Intel.

### QuaLA-MiniLM training process
The figure below shows the QuaLA-MiniLM training process. To run the model with the best accuracy-efficiency tradeoff for a specific computational budget, we set the length configuration to the best setting found by an evolutionary search that satisfies our computational constraint.
![ArchitecureQuaLA-MiniLM.jpg](ArchitecureQuaLA-MiniLM.jpg)
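
As a rough illustration of what a length configuration does at inference time, the sketch below shrinks the set of active tokens layer by layer to match a per-layer budget. It is a minimal sketch under assumptions, not the authors' implementation: `keep_top_tokens`, `apply_length_config`, and the scoring function are hypothetical names, and the toy score simply favours earlier positions. The way tokens are actually ranked is described in the paper and the GitHub repo.

```python
from typing import Callable, List, Sequence, Tuple


def keep_top_tokens(scores: Sequence[float], keep: int) -> List[int]:
    """Return the positions of the `keep` highest-scoring tokens, in original order."""
    keep = min(keep, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def apply_length_config(
    num_tokens: int,
    length_config: Sequence[int],
    score_fn: Callable[[int, int], float],
) -> Tuple[List[int], List[int]]:
    """Walk the layers, shrinking the active token set to each layer's budget.

    `score_fn(layer, token_id)` is a hypothetical importance score (an assumption
    made for this sketch); higher scores are kept.
    """
    active = list(range(num_tokens))
    history = []
    for layer, keep in enumerate(length_config):
        scores = [score_fn(layer, tok) for tok in active]
        kept_positions = keep_top_tokens(scores, keep)
        active = [active[p] for p in kept_positions]
        history.append(len(active))
    return active, history


# One of the length configurations reported in the metrics table of this card.
config = (315, 251, 242, 159, 142, 33)
# Toy score: earlier tokens are treated as more important (illustration only).
active, history = apply_length_config(384, config, lambda layer, tok: -tok)
print(history)
```

Because each layer only ever sees the tokens that survived the previous one, a decreasing configuration reduces the work done in later layers without changing the model weights.
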

### Model license
Licensed under MIT license.

| Model Detail | Description |
| ----         | ---         |
| Language | en |
| Model Authors Company | Intel |
| Date | May 4, 2023 |
| Version | 1 |
| Type | NLP - Tiny language model|
| Architecture | "In this work we expand Dynamic-TinyBERT to generate a much more highly efficient model. First, we use a much smaller MiniLM model which was distilled from a RoBERTa-Large teacher rather than BERT-base. Second, we apply the LAT method to make the model length-adaptive, and finally we further enhance the model’s efficiency by applying 8-bit quantization. The resultant QuaLAMiniLM (Quantized Length-Adaptive MiniLM) model outperforms BERT-base with only 30% of parameters, and demonstrates an accuracy-speedup tradeoff that is superior to any other efficiency approach (up to x8.8 speedup with <1% accuracy loss) on the challenging SQuAD1.1 benchmark. Following the concept presented by LAT, it provides a wide range of accuracy-efficiency tradeoff points while alleviating the need to retrain it for each point along the accuracy-efficiency curve." |
| Paper or Other Resources | https://arxiv.org/pdf/2210.17114.pdf |
| License | MIT |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/bert-base-uncased-sparse-90-unstructured-pruneofa/discussions) and [Intel Developers Discord](https://discord.gg/rv2Gp55UJQ) |
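
To make the 8-bit quantization step mentioned in the architecture description more concrete, here is a generic symmetric per-tensor int8 scheme in plain Python. This is an illustrative assumption, not necessarily the quantization recipe used for QuaLA-MiniLM; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple


def quantize_int8(values: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The round-trip error of each value is bounded by half a quantization step.
print(q, scale, restored)
```

Storing one byte per weight instead of four is what shrinks the model file, at the cost of a small rounding error per value.
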


| Intended Use | Description |
| --- | --- |
| Primary intended uses | TBD |
| Primary intended users | Anyone who needs an efficient tiny language model for other downstream tasks.|
|Out-of-scope uses|The model should not be used to intentionally create hostile or alienating environments for people.|

### How to use

Code examples coming soon!

```python
import ...
 
```

For more code examples, refer to the GitHub Repo.

### Metrics (Model Performance):

Inference performance on the SQuAD1.1 evaluation dataset. For all the length-adaptive
(LA) models we show the performance both of running the model without token-dropping, and of
running the model in a token-dropping configuration according to the optimal length configuration
found to meet our accuracy constraint.

|Model | Model size (MB) |Tokens per layer |Accuracy (F1) | Latency (ms) | FLOPs | Speedup|
| --- | ---              | ---              | ---         | ---          | ---   | ---    |
|BERT-base |415.4723 |(384,384,384,384,384,384) |88.5831 |56.5679 |3.53E+10 |1x|
|TinyBERT-ours |253.2077 |(384,384,384,384,384,384) |88.3959 |32.4038 |1.77E+10 |1.74x|
|QuaTinyBERT-ours |132.0665 |(384,384,384,384,384,384) |87.6755 |15.5850 |1.77E+10 |3.63x|
|MiniLMv2-ours |115.0473 |(384,384,384,384,384,384) |88.7016 |18.2312 |4.76E+09 |3.10x|
|QuaMiniLMv2-ours |84.8602 |(384,384,384,384,384,384) |88.5463 |9.1466 |4.76E+09 |6.18x|
|LA-MiniLM |115.0473 |(384,384,384,384,384,384) |89.2811 |16.9900 |4.76E+09 |3.33x|
|LA-MiniLM |115.0473 |(269, 253, 252, 202, 104, 34) |87.7637 |11.4428 |2.49E+09 |4.94x|
|QuaLA-MiniLM |84.8596 |(384,384,384,384,384,384) |88.8593 |7.4443 |4.76E+09 |7.6x|
|QuaLA-MiniLM |84.8596 |(315,251,242,159,142,33) |87.6828 |6.4146 |2.547E+09 |8.8x|
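
The speedup figures can be sanity-checked from the latency column. The sketch below assumes that speedup is the ratio of the BERT-base latency to each model's latency; the card does not state this definition explicitly, but the reported numbers are consistent with it. The dictionary keys and the `speedup` helper are names chosen for this example.

```python
# Latencies (ms) copied from the table above; BERT-base is the baseline.
BASELINE_MS = 56.5679

latency_ms = {
    "TinyBERT-ours": 32.4038,
    "QuaTinyBERT-ours": 15.5850,
    "MiniLMv2-ours": 18.2312,
    "QuaMiniLMv2-ours": 9.1466,
    "LA-MiniLM (no token dropping)": 16.9900,
    "LA-MiniLM (token dropping)": 11.4428,
    "QuaLA-MiniLM (no token dropping)": 7.4443,
    "QuaLA-MiniLM (token dropping)": 6.4146,
}


def speedup(baseline_ms: float, model_ms: float) -> float:
    """Speedup relative to the baseline, assumed to be a plain latency ratio."""
    return baseline_ms / model_ms


for name, ms in latency_ms.items():
    print(f"{name}: {speedup(BASELINE_MS, ms):.2f}x")
```

Under this assumption the computed ratios agree with the Speedup column to within roughly 0.02, with the small differences attributable to rounding in the table.
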

### Training and Evaluation Data

| Training and Evaluation Data | Description |
| --- | --- |
|Datasets|SQuAD1.1 dataset | 
|Motivation | To build an efficient and accurate base model for several downstream language tasks. |

### Ethical Considerations 
|Ethical Considerations|Description|
| --- | --- |
|Data | SQuAD1.1 dataset |
| Human life | The model is not intended to inform decisions central to human life or flourishing. The training data is an aggregated set of labelled Wikipedia articles. |
|Mitigations| No additional risk mitigation strategies were considered during model development. |
|Risks and harms| Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al., 2021, and Bender et al., 2021). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. Beyond this, the extent of the risks involved in using the model remains unknown. |


### Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. There are no additional caveats or recommendations for this model.



### BibTeX entry and citation info
| Field | Description |
| --- | --- |
| Comments | In this version we added reference to the source code in the abstract. arXiv admin note: text overlap with arXiv:2111.09645 |
| Subjects | Computation and Language (cs.CL) |
| Cite as | arXiv:2210.17114 [cs.CL] (or arXiv:2210.17114v2 [cs.CL] for this version), https://doi.org/10.48550/arXiv.2210.17114 |