---
license: apache-2.0
---
|
|
|
QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
|
The article discusses the challenge of making transformer-based models efficient enough for practical use, given their size and computational requirements. The authors propose a new approach, QuaLA-MiniLM, which combines knowledge distillation, the length-adaptive transformer (LAT) technique, and low-bit quantization. A single model is trained once and can then adapt to any inference scenario within a given computational budget, achieving a superior accuracy-efficiency trade-off on the SQuAD1.1 dataset. Compared with other efficient approaches, QuaLA-MiniLM achieves up to an 8.8x speedup with less than 1% accuracy loss. The code is publicly available on GitHub. The article also reviews related work, including dynamic transformers and other knowledge distillation approaches.
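Since the model targets extractive question answering on SQuAD1.1, a minimal usage sketch with the Hugging Face `transformers` question-answering pipeline might look like the following. The checkpoint path below is a placeholder, not the published model ID, and length-adaptive/quantized inference may require the authors' own export and runtime steps rather than a plain pipeline call.

```python
# Minimal sketch, not the authors' reference code.
# Assumption: "path/to/quala-minilm-squad" is a placeholder; substitute the
# actual QuaLA-MiniLM checkpoint published by the authors.
from transformers import pipeline

# Load an extractive QA pipeline from the (placeholder) checkpoint.
qa = pipeline(
    "question-answering",
    model="path/to/quala-minilm-squad",
)

# Ask a question against a short context, SQuAD-style.
result = qa(
    question="What techniques does QuaLA-MiniLM combine?",
    context=(
        "QuaLA-MiniLM combines knowledge distillation, the length-adaptive "
        "transformer (LAT) technique, and low-bit quantization to trade off "
        "accuracy against inference cost."
    ),
)
print(result["answer"], result["score"])
```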
|