Model Details: int8 1x4 Sparse DistilBERT

The article discusses how to make inference of transformer-based models more efficient on Intel hardware. The authors propose a 1x4 sparse block pattern that fits Intel instructions and improves performance. We implement 1x4 block pruning and obtain an 80% sparse model on the SQuAD1.1 dataset. Combined with quantization, it achieves up to a 24.2x speedup with less than 1% accuracy loss. The article also reports performance gains for other models using this approach. This model card was written by Intel.
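To make the 1x4 pattern concrete, the following is a minimal sketch of block pruning on a toy weight matrix. It is not the authors' training pipeline: it assumes a one-shot, magnitude-based criterion (the L1 norm of each block) and assumes a block is four consecutive entries along a row. The helper names `prune_1x4` and `sparsity_of` are hypothetical and exist only for this illustration.

```python
def prune_1x4(weights, sparsity):
    """Zero out the lowest-magnitude 1x4 blocks of a 2-D weight matrix.

    Assumption: a block is 4 consecutive entries in one row, and blocks
    are ranked by the sum of absolute values (one-shot magnitude pruning).
    """
    block = 4
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")

    # Score every 1x4 block by its L1 norm.
    scored = []
    for r, row in enumerate(weights):
        if len(row) % block != 0:
            raise ValueError("row length must be a multiple of 4")
        for c in range(0, len(row), block):
            score = sum(abs(v) for v in row[c:c + block])
            scored.append((score, r, c))

    # Prune the requested fraction of blocks, lowest scores first.
    n_prune = round(sparsity * len(scored))
    scored.sort(key=lambda item: item[0])

    pruned = [list(row) for row in weights]  # leave the input untouched
    for _, r, c in scored[:n_prune]:
        pruned[r][c:c + block] = [0.0] * block
    return pruned


def sparsity_of(weights):
    """Return the fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for v in row if v == 0.0)
    return zeros / total


# Toy 2x20 matrix (10 blocks) with strictly increasing magnitudes.
weights = [[(r * 20 + c + 1) * 0.1 for c in range(20)] for r in range(2)]
pruned = prune_1x4(weights, sparsity=0.8)
print(sparsity_of(pruned))
```

Because whole blocks are removed, the zeros always come in aligned groups of four, which is the regularity a sparse kernel can exploit. Any block that survives keeps its original values.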

Model license

Licensed under MIT license.

Model Detail	Description
Language	en
Model Authors Company	Intel
Date	June 7, 2023
Version	1
Type	NLP - Tiny language model
Architecture	"we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model showing minimal accuracy loss on the question-answering SQuADv1.1 benchmark, and throughput results under typical production constraints and environments. Our results outperform existing state-of-the-art Neural Magic's DeepSparse runtime performance by up to 50% and up to 4.1x performance speedup over ONNX Runtime."
Paper or Other Resources	https://arxiv.org/abs/2211.07715.pdf
License	TBD

How to use

Please follow the README at https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/pytorch/text-classification/deployment/sparse/distilbert_base_uncased

Intel/distilbert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa-int8
