---
pipeline_tag: text-generation
tags:
- openvino
- mpt
- sparse
- quantization
library_name: "OpenVINO"
---

The intent of this repo is to measure the performance delta between dense quantized MPT-7B and 70% sparse-quantized MPT-7B on OpenVINO. Quantization is 8-bit on both weights and activations (W8A8). The benchmark metric is decoding (next-token) latency at a context length of 512.

Target HW: Intel 4th Gen Xeon (Sapphire Rapids)

SW:
```
git clone https://huggingface.co/vuiseng9/ov-mpt-7b-gsm8k-sparse70
pip install openvino==2024.2.0
```

## Benchmarking with OpenVINO
1. `./benchmarkapp_w8a8.bash`
2. `./benchmarkapp_w8a8_sparse70.bash`

Note: remove the `numactl` prefix in the scripts if your node does not support it. (A hedged sketch of such an invocation is given in the appendix at the end of this card.)

## Implementation of Sparse Weight Decompression in OpenVINO
* This is the first commit of Sparse Weight Decompression in OpenVINO's fork of oneDNN: https://github.com/openvinotoolkit/oneDNN/pull/158/files
  * You can browse the changed files via the left pane.
  * Initialization: src/cpu/reorder/simple_sparse_reorder.hpp ([line 113](https://github.com/openvinotoolkit/oneDNN/pull/158/files#diff-f1445f832cd9979d9756873e3d8c30716976f51b6ce4640eae12762a417284e3R113))
  * Decompression: src/cpu/x64/jit_brgemm_decompress_kernel.cpp ([line 41](https://github.com/openvinotoolkit/oneDNN/pull/158/files#diff-98844e424b6687de78d47737e62f206dc9befcec6887dac8b2c52d0303dd3576R41))
* If you'd like to build the OpenVINO runtime from source for debugging, [see the wiki page](https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md); `benchmark_app` is compiled as part of that build (a build sketch is included in the appendix below).

## Related materials
[OpenVINO blog on Sparse-Quantized BERT](https://blog.openvino.ai/blog-posts/accelerate-inference-of-sparse-transformer-models-with-openvino-tm-and-4th-gen-intel-r-xeon-r-scalable-processors) ([corresponding notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/116-sparsity-optimization/116-sparsity-optimization.ipynb))
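
## Appendix: hypothetical command sketches

Below is a minimal sketch of what a latency run along the lines of the scripts above could look like. It is an assumption, not the actual contents of `benchmarkapp_w8a8_sparse70.bash`: the model path, NUMA pinning, config file name, and 0.7 threshold are all hypothetical. The `CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE` key is the CPU-plugin property used in the sparsity notebook linked above; the device-keyed `-load_config` JSON form may vary across OpenVINO versions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch only -- the repo's benchmarkapp_w8a8_sparse70.bash
# may differ in paths, flags, and pinning. Assumes benchmark_app is on PATH.

# Ask the CPU plugin to take the sparse-decompression path for layers at
# or above 70% sparsity (threshold value is an assumption for this model).
cat > sparse_cfg.json <<'EOF'
{"CPU": {"CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": "0.7"}}
EOF

# Pin compute and memory to one NUMA node; drop the numactl prefix
# if your node does not support it.
numactl --cpunodebind=0 --membind=0 \
  benchmark_app -m ov-mpt-7b-gsm8k-sparse70/openvino_model.xml \
                -d CPU -hint latency \
                -load_config sparse_cfg.json
```

For the dense W8A8 baseline, the same invocation without `-load_config` would apply.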
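
If you prefer to study the oneDNN changes offline rather than in the GitHub UI, the PR diff can be fetched directly (GitHub serves a plain-text `.diff` for any pull request):

```bash
# Download the PR diff for offline browsing and locate the two files
# referenced in the implementation section above.
curl -L https://github.com/openvinotoolkit/oneDNN/pull/158.diff -o pr158.diff
grep -n -E "simple_sparse_reorder|jit_brgemm_decompress" pr158.diff
```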
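
Finally, a sketch of a from-source Debug build, assuming the generic Linux flow from the build wiki linked above (platform-specific prerequisites and CMake options are covered there):

```bash
# Sketch of a Debug build following the generic flow in the build wiki;
# exact prerequisites and options depend on your platform.
git clone --recursive https://github.com/openvinotoolkit/openvino.git
cd openvino
sudo ./install_build_dependencies.sh   # Linux dependency helper in the repo root
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
cmake --build . --parallel
```

Per the note above, `benchmark_app` should come out of this build as well.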