---
pipeline_tag: text-generation
tags:
- openvino
- mpt
- sparse
- quantization
library_name: "OpenVINO"
---

The intent of this repo is to measure the performance delta between dense quantized MPT-7B and 70% sparse-quantized MPT-7B on OpenVINO. Quantization is 8-bit on both weights and activations (W8A8). The benchmark metric is decoding (next-token) latency at a context length of 512.

Target HW: Intel 4th Gen Xeon (Sapphire Rapids)

SW:
```
git clone https://huggingface.co/vuiseng9/ov-mpt-7b-gsm8k-sparse70
pip install openvino==2024.2.0
```

## Benchmarking with OpenVINO
1. `./benchmarkapp_w8a8.bash`
2. `./benchmarkapp_w8a8_sparse70.bash`

Note: remove the `numactl` prefix in the scripts if your node does not support it. (A hedged sketch of such an invocation is given in the appendix at the end of this card.)

## Implementation of Sparse Weight Decompression in OpenVINO
* This is the first commit of Sparse Weight Decompression in OpenVINO's fork of oneDNN: https://github.com/openvinotoolkit/oneDNN/pull/158/files
  * You can browse the changed files via the left pane.
  * Initialization: src/cpu/reorder/simple_sparse_reorder.hpp ([line 113](https://github.com/openvinotoolkit/oneDNN/pull/158/files#diff-f1445f832cd9979d9756873e3d8c30716976f51b6ce4640eae12762a417284e3R113))
  * Decompression: src/cpu/x64/jit_brgemm_decompress_kernel.cpp ([line 41](https://github.com/openvinotoolkit/oneDNN/pull/158/files#diff-98844e424b6687de78d47737e62f206dc9befcec6887dac8b2c52d0303dd3576R41))
* If you'd like to build the OpenVINO runtime from source for debugging, [see the wiki page](https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md); `benchmark_app` is compiled as part of that build (a build sketch is included in the appendix below).

## Related materials
[OpenVINO blog on Sparse-Quantized BERT](https://blog.openvino.ai/blog-posts/accelerate-inference-of-sparse-transformer-models-with-openvino-tm-and-4th-gen-intel-r-xeon-r-scalable-processors) ([corresponding notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/116-sparsity-optimization/116-sparsity-optimization.ipynb))
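
## Appendix: hypothetical command sketches

Below is a minimal sketch of what a latency run along the lines of the scripts above could look like. It is an assumption, not the actual contents of `benchmarkapp_w8a8_sparse70.bash`: the model path, NUMA pinning, config file name, and 0.7 threshold are all hypothetical. The `CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE` key is the CPU-plugin property used in the sparsity notebook linked above; the device-keyed `-load_config` JSON form may vary across OpenVINO versions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch only -- the repo's benchmarkapp_w8a8_sparse70.bash
# may differ in paths, flags, and pinning. Assumes benchmark_app is on PATH.

# Ask the CPU plugin to take the sparse-decompression path for layers at
# or above 70% sparsity (threshold value is an assumption for this model).
cat > sparse_cfg.json <<'EOF'
{"CPU": {"CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": "0.7"}}
EOF

# Pin compute and memory to one NUMA node; drop the numactl prefix
# if your node does not support it.
numactl --cpunodebind=0 --membind=0 \
  benchmark_app -m ov-mpt-7b-gsm8k-sparse70/openvino_model.xml \
                -d CPU -hint latency \
                -load_config sparse_cfg.json
```

For the dense W8A8 baseline, the same invocation without `-load_config` would apply.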
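
If you prefer to study the oneDNN changes offline rather than in the GitHub UI, the PR diff can be fetched directly (GitHub serves a plain-text `.diff` for any pull request):

```bash
# Download the PR diff for offline browsing and locate the two files
# referenced in the implementation section above.
curl -L https://github.com/openvinotoolkit/oneDNN/pull/158.diff -o pr158.diff
grep -n -E "simple_sparse_reorder|jit_brgemm_decompress" pr158.diff
```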
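
Finally, a sketch of a from-source Debug build, assuming the generic Linux flow from the build wiki linked above (platform-specific prerequisites and CMake options are covered there):

```bash
# Sketch of a Debug build following the generic flow in the build wiki;
# exact prerequisites and options depend on your platform.
git clone --recursive https://github.com/openvinotoolkit/openvino.git
cd openvino
sudo ./install_build_dependencies.sh   # Linux dependency helper in the repo root
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
cmake --build . --parallel
```

Per the note above, `benchmark_app` should come out of this build as well.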