Sparse LLM Inference on CPU

Community Article Published October 18, 2023


Fine-tuning large language models to obtain a small but accurate model is difficult because you must strike a balance between the model's size and its accuracy. Researchers from IST Austria and Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning to MPT with remarkable results: the model was pruned to 75% sparsity without a drop in accuracy, matching the performance of quantization approaches.

Sparse Finetuning for Inference Acceleration of Large Language Models

Notably, the resulting sparse model executes fast on CPUs by taking advantage of sparsity. Instead of performing standard loss-based fine-tuning, which may fail to recover accuracy, the researchers experiment with distillation-type losses, which are better at recovering accuracy at high sparsity.

What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.

This post will dive into more details from this paper.

The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization, in which the precision of the weights is reduced, typically to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.
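To see why very low bit-widths hurt accuracy, here is a minimal sketch of symmetric uniform quantization. This is a simplified stand-in for the actual quantization methods discussed in the paper, and `quantize_dequantize` is a hypothetical helper: fewer bits mean a coarser grid and a larger reconstruction error.

```python
# Hypothetical helper illustrating uniform weight quantization (not the
# paper's method): map float weights to an n-bit integer grid and back.
def quantize_dequantize(weights, bits):
    qmax = 2 ** (bits - 1) - 1          # symmetric signed range, e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    quantized = [round(w / scale) for w in weights]
    return [q * scale for q in quantized]

weights = [0.42, -1.3, 0.07, 0.9, -0.55]
w4 = quantize_dequantize(weights, bits=4)   # coarse grid, visible rounding error
w8 = quantize_dequantize(weights, bits=8)   # finer grid, much smaller error

err4 = sum((a - b) ** 2 for a, b in zip(weights, w4))
err8 = sum((a - b) ** 2 for a, b in zip(weights, w8))
print(err4 > err8)  # True: fewer bits -> larger reconstruction error
```

The gap between 4-bit and 8-bit error grows as the grid coarsens, which is consistent with the observation that recovery becomes hard around 3 bits per weight.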

Introducing weight sparsity, where certain connections in the network are set to zero, is an alternative to quantization. Sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning of large language models for the following applications:

  • Speech transcription using Whisper
  • Machine translation using T5
  • Higher-level reasoning using the open GPT-type MPT model
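As a rough illustration of what weight sparsity means, here is a sketch of simple magnitude pruning. The paper uses SparseGPT, which is more sophisticated; `prune_by_magnitude` is a hypothetical helper that just zeroes the smallest-magnitude weights.

```python
# Toy unstructured pruning: zero out the smallest weights until the
# requested fraction (the sparsity level) of connections is zero.
def prune_by_magnitude(weights, sparsity):
    k = int(len(weights) * sparsity)              # number of weights to zero
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else 0.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.9, -0.02, 0.4, 0.003, -0.7, 0.05, -0.1, 0.8]
sparse = prune_by_magnitude(weights, sparsity=0.75)

density = sum(1 for w in sparse if w != 0) / len(sparse)
print(density)  # 0.25 -> 75% of the connections are now zero
```

A 75%-sparse model like the pruned MPT keeps only one in four weights, which is what makes the compressed storage and compute skipping described later possible.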

Challenges of Large Language Models

While fine-tuning the large language models, the researchers faced and resolved several challenges, stemming from two facts:

  • The fine-tuning data may be much smaller than the original training data
  • The desired sparsity levels are high

The main challenges are loss spikes at high sparsity levels, the inability to recover the original accuracy, and overfitting on the limited fine-tuning data. The researchers address these challenges by combining several methods, including a type of per-token ℓ2 knowledge distillation and SparseGPT. They call the method SquareHead distillation and show that it can recover accuracy even at higher sparsity levels. They then run the resulting model with the DeepSparse inference runtime to show that it benefits from accelerated inference on CPU.
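The per-token ℓ2 idea can be sketched as follows. This is a deliberate simplification, not the paper's exact SquareHead formulation (which operates on intermediate feature maps with its own normalization): for each token position, penalize the squared distance between the student's and teacher's representations.

```python
# Simplified per-token L2 distillation loss (illustration only, not the
# paper's exact SquareHead loss): average, over token positions, the squared
# distance between student and teacher hidden vectors.
def per_token_l2_loss(student_hidden, teacher_hidden):
    # Both inputs: one feature vector per token, same shapes.
    total, n_tokens = 0.0, len(student_hidden)
    for s_vec, t_vec in zip(student_hidden, teacher_hidden):
        total += sum((s - t) ** 2 for s, t in zip(s_vec, t_vec))
    return total / n_tokens

teacher = [[1.0, 0.5], [0.2, -0.3]]   # dense baseline's hidden states
student = [[0.8, 0.5], [0.2, -0.1]]   # sparse model's hidden states
print(round(per_token_l2_loss(student, teacher), 6))  # 0.04
```

Because the loss supervises every token position rather than only the final prediction, it gives the sparse student a much denser training signal, which is one intuition for why distillation-type losses recover accuracy better on limited fine-tuning data.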


Compression of Generative Models

The authors investigated the compression of the open-source Mosaic pre-trained model MPT-7B for generative tasks. The model was sparse fine-tuned on GSM8K, a dataset with high-quality and diverse grade school math problems.

On this task, the base model achieved 0% accuracy in zero-shot evaluation and 6.8% in 8-shot evaluation. The model was therefore fine-tuned via supervised fine-tuning (SFT).

The first step involved fine-tuning MPT-7B via SFT to obtain a dense, accurate baseline for use as a teacher during distillation. Next, the researchers applied one-shot unstructured pruning with SparseGPT at 40%, 50%, 60%, 70%, and 80% sparsity targets, uniformly across all layers. The model weights and activations were then quantized to 8-bit using SparseML. The resulting models were evaluated on the GSM8K task using the Language Model Evaluation Harness.


Accelerating Compressed Language Models

Sparse acceleration is the next step after making large language models smaller. Introducing sparsity in the model means that computations involving zero weights can be skipped at inference time. Since LLM inference is memory-bound, the sparse weights can be stored in compressed form and decompressed as needed during layer computation. A runtime such as DeepSparse, which implements sparsity-aware inference in both memory-bound and compute-bound scenarios, is an ideal fit.
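The two ideas above, compressed storage and zero skipping, can be sketched with a CSR-style compressed row. This is a toy illustration, not DeepSparse's actual kernel implementation: only the nonzero weights (values plus their column indices) are stored and touched during a dot product.

```python
# Toy CSR-style row: keep only nonzero values and their column indices.
def compress_row(dense_row):
    values = [w for w in dense_row if w != 0.0]
    indices = [i for i, w in enumerate(dense_row) if w != 0.0]
    return values, indices

def sparse_dot(values, indices, x):
    # Only len(values) multiply-adds instead of len(x): at 75% sparsity,
    # roughly 4x fewer operations and 4x less weight data to read.
    return sum(v * x[i] for v, i in zip(values, indices))

row = [0.0, 2.0, 0.0, 0.0, -1.0, 0.0, 0.0, 3.0]   # 75% zeros
x = [1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 0.5]

values, indices = compress_row(row)
dense_result = sum(w * xi for w, xi in zip(row, x))
print(sparse_dot(values, indices, x) == dense_result)  # True
```

In a memory-bound setting, reading a quarter of the weight bytes matters as much as doing a quarter of the multiplies, which is why sparsity translates into the token-per-second gains reported for CPU inference.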

Final Thoughts

The results from this paper show that sparsity can be an effective approach for accelerating LLM inference on commodity CPUs. This is critical for making LLMs accessible, especially on devices with limited memory, storage, and compute, such as mobile phones and edge devices. It also makes deploying language models cheaper, since they can run on readily available commodity CPUs.

Check out the demo of the MPT model running on CPU on Hugging Face Spaces. Interested in deploying large language models on CPU? Check out DeepSparse on GitHub or join a community of other LLM developers.