Optimizing Stable Diffusion for Intel CPUs with NNCF and 🤗 Optimum

Latent Diffusion models are game changers when it comes to solving text-to-image generation problems. Stable Diffusion is one of the most famous examples that got wide adoption in the community and industry. The idea behind the Stable Diffusion model is simple and compelling: you generate an image from a noise vector in multiple small steps refining the noise to a latent image representation.

However, such an approach inevitably increases the overall inference time and causes a poor user experience when deployed on a client machine. One can note that powerful GPUs can help here as usual, and this is true, but the cost of this increases dramatically. As a reference, in H1'23, the price of a powerful CPU r6i.2xlarge instance with 8 vCPUs and 64 GB RAM is $0.504 per hour, while a similar g4dn.2xlarge instance with NVIDIA T4 with 16 GB VRAM costs $0.75 per hour, which is 1.5x more..

This makes image-generation services quite expensive to their owners and users. The problem is even more acute in the client applications that run on the user's side. There can be no GPU at all! This makes the deployment of the Stable Diffusion pipeline a challenging problem.

Through the past five years, OpenVINO Toolkit encapsulated many features for high-performance inference. Initially designed for Computer Vision models, it still dominates in this domain showing best-in-class inference performance for many contemporary models, including Stable Diffusion. However, optimizing Stable Diffusion models for resource-constraint applications requires going far beyond just runtime optimizations. And this is where model optimization capabilities from OpenVINO Neural Network Compression Framework (NNCF) come into play.

In this blog post, we will outline the problems of optimizing Stable Diffusion models and propose a workflow that substantially reduces the latency of such models when running on a resource-constrained HW such as CPU. In particular, we achieved 5.1x inference acceleration and 4x model footprint reduction compared to PyTorch.

Stable Diffusion optimization

In the Stable Diffusion pipeline, the UNet model is computationally the most expensive to run. Thus, optimizing just one model brings substantial benefits in terms of inference speed.

However, it turns out that the traditional model optimization methods, such as post-training 8-bit quantization, do not work for this model. There are two main reasons for that. First, pixel-level prediction models, such as semantic segmentation, super-resolution, etc., are one of the most complicated in terms of model optimization because of the complexity of the task, so tweaking model parameters and the structure breaks the results in numerous ways. The second reason is that the model has a lower level of redundancy because it accommodates a lot of information while being trained on hundreds of millions of samples. That is why researchers have to employ more sophisticated quantization methods to preserve the accuracy after optimization. For example, Qualcomm used the layer-wise Knowledge Distillation method (AdaRound) to quantize Stable Diffusion models. It means that model tuning after quantization is required, anyway. If so, why not just use Quantization-Aware Training (QAT) which can tune the model and quantization parameters simultaneously in the same way the source model is trained? Thus, we tried this approach in our work using NNCF, OpenVINO, and Diffusers and coupled it with Token Merging.

Optimization workflow

We usually start the optimization of a model after it's trained. Here, we start from a model fine-tuned on the Pokemons dataset containing images of Pokemons and their text descriptions.

We used the text-to-image fine-tuning example for Stable Diffusion from the Diffusers and integrated QAT from NNCF into the following training script. We also changed the loss function to incorporate knowledge distillation from the source model that acts as a teacher in this process while the actual model being trained acts as a student. This approach is different from the classical knowledge distillation method, where the trained teacher model is distilled into a smaller student model. In our case, knowledge distillation is used as an auxiliary method that helps improve the final accuracy of the optimizing model. We also use the Exponential Moving Average (EMA) method for model parameters excluding quantizers which allows us to make the training process more stable. We tune the model for 4096 iterations only.

With some tricks, such as gradient checkpointing and keeping the EMA model in RAM instead of VRAM, we can run the optimization process using one GPU with 24 GB of VRAM. The whole optimization takes less than a day using one GPU!

Going beyond Quantization-Aware Training

Quantization alone can bring significant enhancements by reducing model footprint, load time, memory consumption, and inference latency. But the great thing about quantization is that it can be applied along with other optimization methods leading to a cumulative speedup.

Recently, Facebook Research introduced a Token Merging method for Vision Transformer models. The essence of the method is that it merges redundant tokens with important ones using one of the available strategies (averaging, taking max values, etc.). This is done before the self-attention block, which is the most computationally demanding part of Transformer models. Therefore, reducing the token dimension reduces the overall computation time in the self-attention blocks. This method has also been adapted for Stable Diffusion models and has shown promising results when optimizing Stable Diffusion pipelines for high-resolution image synthesis running on GPUs.

We modified the Token Merging method to be compliant with OpenVINO and stacked it with 8-bit quantization when applied to the Attention UNet model. This also involves all the mentioned techniques including Knowledge Distillation, etc. As for quantization, it requires fine-tuning to be applied to restore the accuracy. We also start optimization and fine-tuning from the model trained on the Pokemons dataset. The figure below shows an overall optimization workflow.


The resultant model is highly beneficial when running inference on devices with limited computational resources, such as client or edge CPUs. As it was mentioned, stacking Token Merging with quantization leads to an additional reduction in the inference latency.

Image 1
PyTorch FP32, Inference Speed: 230.5 seconds, Memory Footprint: 3.44 GB
Image 2
OpenVINO FP32, Inference Speed: 120 seconds (1.9x), Memory Footprint: 3.44 GB
Image 3
OpenVINO 8-bit, Inference Speed: 59 seconds (3.9x), Memory Footprint: 0.86 GB (0.25x)
Image 4
ToMe + OpenVINO 8-bit, Inference Speed: 44.6 seconds (5.1x), Memory Footprint: 0.86 GB (0.25x)

Results of image generation demo using different optimized models. Input prompt is “cartoon bird”, seed is 42. The models are with OpenVINO 2022.3 in Hugging Face Spaces using a “CPU upgrade” instance which utilizes 3rd Generation Intel® Xeon® Scalable Processors with Intel® Deep Learning Boost technology.


We used the disclosed optimization workflows to get two types of optimized models, 8-bit quantized and quantized with Token Merging, and compare them to the PyTorch baseline. We also converted the baseline to vanilla OpenVINO floating-point (FP32) model for the comprehensive comparison.

The picture above shows the results of image generation and some model characteristics. As you can see, just conversion to OpenVINO brings a significant decrease in the inference latency ( 1.9x ). Applying 8-bit quantization boosts inference speed further leading to 3.9x speedup compared to PyTorch. Another benefit of quantization is a significant reduction of model footprint, 0.25x of PyTorch checkpoint, which also improves the model load time. Applying Token Merging (ToME) (with a merging ratio of 0.4 ) on top of quantization brings 5.1x performance speedup while keeping the footprint at the same level. We didn't provide a thorough analysis of the visual quality of the optimized models, but, as you can see, the results are quite solid.

Below we show to perform inference with the final pipeline optimized to run on Intel CPUs:

from optimum.intel.openvino import OVStableDiffusionPipeline

# Load and compile the pipeline for performance.
name = "OpenVINO/stable-diffusion-pokemons-tome-quantized-aggressive"
pipe = OVStableDiffusionPipeline.from_pretrained(name, compile=False)
pipe.reshape(batch_size=1, height=512, width=512, num_images_per_prompt=1)

# Generate an image.
prompt = "a drawing of a green pokemon with red eyes"
output = pipe(prompt, num_inference_steps=50, output_type="pil").images[0]

You can find the training and quantization code in the Hugging Face Optimum Intel library. The notebook that demonstrates the difference between optimized and original models is available here. You can also find many models on the Hugging Face Hub under the OpenVINO organization. In addition, we have created a demo on Hugging Face Spaces that is being run on r6id.2xlarge instance with 3rd Generation Intel Xeon Scalable processor.

What about the general-purpose Stable Diffusion model?

As we showed with the Pokemon image generation task, it is possible to achieve a high level of optimization of the Stable Diffusion pipeline when using a relatively small amount of training resources. At the same time, it is well-known that training a general-purpose Stable Diffusion model is an expensive task. However, with enough budget and HW resources, it is possible to optimize the general-purpose model using the described approach and tune it to produce high-quality images. The only caveat we have is related to the token merging method that reduces the model capacity substantially. The rule of thumb here is the more complicated the dataset you have for the training, the less merging ratio you should use during the optimization.

If you enjoyed reading this post, you might also be interested in checking out this post that discusses other complementary approaches to optimize the performance of Stable Diffusion on 4th generation Intel Xeon CPUs.