Community Computer Vision Course documentation

Introduction to model optimization for deployment

Have you ever been unsure what to do after the model training stage? If so, this chapter will help you. In general, the step after training a computer vision model is to deploy it so that other people can use it. However, once the model has been deployed in production, several problems can arise: the model is too large, predictions take too long, or the device runs out of memory. These problems happen because we usually deploy models on devices with lower specifications than the hardware used for training. To overcome them, we can carry out an additional stage before deployment: model optimization.

What is model optimization?

Model optimization is the process of modifying a trained model to make it more efficient. These modifications are crucial because, in most cases, the hardware used for training differs greatly from the hardware used for inference, and the inference hardware usually has lower specifications. For example, we may train on high-performance GPUs, while inference runs on edge devices (e.g., microcomputers, mobile devices, or IoT devices). Optimizing the model helps it run smoothly on these lower-specification devices.

Why is it important for deployment in computer vision?

As we have seen, optimizing the model before deployment is important, but why? Several factors make it so:

  1. Resource limitations: Computer vision models often require substantial computational resources such as memory, CPU, and GPU. This is a problem when we want to deploy the model on devices with limited resources, such as mobile phones, embedded systems, or edge devices. Optimization techniques can reduce model size and computational cost so that the model can be deployed on those platforms.
  2. Latency requirements: Many computer vision applications, such as self-driving cars and augmented reality, require real-time responses. This means the model must process data and generate results quickly. Optimization can significantly increase a model's inference speed and help it meet latency constraints.
  3. Power consumption: Battery-powered devices, such as drones and wearables, require models that use power efficiently. Optimization techniques can reduce battery consumption, which is often driven up by overly large models.
  4. Hardware compatibility: Different hardware has its own capabilities and limitations. Some optimization techniques target specific hardware, and applying them helps us work within that hardware's limitations.
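
To make the resource-limitation point concrete, the sketch below estimates how much storage a model's weights need in different numeric formats. The function name and the 25-million-parameter figure are assumptions chosen for illustration, and the estimate ignores activations and file-format overhead.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_footprint_mb(num_params: int, dtype: str) -> float:
    """Return the approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    num_params = 25_000_000  # hypothetical model size, chosen for illustration
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_footprint_mb(num_params, dtype):.0f} MB")
```

Under these assumptions, halving the number of bytes per weight halves the storage needed, which is why lower-precision formats matter on memory-constrained devices.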

Different types of model optimization techniques

There are several model optimization techniques, which will be explained in the next section. This section briefly describes a few of them:

  1. Pruning: Pruning eliminates redundant or unimportant connections in the model to reduce its size and complexity.

     Figure: Pruning

  2. Quantization: Quantization converts model weights from high-precision formats (e.g., 32-bit floating-point) to lower-precision formats (e.g., 16-bit floating-point or 8-bit integers) to reduce the memory footprint and increase inference speed.
  3. Knowledge distillation: Knowledge distillation transfers knowledge from a larger, more complex model (the teacher) to a smaller model (the student) by training the student to mimic the teacher's behavior.

     Figure: Knowledge Distillation

  4. Low-rank approximation: Approximates large matrices with products of smaller ones, reducing memory consumption and computational cost.
  5. Model compression with hardware accelerators: This applies techniques such as pruning and quantization in a way tailored to specific hardware, such as NVIDIA GPUs and Intel hardware.
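
The first four techniques can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not how any particular library implements them: pruning is done by weight magnitude, quantization uses a single symmetric scale for all weights, the distillation loss covers only the soft-target term, and the low-rank function only counts stored parameters. All function names and sample values are hypothetical.

```python
import math


def magnitude_prune(weights, sparsity):
    """Pruning: zero out the given fraction of weights with the smallest magnitude."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Quantization: map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the quantized integers."""
    return [q * scale for q in q_weights]


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Knowledge distillation: KL divergence from softened teacher to student outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def low_rank_param_count(rows, cols, rank):
    """Low-rank approximation: values stored by factors (rows x rank) and (rank x cols)."""
    return rank * (rows + cols)


if __name__ == "__main__":
    weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]  # made-up values
    print("pruned:", magnitude_prune(weights, sparsity=0.5))
    q, scale = quantize_int8(weights)
    print("quantized:", q, "restored:", dequantize(q, scale))
    print("distillation loss:", distillation_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
    print("full vs. rank-64 params:", 1024 * 1024, low_rank_param_count(1024, 1024, 64))
```

In practice, these steps are usually followed by fine-tuning or calibration to recover accuracy, and the loss above is typically combined with a standard loss on the true labels when training the student.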

Trade-offs between accuracy, performance, and resource usage

A trade-off exists between accuracy, performance, and resource usage when deploying a model. We have to decide which of the three to prioritize so that the model works as well as possible for the use case at hand.

  1. Accuracy is the model's ability to predict correctly. High accuracy is desirable in every application, but it usually comes with higher computational cost and resource usage. Complex, highly accurate models usually require a lot of memory, which limits their deployment on resource-constrained devices.
  2. Performance is the model's speed and efficiency (latency). This matters because the model should make predictions quickly, sometimes in real time. However, optimizing for performance often comes at some cost in accuracy.
  3. Resource usage refers to the computational resources needed to run inference with the model, such as CPU, memory, and storage. Efficient resource usage is crucial if we want to deploy models on constrained devices such as smartphones or IoT devices.
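
Latency is the easiest of the three to measure directly. The sketch below times repeated calls to an inference function and reports the median. Here `dummy_predict` is a hypothetical stand-in for a real model call, and the warm-up and run counts are arbitrary choices.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=3, runs=20):
    """Return the median wall-clock time of predict(sample) in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


def dummy_predict(pixels):
    """Hypothetical stand-in for a model's inference call."""
    return sum(p * 0.5 for p in pixels)


if __name__ == "__main__":
    fake_image = [0.1] * 10_000  # placeholder input, not a real image
    print(f"median latency: {measure_latency_ms(dummy_predict, fake_image):.3f} ms")
```

The median is used instead of the mean so that an occasional slow run has less influence on the result. For a meaningful number, the measurement should be taken on the target device.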

The images below compare common computer vision models in terms of model size, accuracy, and latency. Bigger models tend to have higher accuracy, but they need more time for inference and have larger file sizes.

Figure: Model size vs. accuracy

Figure: Accuracy vs. latency

These are the three factors we must weigh when deciding where to focus for the model we trained. For example, focusing on high accuracy will result in a model that is slower at inference or requires extensive resources. To address this, we apply one or more of the optimization methods described above so that the resulting model balances the trade-off among the three.
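
One simple way to act on this trade-off is to fix budgets for latency and size, then pick the most accurate model that fits. The sketch below assumes that policy. The candidate names and numbers are made up for illustration and are not measurements of any real model.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    accuracy: float    # fraction of correct predictions, 0..1
    latency_ms: float  # time per prediction
    size_mb: float     # file size on disk


def pick_model(candidates, max_latency_ms, max_size_mb) -> Optional[Candidate]:
    """Return the most accurate candidate that fits both budgets, or None."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms and c.size_mb <= max_size_mb
    ]
    return max(feasible, key=lambda c: c.accuracy, default=None)


# Made-up numbers for illustration only; they are not measurements.
CANDIDATES = [
    Candidate("large", accuracy=0.85, latency_ms=120.0, size_mb=300.0),
    Candidate("medium", accuracy=0.80, latency_ms=40.0, size_mb=90.0),
    Candidate("small", accuracy=0.72, latency_ms=10.0, size_mb=15.0),
]

if __name__ == "__main__":
    print(pick_model(CANDIDATES, max_latency_ms=50.0, max_size_mb=100.0))
```

With a 50 ms and 100 MB budget, the hypothetical "large" model is excluded despite its higher accuracy. If no candidate fits, the function returns `None`, which signals that further optimization or a looser budget is needed.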
