Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also makes it possible to run models on embedded devices, which sometimes only support integer data types.
The basic idea behind quantization is quite easy: going from high-precision representation (usually the regular 32-bit floating-point) for weights and activations to a lower precision data type. The most common lower precision data types are:
- float16, accumulation data type float16
- bfloat16, accumulation data type float32
- int16, accumulation data type int32
- int8, accumulation data type int32
The accumulation data type specifies the type of the result of accumulating (adding, multiplying, etc.) values of the data type in question. For example, let's consider two int8 values A = 127, B = 127, and let's define C as the sum of A and B:
C = A + B
Here the result, 254, is bigger than the biggest representable value in int8, which is 127. Hence the need for a larger precision data type to avoid a huge precision loss that would make the whole quantization process useless.
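The overflow described above can be reproduced with a minimal sketch. It assumes that `ctypes.c_int8` and `ctypes.c_int32` from the Python standard library are acceptable stand-ins for int8 and int32 accumulators; the variable names are illustrative only.

```python
import ctypes

a, b = 127, 127

# Accumulating in int8: the true sum (254) does not fit, so it wraps around.
wrapped = ctypes.c_int8(a + b).value

# Accumulating in int32: the sum fits comfortably.
wide = ctypes.c_int32(a + b).value

print(wrapped, wide)
```

With an int8 accumulator the sum wraps to a negative number, while the int32 accumulator keeps the exact value.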
The two most common quantization cases are float32 -> float16 and float32 -> int8.
Performing quantization to go from float32 to float16 is quite straightforward since both data types follow the same representation scheme. The questions to ask yourself when quantizing an operation to float16 are:
- Does my operation have a float16 implementation?
- Does my hardware support float16? For instance, Intel CPUs have been supporting float16 as a storage type, but computation is done after converting to float32. Full support will come in Cooper Lake and Sapphire Rapids.
- Is my operation sensitive to lower precision? For instance the value of epsilon in LayerNorm is usually very small (~1e-12), but the smallest positive normal value in float16 is ~6e-5, so epsilon underflows to zero and this can cause NaN issues. The same applies for big values.
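The epsilon problem can be illustrated with a short sketch. It assumes that round-tripping a value through the half-precision `e` format of Python's `struct` module is a fair proxy for a float16 cast; `to_float16` is a hypothetical helper, not part of any library.

```python
import math
import struct


def to_float16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


eps = 1e-12
eps_fp16 = to_float16(eps)       # too small for float16: underflows to zero
kept_fp16 = to_float16(1e-3)     # large enough to survive the cast

# With a zero variance, the LayerNorm denominator sqrt(var + eps) becomes 0,
# so the normalization would divide by zero.
denominator = math.sqrt(0.0 + eps_fp16)
print(eps_fp16, kept_fp16, denominator)
```

A common workaround is to keep such sensitive operations in float32, or to use a larger epsilon.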
Performing quantization to go from float32 to int8 is more tricky. Only 256 values can be represented in int8, while float32 can represent a very wide range of values. The idea is to find the best way to project our range [a, b] of float32 values to the int8 space.
Let's consider a float x in [a, b], then we can write the following quantization scheme, also called the affine quantization scheme:
x = S * (x_q - Z)
where:
- x_q is the quantized int8 value associated to x
- S and Z are the quantization parameters
  - S is the scale, and is a positive float32
  - Z is called the zero-point, it is the int8 value corresponding to the value 0 in the float32 realm. This is important to be able to represent exactly the value 0 because it is used everywhere throughout machine learning models.
The quantized value x_q of x in [a, b] can be computed as follows:
x_q = round(x/S + Z)
And float32 values outside of the [a, b] range are clipped to the closest representable value, so for any floating-point number x:
x_q = clip(round(x/S + Z), round(a/S + Z), round(b/S + Z))
Usually round(a/S + Z) corresponds to the smallest representable value in the considered data type, and round(b/S + Z) to the biggest one. But this can vary, for instance when using a symmetric quantization scheme as you will see in the next paragraph.
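The formulas above can be sketched in plain Python. The way S and Z are derived from [a, b] here (S = (b - a) / 255 and Z = round(-128 - a/S)) is one common choice and an assumption of this sketch, as are the function names. Python's built-in `round` rounds halves to the nearest even integer, which may differ from other rounding modes.

```python
QMIN, QMAX = -128, 127  # signed int8 range


def affine_params(a: float, b: float):
    """Derive (S, Z) so that [a, b] maps onto [QMIN, QMAX] (assumed formula)."""
    scale = (b - a) / (QMAX - QMIN)
    zero_point = round(QMIN - a / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    """x_q = clip(round(x/S + Z), QMIN, QMAX)."""
    return max(QMIN, min(QMAX, round(x / scale + zero_point)))


def dequantize(x_q: int, scale: float, zero_point: int) -> float:
    """x = S * (x_q - Z)."""
    return scale * (x_q - zero_point)


S, Z = affine_params(-1.0, 3.0)
print(quantize(0.0, S, Z), dequantize(quantize(0.0, S, Z), S, Z))  # zero is exact
print(quantize(10.0, S, Z))   # out of range: clipped to QMAX
print(quantize(-5.0, S, Z))   # out of range: clipped to QMIN
```

Note that 0.0 round-trips exactly because it maps onto the zero-point, while values outside [a, b] saturate at the ends of the int8 range.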
The equation above is called the affine quantization scheme because the mapping from [a, b] to int8 is an affine one.
A common special case of this scheme is the symmetric quantization scheme, where we consider a symmetric range of float values [-a, a]. In this case the integer space is usually [-127, 127], meaning that the value -128 is left out of the regular [-128, 127] signed int8 range. The reason is that having both ranges symmetric makes it possible to have Z = 0. While one value out of the 256 representable values is lost, it can provide a speedup since a lot of addition operations can be skipped.
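A minimal sketch of the symmetric case, assuming the scale is chosen as S = a / 127 for a float range [-a, a] (the function name and the `max_abs` parameter are illustrative):

```python
QMAX = 127  # the integer range is [-127, 127]; -128 is left unused


def symmetric_quantize(x: float, max_abs: float):
    """Quantize x with Z = 0, so no zero-point addition is needed."""
    scale = max_abs / QMAX
    x_q = max(-QMAX, min(QMAX, round(x / scale)))
    return x_q, scale


q_max, scale = symmetric_quantize(2.0, max_abs=2.0)
q_zero, _ = symmetric_quantize(0.0, max_abs=2.0)
q_clip, _ = symmetric_quantize(-3.0, max_abs=2.0)
print(q_max, q_zero, q_clip)
```

Because Z = 0, dequantization reduces to a single multiplication, x = S * x_q.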
Note: To learn how the quantization parameters S and Z are computed, you can read the Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference paper, or Lei Mao's blog post on the subject.
Depending on the accuracy / latency trade-off you are targeting, you can play with the granularity of the quantization parameters:
- Quantization parameters can be computed on a per-tensor basis, meaning that one pair of (S, Z) will be used per tensor.
- Quantization parameters can be computed on a per-channel basis, meaning that it is possible to store a pair of (S, Z) per element along one of the dimensions of a tensor. For example for a tensor of shape [N, C, H, W], having per-channel quantization parameters for the second dimension would result in having C pairs of (S, Z). While this can give a better accuracy, it requires more memory.
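The difference between the two granularities can be sketched on a small 2D weight matrix. This assumes symmetric quantization (Z = 0) and treats each row as a channel; the helper names and the example values are made up for illustration.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest absolute value onto qmax."""
    return max(abs(v) for v in values) / qmax


def roundtrip(value, scale):
    """Quantize then dequantize with Z = 0."""
    return scale * round(value / scale)


# Two channels (rows) with very different magnitudes.
weights = [
    [0.5, -1.0, 0.25],
    [0.01, -0.02, 0.005],
]

# Per-tensor: a single scale shared by every value.
per_tensor_scale = symmetric_scale([w for row in weights for w in row])

# Per-channel: one scale per row, so small channels keep their resolution.
per_channel_scales = [symmetric_scale(row) for row in weights]

err_tensor = abs(roundtrip(0.005, per_tensor_scale) - 0.005)
err_channel = abs(roundtrip(0.005, per_channel_scales[1]) - 0.005)
print(err_tensor, err_channel)
```

The small-magnitude channel is reconstructed much more accurately with its own scale, at the cost of storing one scale per channel.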
The section above described how quantization from float32 to int8 works, but one question remains: how is the [a, b] range of float32 values determined? That is where calibration comes into play.
Calibration is the step during quantization where the float32 ranges are computed. For weights it is quite easy since the actual range is known at quantization-time. But it is less clear for activations, and different approaches exist:
- Post training dynamic quantization: the range for each activation is computed on the fly at runtime. While this gives great results without too much work, it can be a bit slower than static quantization because of the overhead introduced by computing the range each time. It is also not an option on certain hardware.
- Post training static quantization: the range for each activation is computed in advance at quantization-time, typically by passing representative data through the model and recording the activation values. In practice, the steps are:
  - Observers are put on activations to record their values.
  - A certain number of forward passes on a calibration dataset is done (around 200 examples is enough).
  - The ranges for each computation are computed according to some calibration technique.
- Quantization aware training: the range for each activation is computed at training-time, following the same idea as post training static quantization. But “fake quantize” operators are used instead of observers: they record values just as observers do, but they also simulate the error induced by quantization to let the model adapt to it.
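The idea behind a “fake quantize” operator can be sketched as a quantize-then-dequantize round trip that stays in floating point. This is a simplified illustration with a hypothetical `fake_quantize` function; real implementations also record ranges and handle gradients, which is omitted here.

```python
def fake_quantize(x: float, scale: float, zero_point: int,
                  qmin: int = -128, qmax: int = 127) -> float:
    """Return the float value x would have after an int8 round trip."""
    x_q = max(qmin, min(qmax, round(x / scale + zero_point)))
    return scale * (x_q - zero_point)


# The output is still a float, but it carries the rounding and clipping error.
rounded = fake_quantize(0.234, scale=0.1, zero_point=0)
clipped = fake_quantize(100.0, scale=0.1, zero_point=0)
print(rounded, clipped)
```

During training, the model sees these perturbed values and can adapt its weights to them.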
For both post training static quantization and quantization aware training, it is necessary to choose a calibration technique. The most common are:
- Min-max: the computed range is [min observed value, max observed value], this works well with weights.
- Moving average min-max: the computed range is [moving average min observed value, moving average max observed value], this works well with activations.
- Histogram: records a histogram of values along with min and max values, then chooses according to some criterion:
  - Entropy: the range is computed as the one minimizing the entropy-based error (typically the KL divergence) between the distributions of the full-precision and the quantized data.
  - Mean Square Error: the range is computed as the one minimizing the mean square error between the full-precision and the quantized data.
  - Percentile: the range is computed using a given percentile value p on the observed values. The idea is to try to have p% of the observed values in the computed range. While this is possible when doing affine quantization, it is not always possible to exactly match that when doing symmetric quantization. You can check how it is done in ONNX Runtime for more details.
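The first two techniques can be sketched as small observer classes. The class names, the `observe` method and the momentum value of 0.9 are assumptions made for this illustration and do not mirror any specific library.

```python
class MinMaxObserver:
    """Tracks the global min and max over everything it has seen."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))


class MovingAverageMinMaxObserver:
    """Tracks an exponential moving average of per-batch min and max."""

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.min_val = None
        self.max_val = None

    def observe(self, values):
        batch_min, batch_max = min(values), max(values)
        if self.min_val is None:
            self.min_val, self.max_val = batch_min, batch_max
        else:
            m = self.momentum
            self.min_val = m * self.min_val + (1 - m) * batch_min
            self.max_val = m * self.max_val + (1 - m) * batch_max


# Three calibration batches; the second one contains an outlier (8.0).
batches = [[-1.0, 0.5, 2.0], [-0.5, 0.1, 8.0], [-1.5, 0.3, 2.5]]
minmax, moving = MinMaxObserver(), MovingAverageMinMaxObserver()
for batch in batches:
    minmax.observe(batch)
    moving.observe(batch)

print((minmax.min_val, minmax.max_val), (moving.min_val, moving.max_val))
```

The moving average dampens the effect of the outlier, which is one reason it tends to suit activations better than plain min-max.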
To effectively quantize a model to int8, the steps to follow are:
1. Choose which operators to quantize. Good operators to quantize are the ones dominating in terms of computation time, for instance linear projections and matrix multiplications.
2. Try post-training dynamic quantization. If it is fast enough, stop here; otherwise continue to step 3.
3. Try post-training static quantization, which can be faster than dynamic quantization but often with a drop in terms of accuracy. Apply observers to your model in places where you want to quantize.
4. Choose a calibration technique and perform it.
5. Convert the model to its quantized form: the observers are removed and the float32 operators are converted to their int8 counterparts.
6. Evaluate the quantized model: is the accuracy good enough? If yes, stop here, otherwise start again at step 3 but with quantization aware training this time.
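The static quantization flow described above (observe, calibrate, convert, evaluate) can be sketched on a toy linear layer. This is an illustration under simplifying assumptions: symmetric per-tensor quantization, min-max calibration, pure Python lists instead of tensors, and made-up weights and inputs.

```python
def symmetric_scale(values, qmax=127):
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, qmax=127):
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


weights = [[0.2, -0.5, 0.1], [0.4, 0.3, -0.7]]
calibration_inputs = [[1.0, -2.0, 0.5], [0.3, 1.5, -1.0], [-0.8, 0.2, 2.0]]

# Observe and calibrate: record the activation range on the calibration data.
act_scale = symmetric_scale([v for x in calibration_inputs for v in x])
w_scale = symmetric_scale([w for row in weights for w in row])

# Convert: weights are quantized once, ahead of time.
q_weights = [quantize(row, w_scale) for row in weights]


def quantized_linear(x):
    q_x = quantize(x, act_scale)
    # Integer multiply-accumulate (would be an int32 accumulator on hardware).
    acc = [sum(qw * qx for qw, qx in zip(row, q_x)) for row in q_weights]
    # Rescale the integer result back to floating point.
    return [a * act_scale * w_scale for a in acc]


def float_linear(x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# Evaluate: compare against the float32 reference on a new input.
x = [0.9, -1.2, 0.4]
max_error = max(abs(q - f) for q, f in zip(quantized_linear(x), float_linear(x)))
print(max_error)
```

If the error measured in the last step is too large for your use case, that is the signal to revisit the calibration technique or move to quantization aware training.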
🤗 Optimum provides APIs to perform quantization using different tools for different targets:
- The optimum.onnxruntime package allows you to quantize and run ONNX models using the ONNX Runtime tool.
- The optimum.intel package enables you to quantize 🤗 Transformers models while respecting accuracy and latency constraints.
- The optimum.fx package provides wrappers around the PyTorch quantization functions to allow graph-mode quantization of 🤗 Transformers models in PyTorch. This is a lower-level API compared to the two mentioned above, giving more flexibility, but requiring more work on your end.
- The optimum.gptq package allows you to quantize and run LLMs with GPTQ.
This section is not essential for understanding the rest. It briefly explains how numbers are represented in computers. Since quantization is about going from one representation to another, it can be useful to have some basics, but it is definitely not mandatory.
The most fundamental unit of representation for computers is the bit. Everything in computers is represented as a sequence of bits, including numbers. But the representation varies depending on whether the numbers in question are integers or real numbers.
Integers are usually represented with the following bit lengths: 8, 16, 32, 64. When representing integers, two cases are considered:
- Unsigned (non-negative) integers: they are simply represented as a sequence of bits. Each bit corresponds to a power of two (from 0 to n-1 where n is the bit-length), and the resulting number is the sum of those powers of two.
Example: 19 is represented as an unsigned int8 as 00010011 because:
19 = 0 x 2^7 + 0 x 2^6 + 0 x 2^5 + 1 x 2^4 + 0 x 2^3 + 0 x 2^2 + 1 x 2^1 + 1 x 2^0
- Signed integers: it is less straightforward to represent signed integers, and multiple approaches exist, the most common being the two's complement. For more information, you can check the Wikipedia page on the subject.
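Both representations can be checked with a few lines of Python. This sketch uses only built-ins; masking with 0xFF is one way to obtain the 8-bit two's complement pattern of a negative number.

```python
# Unsigned: 19 as an 8-bit pattern.
unsigned_bits = format(19, "08b")

# Rebuild the number as a sum of powers of two, one per set bit.
rebuilt = sum(2 ** i for i, bit in enumerate(reversed(unsigned_bits)) if bit == "1")

# Signed (two's complement): -19 shares its bit pattern with 256 - 19 = 237.
signed_bits = format(-19 & 0xFF, "08b")

print(unsigned_bits, rebuilt, signed_bits)
```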
Real numbers are usually represented with the following bit lengths: 16, 32, 64.
The two main ways of representing real numbers are:
- Fixed-point: there is a fixed number of digits reserved for representing the integer part and the fractional part.
- Floating-point: the number of digits for representing the integer and the fractional parts can vary.
The floating-point representation can represent bigger ranges of values, and this is the one we will be focusing on since it is the most commonly used. There are three components in the floating-point representation:
- The sign bit: this is the bit specifying the sign of the number.
- The exponent part
  - 5 bits in float16
  - 8 bits in bfloat16
  - 8 bits in float32
  - 11 bits in float64
- The mantissa
  - 11 bits in float16 (10 explicitly stored)
  - 8 bits in bfloat16 (7 explicitly stored)
  - 24 bits in float32 (23 explicitly stored)
  - 53 bits in float64 (52 explicitly stored)
For more information on the bits allocation for each data type, check the nice illustration on the Wikipedia page about the bfloat16 floating-point format.
For a real number x we have:
x = sign * mantissa * (2^exponent)
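The three components can be extracted from a float32 with the standard `struct` module. This sketch assumes a normal (non-zero, non-subnormal, finite) number, for which the mantissa has an implicit leading 1 and the stored exponent is biased by 127; `decompose_float32` is a hypothetical helper.

```python
import struct


def decompose_float32(value: float):
    """Split a normal float32 into (sign, mantissa, exponent)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = -1 if bits >> 31 else 1
    exponent = ((bits >> 23) & 0xFF) - 127      # remove the exponent bias
    mantissa = 1 + (bits & 0x7FFFFF) / 2 ** 23  # 23 stored bits + implicit 1
    return sign, mantissa, exponent


sign, mantissa, exponent = decompose_float32(-6.5)
print(sign, mantissa, exponent, sign * mantissa * 2 ** exponent)
```

Multiplying the three parts back together recovers the original value, matching the formula above.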
- The Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference paper
- The Basics of Quantization in Machine Learning (ML) for Beginners blog post
- The How to accelerate and compress neural networks with quantization blog post
- The Wikipedia pages on integers representation here and here
- The Wikipedia pages on