Quantization
Quantization is a technique to reduce the computational and memory costs of running inference by representing the
weights and activations with low-precision data types like 8-bit integer (int8
) instead of the usual 32-bit floating
point (float32
).
Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows to run models on embedded devices, which sometimes only support integer data types.
Theory
The basic idea behind quantization is quite easy: going from high-precision representation (usually the regular 32-bit floating-point) for weights and activations to a lower precision data type. The most common lower precision data types are:
float16
, accumulation data typefloat16
bfloat16
, accumulation data typefloat32
int16
, accumulation data typeint32
int8
, accumulation data typeint32
The accumulation data type specifies the type of the result of accumulating (adding, multiplying, etc) values of the
data type in question. For example, let’s consider two int8
values A = 127
, B = 127
, and let’s define C
as the
sum of A
and B
:
C = A + B
Here the result is much bigger than the biggest representable value in int8
, which is 127
. Hence the need for a larger
precision data type to avoid a huge precision loss that would make the whole quantization process useless.
Quantization
The two most common quantization cases are float32 -> float16
and float32 -> int8
.
Quantization to float16
Performing quantization to go from float32
to float16
is quite straightforward since both data types follow the same
representation scheme. The questions to ask yourself when quantizing an operation to float16
are:
- Does my operation have a
float16
implementation? - Does my hardware suport
float16
? For instance, Intel CPUs have been supportingfloat16
as a storage type, but computation is done after converting tofloat32
. Full support will come in Cooper Lake and Sapphire Rapids. - Is my operation sensitive to lower precision?
For instance the value of epsilon in
LayerNorm
is usually very small (~1e-12
), but the smallest representable value infloat16
is ~6e-5
, this can causeNaN
issues. The same applies for big values.
Quantization to int8
Performing quantization to go from float32
to int8
is more tricky. Only 256 values can be represented in int8
,
while float32
can represent a very wide range of values. The idea is to find the best way to project our range [a, b]
of float32
values to the int8
space.
Let’s consider a float x
in [a, b]
, then we can write the following quantization scheme, also called the affine
quantization scheme:
x = S * (x_q - Z)
where:
x_q
is the quantizedint8
value associated tox
S
andZ
are the quantization parametersS
is the scale, and is a positivefloat32
Z
is called the zero-point, it is theint8
value corresponding to the value0
in thefloat32
realm. This is important to be able to represent exactly the value0
because it is used everywhere throughout machine learning models.
The quantized value x_q
of x
in [a, b]
can be computed as follows:
x_q = round(x/S + Z)
And float32
values outside of the [a, b]
range are clipped to the closest representable value, so for any
floating-point number x
:
x_q = clip(round(x/S + Z), round(a/S + Z), round(b/S + Z))
Usually round(a/S + Z)
corresponds to the smallest representable value in the considered data type, and round(b/S + Z)
to the biggest one. But this can vary, for instance when using a symmetric quantization scheme as you will see in the next
paragraph.
Symmetric and affine quantization schemes
The equation above is called the affine quantization sheme because the mapping from [a, b]
to int8
is an affine one.
A common special case of this scheme is the symmetric quantization scheme, where we consider a symmetric range of float values [-a, a]
.
In this case the integer space is usally [-127, 127]
, meaning that the -128
is opted out of the regular [-128, 127]
signed int8
range.
The reason being that having both ranges symmetric allows to have Z = 0
. While one value out of the 256 representable
values is lost, it can provide a speedup since a lot of addition operations can be skipped.
Note: To learn how the quantization parameters S
and Z
are computed, you can read the
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
paper, or Lei Mao’s blog post on the subject.
Per-tensor and per-channel quantization
Depending on the accuracy / latency trade-off you are targetting you can play with the granularity of the quantization parameters:
- Quantization parameters can be computed on a per-tensor basis, meaning that one pair of
(S, Z)
will be used per tensor. - Quantization parameters can be computed on a per-channel basis, meaning that it is possible to store a pair of
(S, Z)
per element along one of the dimensions of a tensor. For example for a tensor of shape[N, C, H, W]
, having per-channel quantization parameters for the second dimension would result in havingC
pairs of(S, Z)
. While this can give a better accuracy, it requires more memory.
Calibration
The section above described how quantization from float32
to int8
works, but one question
remains: how is the [a, b]
range of float32
values determined? That is where calibration comes in to play.
Calibration is the step during quantization where the float32
ranges are computed. For weights it is quite easy since
the actual range is known at quantization-time. But it is less clear for activations, and different approaches exist:
- Post training dynamic quantization: the range for each activation is computed on the fly at runtime. While this gives great results without too much work, it can be a bit slower than static quantization because of the overhead introduced by computing the range each time. It is also not an option on certain hardware.
- Post training static quantization: the range for each activation is computed in advance at quantization-time,
typically by passing representative data through the model and recording the activation values. In practice, the steps are:
- Observers are put on activations to record their values.
- A certain number of forward passes on a calibration dataset is done (around
200
examples is enough). - The ranges for each computation are computed according to some calibration technique.
- Quantization aware training: the range for each activation is computed at training-time, following the same idea than post training static quantization. But “fake quantize” operators are used instead of observers: they record values just as observers do, but they also simulate the error induced by quantization to let the model adapt to it.
For both post training static quantization and quantization aware training, it is necessary to define calibration techniques, the most common are:
- Min-max: the computed range is
[min observed value, max observed value]
, this works well with weights. - Moving average min-max: the computed range is
[moving average min observed value, moving average max observed value]
, this works well with activations. - Histogram: records a histogram of values along with min and max values, then chooses according to some criterion:
- Entropy: the range is computed as the one minimizing the error between the full-precision and the quantized data.
- Mean Square Error: the range is computed as the one minimizing the mean square error between the full-precision and the quantized data.
- Percentile: the range is computed using a given percentile value
p
on the observed values. The idea is to try to havep%
of the observed values in the computed range. While this is possible when doing affine quantization, it is not always possible to exactly match that when doing symmetric quantization. You can check how it is done in ONNX Runtime for more details.
Pratical steps to follow to quantize a model to int8
To effectively quantize a model to int8
, the steps to follow are:
- Choose which operators to quantize. Good operators to quantize are the one dominating it terms of computation time, for instance linear projections and matrix multiplications.
- Try post-training dynamic quantization, if it is fast enough stop here, otherwise continue to step 3.
- Try post-training static quantization which can be faster than dynamic quantization but often with a drop in terms of accuracy. Apply observers to your models in places where you want to quantize.
- Choose a calibration technique and perform it.
- Convert the model to its quantized form: the observers are removed and the
float32
operators are converted to theirint8
coutnerparts. - Evaluate the quantized model: is the accuracy good enough? If yes, stop here, otherwise start again at step 3 but with quantization aware training this time.
Supported tools to perform quantization in 🤗 Optimum
🤗 Optimum provides APIs to perform quantization using different tools for different targets:
- The
optimum.onnxruntime
package allows to quantize and run ONNX models using the ONNX Runtime tool. - The
optimum.intel
package enables to quantize 🤗 Transformers models while respecting accuracy and latency constraints. - The
optimum.fx
package provides wrappers around the PyTorch quantization functions to allow graph-mode quantization of 🤗 Transformers models in PyTorch. This is a lower-level API compared to the two mentioned above, giving more flexibility, but requiring more work on your end. - The
optimum.gptq
package allows to quantize and run LLM models with GPTQ.
Going further: How do machines represent numbers?
The section is not fundamental to understand the rest. It explains in brief how numbers are represented in computers. Since quantization is about going from one representation to another, it can be useful to have some basics, but it is definitely not mandatory.
The most fundamental unit of representation for computers is the bit. Everything in computers is represented as a sequence of bits, including numbers. But the representation varies whether the numbers in question are integers or real numbers.
Integer representation
Integers are usually represented with the following bit lengths: 8
, 16
, 32
, 64
. When representing integers, two cases
are considered:
- Unsigned (positive) integers: they are simply represented as a sequence of bits. Each bit corresponds to a power
of two (from
0
ton-1
wheren
is the bit-length), and the resulting number is the sum of those powers of two.
Example: 19
is represented as an unsigned int8 as 00010011
because :
19 = 0 x 2^7 + 0 x 2^6 + 0 x 2^5 + 1 x 2^4 + 0 x 2^3 + 0 x 2^2 + 1 x 2^1 + 1 x 2^0
- Signed integers: it is less straightforward to represent signed integers, and multiple approachs exist, the most common being the two’s complement. For more information, you can check the Wikipedia page on the subject.
Real numbers representation
Real numbers are usually represented with the following bit lengths: 16
, 32
, 64
.
The two main ways of representing real numbers are:
- Fixed-point: there are fixed number of digits reserved for representing the integer part and the fractional part.
- Floating-point: the number of digits for representing the integer and the fractional parts can vary.
The floating-point representation can represent bigger ranges of values, and this is the one we will be focusing on since it is the most commonly used. There are three components in the floating-point representation:
- The sign bit: this is the bit specifying the sign of the number.
- The exponent part
- 5 bits in
float16
- 8 bits in
bfloat16
- 8 bits in
float32
- 11 bits in
float64
- The mantissa
- 11 bits in
float16
(10 explictly stored) - 8 bits in
bfloat16
(7 explicitly stored) - 24 bits in
float32
(23 explictly stored) - 53 bits in
float64
(52 explictly stored)
For more information on the bits allocation for each data type, check the nice illustration on the Wikipedia page about the bfloat16 floating-point format.
For a real number x
we have:
x = sign x mantissa x (2^exponent)
References
- The Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference paper
- The Basics of Quantization in Machine Learning (ML) for Beginners blog post
- The How to accelerate and compress neural networks with quantization blog post
- The Wikipedia pages on integers representation here and here
- The Wikipedia pages on