Optimization

🤗 Optimum Intel provides an openvino package that enables you to apply a variety of model quantization methods on many models hosted on the 🤗 hub using the NNCF framework.

Quantization is a technique that reduces the computational and memory costs of running inference by representing the weights and/or the activations with lower-precision data types such as 8-bit or 4-bit.

Optimization Support Matrix


Task (OV Model Class)	Weight-only Quantization				Hybrid Quantization		Full Quantization		Mixed Quantization
	Data-free		Data-aware
	CLI	Python	CLI	Python	CLI	Python	CLI	Python	CLI	Python
text-generation (OVModelForCausalLM)					–	-
image-text-to-text (OVModelForVisualCausalLM)					–	–			–	–
text-to-image, text-to-video (OVDiffusionPipeline)			–	–					–	–
automatic-speech-recognition (OVModelForSpeechSeq2Seq)	–	–	–	–	–	–			–	–
feature-extraction (OVModelForFeatureExtraction)					–	-
feature-extraction (OVSentenceTransformer)					–	-
fill-mask (OVModelForMaskedLM)					–	-
text2text-generation (OVModelForSeq2SeqLM)					–	-
zero-shot-image-classification (OVModelForZeroShotImageClassification)					–	-
feature-extraction (OVSamModel)	–		–		–	-			–	–
text-to-audio (OVModelForTextToSpeechSeq2Seq)	✅		–	–	–	–	–	–	–	–

Weight-only Quantization

Quantization can be applied to the model’s Linear, Convolutional and Embedding layers, enabling the loading of large models on memory-limited devices. For example, with 8-bit quantization the resulting model is 4x smaller than its fp32 counterpart. With 4-bit quantization, the reduction in memory could theoretically reach 8x, but is closer to 6x in practice.

8-bit

For 8-bit weight quantization, pass quantization_config=OVWeightQuantizationConfig(bits=8) to load your model’s weights in 8-bit:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "helenai/gpt2-ov"
quantization_config = OVWeightQuantizationConfig(bits=8)
model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

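# Placeholder output directory (not part of the original example); use any path you like
saving_directory = "gpt2-ov-int8"
# Saves the int8 model that will be 4x smaller than its fp32 counterpart
model.save_pretrained(saving_directory)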

Weights of language models inside vision-language pipelines can be quantized in a similar way:

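from optimum.intel import OVModelForVisualCausalLM

# Reuses the 8-bit quantization_config defined in the previous example
model = OVModelForVisualCausalLM.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config,
)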

If quantization_config is not provided, the model is exported in 8-bit by default when it has more than 1 billion parameters. You can disable this with load_in_8bit=False.
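The default behaviour described above can be summarised as a small decision function. This is a minimal sketch in plain Python: resolve_default_quantization is a hypothetical helper, not part of Optimum Intel, and the handling of an explicit load_in_8bit=True is an assumption.

```python
ONE_BILLION = 1_000_000_000


def resolve_default_quantization(num_parameters, quantization_config=None, load_in_8bit=None):
    """Return the weight quantization settings implied by the rule above.

    Hypothetical helper: it mirrors the documented behaviour only and is not
    Optimum Intel's internal logic.
    """
    if quantization_config is not None:
        # An explicitly provided configuration is used as-is.
        return quantization_config
    if load_in_8bit is False:
        # The 8-bit default was explicitly disabled.
        return None
    if load_in_8bit or num_parameters > ONE_BILLION:
        # Assumption: an explicit load_in_8bit=True requests 8-bit weights
        # regardless of the model size.
        return {"bits": 8}
    return None


print(resolve_default_quantization(7_000_000_000))
print(resolve_default_quantization(124_000_000))
print(resolve_default_quantization(7_000_000_000, load_in_8bit=False))
```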

4-bit

4-bit weight quantization can be achieved in a similar way:

from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(model_id, quantization_config={"bits": 4})

For some models, we provide preconfigured 4-bit weight-only quantization configurations that offer a good trade-off between quality and speed. This default 4-bit configuration is applied automatically when you specify quantization_config={"bits": 4}.

The same 4-bit setting works for vision-language pipelines:

from optimum.intel import OVModelForVisualCausalLM

model = OVModelForVisualCausalLM.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config={"bits": 4},
)

You can tune the quantization parameters to achieve a better performance-accuracy trade-off as follows:

from optimum.intel import OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    ratio=0.8,
    quant_method="awq",
    dataset="wikitext2"
)

Note: OVWeightQuantizationConfig also accepts keyword arguments that are not listed in its constructor. Such arguments are passed directly to the nncf.compress_weights() call, which is useful for passing additional parameters to the quantization algorithm.

By default, the quantization scheme is asymmetric. To make it symmetric, add sym=True.
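To make the difference between the two schemes concrete, here is a minimal pure-Python sketch of generic asymmetric and symmetric 4-bit quantization applied to a handful of weights. It is an illustration under simple assumptions (one scale for the whole list, round-to-nearest), not the NNCF implementation, and the function names are hypothetical.

```python
def quantize_asymmetric(weights, bits=4):
    """Map weights to unsigned integers in [0, 2**bits - 1] using a scale and a zero point."""
    levels = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for constant weights
    zero_point = round(-w_min / scale)
    quantized = [min(levels, max(0, round(w / scale) + zero_point)) for w in weights]
    dequantized = [(q - zero_point) * scale for q in quantized]
    return quantized, dequantized


def quantize_symmetric(weights, bits=4):
    """Map weights to signed integers in [-2**(bits-1), 2**(bits-1) - 1] using a scale only."""
    q_max = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / q_max or 1.0
    quantized = [max(-q_max - 1, min(q_max, round(w / scale))) for w in weights]
    dequantized = [q * scale for q in quantized]
    return quantized, dequantized


def max_abs_error(original, restored):
    return max(abs(o - r) for o, r in zip(original, restored))


# All-positive weights: the symmetric scheme cannot use its negative codes here.
weights = [0.1 * i for i in range(16)]
asym_q, asym_deq = quantize_asymmetric(weights)
sym_q, sym_deq = quantize_symmetric(weights)
asym_error = max_abs_error(weights, asym_deq)
sym_error = max_abs_error(weights, sym_deq)
print(f"asymmetric max error: {asym_error:.4f}")
print(f"symmetric  max error: {sym_error:.4f}")
```

On skewed weight distributions such as this one, the zero point lets the asymmetric scheme use the full integer range, while the symmetric scheme needs no zero point, which makes it simpler.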

For 4-bit quantization, you can also specify the following arguments in the quantization configuration:

The group_size parameter defines the group size to use for quantization. Setting it to -1 results in per-column quantization.
The ratio parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, 90% of the layers will be quantized to int4 while the remaining 10% will be quantized to int8.

Smaller group_size and ratio values usually improve accuracy at the cost of model size and inference latency.
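The effect of these two parameters on model size can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes one fp16 scale and one 4-bit zero point per 4-bit group, treats ratio as a fraction of the weights (that is, equally sized layers), and ignores the per-column overhead of 8-bit and per-column quantization. Both helper names are hypothetical.

```python
def average_bits_per_weight(ratio, group_size, scale_bits=16, zero_point_bits=4):
    """Estimate the average storage cost of one weight, in bits.

    Assumptions (not taken from NNCF): one fp16 scale and one 4-bit zero point
    per 4-bit group, and negligible metadata for the 8-bit share of the weights.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # group_size == -1 means per-column quantization: overhead is treated as negligible.
    overhead = 0.0 if group_size == -1 else (scale_bits + zero_point_bits) / group_size
    int4_bits = 4 + overhead
    int8_bits = 8
    return ratio * int4_bits + (1.0 - ratio) * int8_bits


def compression_vs_fp32(ratio, group_size):
    """Return how many times smaller the compressed weights are than fp32 weights."""
    return 32 / average_bits_per_weight(ratio, group_size)


for ratio, group_size in [(1.0, -1), (1.0, 128), (0.8, 128), (0.8, 32)]:
    factor = compression_vs_fp32(ratio, group_size)
    print(f"ratio={ratio}, group_size={group_size}: about {factor:.2f}x smaller than fp32")
```

Under these assumptions, only the ideal case reaches 8x. Keeping part of the weights in int8 and storing per-group metadata brings the estimate closer to 6x, which is consistent with the practical figure quoted for 4-bit quantization.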

The quality of a 4-bit weight-compressed model can be further improved by employing one of the following data-dependent methods:

AWQ, which stands for Activation-aware Weight Quantization, is an algorithm that tunes model weights for more accurate 4-bit compression. It slightly improves the generation quality of compressed LLMs, but requires significant additional time and memory for tuning weights on a calibration dataset. Note that the model may contain no matching patterns to apply AWQ to, in which case it is skipped. A data-free version of AWQ is also available; it relies on per-column magnitudes of weights instead of activations.
Scale Estimation is a method that tunes quantization scales to minimize the L2 error between the original and compressed layers. Providing a dataset is required to run scale estimation. Using this method also incurs additional time and memory overhead.
GPTQ optimizes compressed weights in a layer-wise fashion to minimize the difference between activations of a compressed and original layer.
LoRA Correction mitigates quantization noise introduced during weight compression by leveraging low-rank adaptation.
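To give an intuition for what a data-dependent method does, the sketch below illustrates the idea behind scale estimation on a single output neuron: instead of deriving the scale from the weight range alone, it searches for the scale that minimizes the L2 error of the layer output on calibration samples. This is a simplified grid search written for illustration, not the NNCF algorithm; all names and the toy data are hypothetical.

```python
def quantize_with_scale(weights, scale, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    q_max = 2 ** (bits - 1) - 1
    return [max(-q_max - 1, min(q_max, round(w / scale))) * scale for w in weights]


def layer_outputs(weights, samples):
    """Output of a single linear neuron for each calibration sample."""
    return [sum(w * x for w, x in zip(weights, sample)) for sample in samples]


def output_l2_error(weights, quantized_weights, samples):
    reference = layer_outputs(weights, samples)
    compressed = layer_outputs(quantized_weights, samples)
    return sum((r - c) ** 2 for r, c in zip(reference, compressed))


def estimate_scale(weights, samples, bits=4, steps=40):
    """Grid-search a scale around the range-based one to minimize the output error."""
    q_max = 2 ** (bits - 1) - 1
    base_scale = max(abs(w) for w in weights) / q_max or 1.0
    base_error = output_l2_error(weights, quantize_with_scale(weights, base_scale, bits), samples)
    best_scale, best_error = base_scale, base_error
    for step in range(steps + 1):
        candidate = base_scale * (0.5 + step / steps)  # from 0.5x to 1.5x of the base scale
        error = output_l2_error(weights, quantize_with_scale(weights, candidate, bits), samples)
        if error < best_error:
            best_scale, best_error = candidate, error
    return base_error, best_scale, best_error


# Toy weights and calibration samples, made up for this example.
weights = [0.62, -0.41, 0.05, 0.93, -0.27, 0.11]
samples = [
    [1.0, 0.5, -0.2, 0.8, 0.1, -0.4],
    [0.3, -0.7, 0.9, 0.2, -0.5, 0.6],
    [-0.6, 0.4, 0.1, 1.1, 0.7, -0.3],
]
baseline_error, tuned_scale, tuned_error = estimate_scale(weights, samples)
print(f"baseline error: {baseline_error:.6f}, tuned error: {tuned_error:.6f}")
```

The search needs the calibration samples to evaluate the output error, which is why a dataset is required and why the method costs extra time and memory.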

Data-aware algorithms can be applied together or separately. To do so, pass the corresponding arguments to the 4-bit OVWeightQuantizationConfig together with a dataset. For example:

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    ratio=0.8,
    quant_method="awq",
    scale_estimation=True,
    gptq=True,
    dataset="wikitext2"
)

Note: GPTQ and LoRA Correction algorithms can’t be applied simultaneously.

Full quantization

When applying post-training full quantization, both the weights and the activations are quantized. Quantizing the activations requires an additional calibration step, which consists of feeding a calibration_dataset to the network in order to estimate the activation quantization parameters.
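As a rough illustration of what the calibration step estimates, the sketch below tracks the minimum and maximum of activations observed over a few calibration batches and derives an 8-bit scale and zero point from them. It is a generic min-max scheme in plain Python with hypothetical names; the statistics and algorithms actually used by NNCF may differ.

```python
def calibrate_activation_range(batches):
    """Track the smallest and largest activation seen across all calibration batches."""
    low, high = float("inf"), float("-inf")
    for batch in batches:
        low = min(low, min(batch))
        high = max(high, max(batch))
    return low, high


def activation_qparams(low, high, bits=8):
    """Derive a scale and zero point for unsigned integers from the observed range."""
    low, high = min(low, 0.0), max(high, 0.0)  # the range must contain zero
    levels = 2**bits - 1
    scale = (high - low) / levels or 1.0
    zero_point = round(-low / scale)
    return scale, zero_point


def fake_quantize(value, scale, zero_point, bits=8):
    """Quantize then dequantize a value; out-of-range values are clipped."""
    q = min(2**bits - 1, max(0, round(value / scale) + zero_point))
    return (q - zero_point) * scale


# Toy activations standing in for the outputs of one layer on two calibration batches.
calibration_batches = [[-1.0, 0.5], [2.0, 0.25]]
low, high = calibrate_activation_range(calibration_batches)
scale, zero_point = activation_qparams(low, high)
print(f"range=({low}, {high}), scale={scale:.5f}, zero_point={zero_point}")
print(f"0.5 is represented as {fake_quantize(0.5, scale, zero_point):.5f}")
```

Values outside the calibrated range are clipped at inference time, which is why the calibration data should be representative of the real inputs.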

Here is how to apply full quantization on a fine-tuned DistilBERT given your own calibration_dataset:

from transformers import AutoTokenizer
from optimum.intel import OVQuantizer, OVModelForSequenceClassification, OVConfig, OVQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The directory where the quantized model will be saved
save_dir = "ptq_model"

quantizer = OVQuantizer.from_pretrained(model)

# Apply full quantization and export the resulting quantized model to OpenVINO IR format
ov_config = OVConfig(quantization_config=OVQuantizationConfig())
quantizer.quantize(ov_config=ov_config, calibration_dataset=calibration_dataset, save_directory=save_dir)
# Save the tokenizer
tokenizer.save_pretrained(save_dir)

The calibration dataset can also be created easily using your OVQuantizer:

from functools import partial

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Create the calibration dataset used to perform full quantization
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)

The quantize() method applies post-training quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
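A quick way to check an export is to confirm that the save directory contains a matching XML/binary pair. The helper below is a small stdlib sketch; find_ir_models is a hypothetical name, and the openvino_model file name used in the demo is an assumption about how the exported files are named.

```python
import tempfile
from pathlib import Path


def find_ir_models(directory):
    """Return the stems of all models stored as a matching .xml/.bin pair."""
    directory = Path(directory)
    xml_stems = {path.stem for path in directory.glob("*.xml")}
    bin_stems = {path.stem for path in directory.glob("*.bin")}
    return sorted(xml_stems & bin_stems)


# Demo on a temporary directory with empty placeholder files.
# "openvino_model" is an assumed file name used for illustration only.
with tempfile.TemporaryDirectory() as save_dir:
    for file_name in ("openvino_model.xml", "openvino_model.bin", "config.json"):
        (Path(save_dir) / file_name).touch()
    found = find_ir_models(save_dir)

print(found)
```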

Speech-to-text Models Quantization

The speech-to-text Whisper model can be quantized without preparing a custom calibration dataset, as in the example below.

from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

model_id = "openai/whisper-tiny"
ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    stateful=False,
    quantization_config=OVQuantizationConfig(
        num_samples=10,
        dataset="librispeech",
        processor=model_id,
        smooth_quant_alpha=0.95,
    )
)

With this, encoder, decoder and decoder-with-past models of the Whisper pipeline will be fully quantized, including activations.

Hybrid quantization

Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of the activations is comparable to that of the weights.

The U-Net component takes up most of the overall execution time of the pipeline, so optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could lead to substantial accuracy degradation. Therefore, the proposal is to apply quantization in hybrid mode to the U-Net model and weight-only quantization to the rest of the pipeline components:

U-Net: quantization applied to both the weights and the activations
The text encoder, VAE encoder / decoder: quantization applied to the weights

In hybrid mode, weights are quantized in MatMul and Embedding layers and activations are quantized in the other layers, which helps preserve accuracy after optimization while reducing the model size.

The quantization_config defines the optimization parameters for the SD pipeline. To enable hybrid quantization, specify a quantization dataset in the quantization_config. If no dataset is defined, weight-only quantization is applied to all components.
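The rule above can be written down as a small planning function. This is a plain-Python sketch of the documented behaviour only, not Optimum Intel's internal logic; the function and the component names are hypothetical.

```python
def plan_diffusion_quantization(components, dataset=None):
    """Return the quantization mode each pipeline component would receive."""
    plan = {}
    for name in components:
        if name == "unet" and dataset is not None:
            # A calibration dataset enables hybrid quantization for the U-Net.
            plan[name] = "hybrid"
        else:
            # Every other component, or every component without a dataset,
            # only has its weights quantized.
            plan[name] = "weight-only"
    return plan


# Hypothetical component names for a Stable Diffusion pipeline.
components = ["unet", "text_encoder", "vae_encoder", "vae_decoder"]
with_dataset = plan_diffusion_quantization(components, dataset="conceptual_captions")
without_dataset = plan_diffusion_quantization(components)
print(with_dataset)
print(without_dataset)
```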

from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

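# model_id is expected to point to a Stable Diffusion checkpoint (it is not defined in this snippet)
model = OVStableDiffusionPipeline.from_pretrained(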
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
)

For more details, please refer to the corresponding NNCF documentation.

Mixed Quantization

Mixed quantization is a technique that combines weight-only quantization with full quantization. During mixed quantization we separately quantize:

weights of weighted layers to one precision, and
activations (and possibly, weights, if some were skipped at the first step) of other supported layers to another precision.

By default, weights of all weighted layers are quantized in the first step. In the second step, activations of weighted and non-weighted layers are quantized. If some layers are instructed to be ignored in the first step with the weight_quantization_config.ignored_scope parameter, both weights and activations of these layers are quantized to the precision given in the full_quantization_config.

When running this kind of optimization through the Python API, OVMixedQuantizationConfig should be used. In this case, the precision for the first step should be provided with the weight_quantization_config argument and the precision for the second step with the full_quantization_config argument. For example:
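The two steps can be traced on a toy list of layers. The sketch below is plain Python that follows the description above; it is not how NNCF represents a model, and the layer names and the assign_precisions helper are hypothetical. The precision strings cb4 and f8e4m3 are only labels here.

```python
def assign_precisions(layers, weight_dtype, full_dtype, ignored_scope=()):
    """Return the weight and activation precision of each layer.

    layers is a list of (name, has_weights) tuples.
    """
    plan = {}
    for name, has_weights in layers:
        if not has_weights:
            weights = None  # nothing to compress in the first step
        elif name in ignored_scope:
            weights = full_dtype  # skipped in step 1, quantized in step 2 instead
        else:
            weights = weight_dtype  # step 1: weight-only quantization
        # Step 2: activations of weighted and non-weighted layers.
        plan[name] = {"weights": weights, "activations": full_dtype}
    return plan


# Hypothetical layer names, used for illustration only.
layers = [("embedding", True), ("attention_matmul", True), ("softmax", False), ("lm_head", True)]
plan = assign_precisions(layers, weight_dtype="cb4", full_dtype="f8e4m3", ignored_scope={"lm_head"})
for name, precisions in plan.items():
    print(name, precisions)
```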

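from optimum.intel import (
    OVMixedQuantizationConfig,
    OVModelForCausalLM,
    OVQuantizationConfig,
    OVWeightQuantizationConfig,
)

model = OVModelForCausalLM.from_pretrained(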
    'TinyLlama/TinyLlama-1.1B-Chat-v1.0',
    quantization_config=OVMixedQuantizationConfig(
        weight_quantization_config=OVWeightQuantizationConfig(bits=4, dtype='cb4'),
        full_quantization_config=OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2')
    )
)

To apply mixed quantization through the CLI, use the --quant-mode argument. For example:

optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir

Don’t forget to provide a dataset since it is required for the calibration procedure during full quantization.

Pipeline Quantization

Some multimodal pipelines, such as Stable Diffusion or vision-language models, consist of multiple components. In these cases, you may need to apply different quantization methods to different components of the pipeline. For example, you may want to apply int4 data-aware weight-only quantization to the language model in a vision-language pipeline, while applying int8 weight-only quantization to the other components. In this case, you can use the OVPipelineQuantizationConfig class to specify the quantization configuration for each component of the pipeline.

For example, the code below quantizes the weights and activations of the language model inside InternVL2-1B, compresses the weights of the text embedding model and skips quantization for the vision embedding model.

from optimum.intel import OVModelForVisualCausalLM
from optimum.intel import OVPipelineQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig

model_id = "OpenGVLab/InternVL2-1B"
model = OVModelForVisualCausalLM.from_pretrained(
    model_id,
    export=True,
    trust_remote_code=True,
    quantization_config=OVPipelineQuantizationConfig(
        quantization_configs={
            "lm_model": OVQuantizationConfig(bits=8),
            "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
        },
        dataset="contextual",
    )
)
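The per-component behaviour of the example above can be pictured as a lookup: components listed in quantization_configs get their own configuration, and any other component is left unquantized. The sketch below uses plain dictionaries with descriptive strings instead of real configuration objects; resolve_component_configs is a hypothetical helper and vision_embeddings_model is an assumed component name.

```python
def resolve_component_configs(components, quantization_configs):
    """Map every pipeline component to its configuration, or None when it is skipped."""
    unknown = set(quantization_configs) - set(components)
    if unknown:
        raise ValueError(f"Unknown pipeline components: {sorted(unknown)}")
    return {name: quantization_configs.get(name) for name in components}


# Descriptive strings stand in for OVQuantizationConfig / OVWeightQuantizationConfig objects.
components = ["lm_model", "text_embeddings_model", "vision_embeddings_model"]
resolved = resolve_component_configs(
    components,
    {
        "lm_model": "full quantization, 8-bit",
        "text_embeddings_model": "weight-only quantization, 8-bit",
    },
)
for name, config in resolved.items():
    print(f"{name}: {config or 'not quantized'}")
```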