Quantization
🤗 Optimum provides an optimum.intel.neural_compressor
package that enables you to apply quantization to many models hosted on the 🤗 hub using the Intel Neural Compressor quantization API.
IncQuantizer
The IncQuantizer
class allows you to apply different quantization approaches, such as post-training static, post-training dynamic and quantization-aware training, using PyTorch eager or FX graph mode.
class optimum.intel.IncQuantizer
< source >( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] config_path_or_obj: typing.Union[str, optimum.intel.neural_compressor.configuration.IncQuantizationConfig] tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None eval_func: typing.Optional[typing.Callable] = None train_func: typing.Optional[typing.Callable] = None calib_dataloader: typing.Optional[torch.utils.data.dataloader.DataLoader] = None )
from_config
< source >( model_name_or_path: str inc_config: typing.Union[optimum.intel.neural_compressor.configuration.IncQuantizationConfig, str, NoneType] = None config_name: str = None **kwargs ) → quantizer
Parameters
- model_name_or_path (str) — Repository name on the Hugging Face Hub or path to a local directory hosting the model.
- inc_config (Union[IncQuantizationConfig, str], optional) — Configuration containing all the information related to the model quantization. Can be either:
  - an instance of the class IncQuantizationConfig,
  - a string valid as input to IncQuantizationConfig.from_pretrained.
- config_name (str, optional) — Name of the configuration file.
- cache_dir (str, optional) — Path to a directory in which a downloaded configuration should be cached if the standard cache should not be used.
- force_download (bool, optional, defaults to False) — Whether or not to force the (re-)download of the configuration files, overriding the cached versions if they exist.
- resume_download (bool, optional, defaults to False) — Whether or not to delete an incompletely received file. Attempts to resume the download if such a file exists.
- revision (str, optional) — The specific model version to use. It can be a branch name, a tag name, or a commit id. Since we use a git-based system for storing models and other artifacts on huggingface.co, revision can be any identifier allowed by git.
- calib_dataloader (DataLoader, optional) — DataLoader used for the post-training quantization calibration step.
- eval_func (Callable, optional) — Evaluation function returning the tuning objective metric.
- train_func (Callable, optional) — Training function used for the quantization-aware training approach.
Returns
quantizer
IncQuantizer object.
Instantiates an IncQuantizer object from a configuration file that can either be hosted on huggingface.co or located in a local directory.
IncQuantizedModel
The IncQuantizedModel
class allows you to load a quantized PyTorch model from a given configuration file summarizing the quantization applied by Intel® Neural Compressor.
from_pretrained
< source >( model_name_or_path: str inc_config: typing.Union[optimum.intel.neural_compressor.configuration.IncOptimizedConfig, str] = None q_model_name: typing.Optional[str] = None input_names: typing.Optional[typing.List[str]] = None batch_size: typing.Optional[int] = None sequence_length: typing.Union[int, typing.List[int], typing.Tuple[int], NoneType] = None num_choices: typing.Optional[int] = -1 **kwargs ) → q_model
Parameters
- model_name_or_path (str) — Repository name on the Hugging Face Hub or path to a local directory hosting the model.
- inc_config (Union[IncOptimizedConfig, str], optional) — Configuration containing all the information related to the model quantization. Can be either:
  - an instance of the class IncOptimizedConfig,
  - a string valid as input to IncOptimizedConfig.from_pretrained.
- q_model_name (str, optional) — Name of the state dictionary located in model_name_or_path used to load the quantized model. Ignored if state_dict is specified.
- input_names (List[str], optional) — List of the input names used when tracing the model. If unset, model.dummy_inputs().keys() is used instead.
- batch_size (int, optional) — Batch size of the traced model inputs.
- sequence_length (Union[int, List[int], Tuple[int]], optional) — Sequence length of the traced model inputs. For sequence-to-sequence models with different sequence lengths between the encoder and the decoder inputs, this must be [encoder_sequence_length, decoder_sequence_length].
- num_choices (int, optional, defaults to -1) — The number of possible choices for a multiple choice task.
- cache_dir (str, optional) — Path to a directory in which a downloaded configuration should be cached if the standard cache should not be used.
- force_download (bool, optional, defaults to False) — Whether or not to force the (re-)download of the configuration files, overriding the cached versions if they exist.
- resume_download (bool, optional, defaults to False) — Whether or not to delete an incompletely received file. Attempts to resume the download if such a file exists.
- revision (str, optional) — The specific model version to use. It can be a branch name, a tag name, or a commit id. Since we use a git-based system for storing models and other artifacts on huggingface.co, revision can be any identifier allowed by git.
- state_dict (Dict[str, torch.Tensor], optional) — State dictionary of the quantized model. If not specified, q_model_name will be used to load the state dictionary.
Returns
q_model
Quantized model.
Instantiates a quantized PyTorch model from a given Intel Neural Compressor (INC) configuration file.