Optimization

Optimum Intel can be used to apply popular compression techniques such as quantization, pruning and knowledge distillation.

Post-training optimization

Post-training compression techniques such as dynamic and static quantization can easily be applied to your model using our IncQuantizer. Note that quantization is currently only supported for CPUs (only CPU backends are available), so we will not be using GPUs / CUDA in the following examples. To apply dynamic quantization to a fine-tuned DistilBERT, we first need to create the configuration describing the quantization details, as well as the quantizer object that will later be used to apply quantization:

from datasets import load_dataset
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
from evaluate import evaluator
from optimum.intel.neural_compressor import IncQuantizationConfig, IncQuantizer

model_id = "distilbert-base-cased-distilled-squad"
max_eval_samples = 100
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
eval_dataset = load_dataset("squad", split="validation").select(range(max_eval_samples))
task_evaluator = evaluator("question-answering")
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Evaluation function used for INC's accuracy-driven tuning: returns the SQuAD F1 score
def eval_func(model):
    qa_pipeline.model = model
    metrics = task_evaluator.compute(model_or_pipeline=qa_pipeline, data=eval_dataset, metric="squad")
    return metrics["f1"]

# Load the quantization configuration detailing the quantization we wish to apply
config_name = "echarlaix/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
quantization_config = IncQuantizationConfig.from_pretrained(config_name)

# Instantiate our IncQuantizer using the desired configuration and the evaluation function
# used for the INC accuracy-driven tuning strategy
quantizer = IncQuantizer(quantization_config, eval_func=eval_func)

In a second step, we create our optimizer, which will take care of the optimization process.

from optimum.intel.neural_compressor import IncOptimizer

# To instantiate the optimizer, we need the model we wish to optimize and the
# quantizer defining the quantization process
optimizer = IncOptimizer(model, quantizer=quantizer)

# Apply dynamic quantization
quantized_model = optimizer.fit()

# Save the resulting model and its corresponding configuration in the given directory
save_dir = "./quantized_model"
optimizer.save_pretrained(save_dir)
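
Once saved, the quantized model can be loaded back for inference. The following is a minimal sketch assuming the IncQuantizedModelForQuestionAnswering class (following the IncQuantizedModelForXxx naming pattern described below) and the save_dir used above.

from optimum.intel.neural_compressor import IncQuantizedModelForQuestionAnswering

# Reload the quantized model from the directory it was saved to
loaded_model = IncQuantizedModelForQuestionAnswering.from_pretrained(save_dir)

# Plug it back into the question-answering pipeline defined earlier
qa_pipeline.model = loaded_model
result = qa_pipeline(
    question="Which name is also used to describe the Amazon rainforest?",
    context="The Amazon rainforest, also known as Amazonia, covers most of the Amazon basin of South America.",
)
print(result)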

Optimization during training

The IncTrainer class provides an API to train your model while combining different compression techniques such as knowledge distillation, pruning and quantization. The IncTrainer is very similar to the 🤗 Transformers Trainer and can replace it with minimal changes to your code.

from transformers import TrainingArguments, default_data_collator
-from transformers import Trainer
+from optimum.intel.neural_compressor import IncTrainer

# Initialize our IncTrainer
-trainer = Trainer(
+trainer = IncTrainer(
    model=model,
    args=TrainingArguments(output_dir, num_train_epochs=3.0),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)
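
The example above assumes a compute_metrics function. Here is a minimal, hypothetical sketch computing accuracy with the 🤗 Evaluate library; the metric and the task (sequence classification) are assumptions, adapt them to your use case.

import numpy as np
import evaluate

# Hypothetical metric function for a classification task: the trainer passes the
# logits and labels collected at evaluation time, and we report accuracy.
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)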

Here is an example of how to combine magnitude pruning with dynamic quantization while fine-tuning a DistilBERT model. First, we create the corresponding configurations describing the quantization and pruning processes we wish to apply to the model.

from optimum.intel.neural_compressor import IncPruningConfig, IncQuantizationConfig

# The targeted sparsity is set to 10%
target_sparsity = 0.1
config_name = "echarlaix/distilbert-sst2-inc-dynamic-quantization-magnitude-pruning-0.1"
# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = IncQuantizationConfig.from_pretrained(config_name)
# Load the pruning configuration detailing the pruning we wish to apply
pruning_config = IncPruningConfig.from_pretrained(config_name)

Finally, we instantiate our optimizer, which will take care of the optimization process. The training function train_func defines the training step applied during pruning; a minimal sketch is given below.
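
The training function is left to the user; here is a minimal, hypothetical sketch that reuses the IncTrainer instantiated above: it receives the model to train, fine-tunes it, and returns the trained model.

# Hypothetical training function reusing the IncTrainer defined above.
# INC calls it with the model to train; we fine-tune it and return the trained model.
def train_func(model):
    trainer.model = model
    trainer.train()
    return trainer.model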

from optimum.intel.neural_compressor import IncOptimizer, IncPruner, IncQuantizer

# Instantiate our IncQuantizer using the desired configuration
quantizer = IncQuantizer(quantization_config, eval_func=eval_func)
# Instantiate our IncPruner using the desired configuration
pruner = IncPruner(pruning_config, eval_func=eval_func, train_func=train_func)
optimizer = IncOptimizer(model, quantizer=quantizer, pruner=pruner)
# Apply pruning and quantization 
optimized_model = optimizer.fit()

# Save the resulting model and its corresponding configuration in the given directory
optimizer.save_pretrained(save_dir)
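
As a quick sanity check, you can compare the on-disk size of the optimized model with that of the original checkpoint; the helper below is a hypothetical sketch, and int8 dynamic quantization typically shrinks the stored weights substantially.

import os

# Hypothetical helper: total size (in MB) of all files saved under a directory
def dir_size_mb(path):
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    ) / (1024 ** 2)

print(f"Optimized model size: {dir_size_mb(save_dir):.1f} MB")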

Loading a quantized model

To load a quantized model hosted locally or on the 🤗 hub, you must instantiate your model using our IncQuantizedModelForXxx classes.

from optimum.intel.neural_compressor import IncQuantizedModelForSequenceClassification

model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
model = IncQuantizedModelForSequenceClassification.from_pretrained(model_id)

You can find many more quantized models hosted on the hub under the Intel organization (https://huggingface.co/Intel).

Inference with Transformers pipeline

The quantized model can then easily be used to run inference with the 🤗 Transformers pipelines.

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe_cls = pipeline("text-classification", model=model, tokenizer=tokenizer)
text = "He's a dreadful magician."
outputs = pipe_cls(text)

[{'label': 'NEGATIVE', 'score': 0.9880216121673584}]
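
To see the impact of quantization on latency, you can time the int8 pipeline against its fp32 counterpart; the snippet below is a rough, hypothetical benchmark (the fp32 model id is assumed to be the non-quantized counterpart of the model above, and timings vary with hardware and thread settings).

import time
from transformers import AutoModelForSequenceClassification

# Load the fp32 counterpart for comparison (model id is an assumption)
fp32_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
pipe_fp32 = pipeline("text-classification", model=fp32_model, tokenizer=tokenizer)

# Naive latency measurement: average over 50 runs of the same input
for name, pipe in [("fp32", pipe_fp32), ("int8", pipe_cls)]:
    start = time.perf_counter()
    for _ in range(50):
        pipe(text)
    print(f"{name}: {(time.perf_counter() - start) / 50 * 1000:.1f} ms per query")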

Check out the examples directory for more sophisticated usage.