Optimization

Optimum Intel can be used to apply popular compression techniques such as quantization, pruning and knowledge distillation.

Post-training optimization

Post-training compression techniques such as dynamic and static quantization can be easily applied to your model using our INCQuantizer. Note that quantization is currently only supported for CPUs (only the CPU backend is available), so we will not be utilizing GPUs / CUDA in the following examples.

Dynamic quantization

To apply dynamic quantization on a fine-tuned DistilBERT, we first need to create the corresponding configuration describing the quantization details as well as the quantizer object used to later apply quantization:

from transformers import AutoModelForQuestionAnswering
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer

model_name = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# The directory where the quantized model will be saved
save_dir = "dynamic_quantization"

# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
# Apply dynamic quantization and save the resulting model
quantizer.quantize(quantization_config=quantization_config, save_directory=save_dir)
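
The quantized model saved in save_dir can then be loaded back for inference. A minimal sketch, assuming an INCModelForQuestionAnswering class analogous to the INCModelForXxx classes described in the loading section below:

from optimum.intel.neural_compressor import INCModelForQuestionAnswering

# Reload the dynamically quantized model from the directory it was saved to
loaded_model = INCModelForQuestionAnswering.from_pretrained(save_dir)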

An accuracy tolerance, along with an adapted evaluation function, can also be specified in order to search for a quantized model that meets this tolerance.

import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from neural_compressor.config import AccuracyCriterion, TuningCriterion

tokenizer = AutoTokenizer.from_pretrained(model_name)
eval_dataset = load_dataset("squad", split="validation").select(range(64))
task_evaluator = evaluate.evaluator("question-answering")
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

def eval_fn(model):
    qa_pipeline.model = model
    metrics = task_evaluator.compute(model_or_pipeline=qa_pipeline, data=eval_dataset, metric="squad")
    return metrics["f1"]

# Set the accepted accuracy loss to 5%
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.05)
# Set the maximum number of trials to 10
tuning_criterion = TuningCriterion(max_trials=10)
quantization_config = PostTrainingQuantConfig(
    approach="dynamic", accuracy_criterion=accuracy_criterion, tuning_criterion=tuning_criterion
)
quantizer = INCQuantizer.from_pretrained(model, eval_fn=eval_fn)
quantizer.quantize(quantization_config=quantization_config, save_directory=save_dir)

Static quantization

In the same manner, we can apply static quantization, for which we also need to generate a calibration dataset in order to perform the calibration step.

from functools import partial
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The directory where the quantized model will be saved
save_dir = "static_quantization"

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="static")
quantizer = INCQuantizer.from_pretrained(model)
# Generate the calibration dataset needed for the calibration step
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and save the resulting model
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_dataset,
    save_directory=save_dir,
)
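
To keep the saved directory self-contained, the tokenizer can be stored alongside the quantized model, which can then be reloaded locally with the INCModelForXxx classes described below. A minimal sketch:

# Save the tokenizer next to the quantized model so the directory can be used directly
tokenizer.save_pretrained(save_dir)

from optimum.intel.neural_compressor import INCModelForSequenceClassification

# Reload the statically quantized model from the local directory
loaded_model = INCModelForSequenceClassification.from_pretrained(save_dir)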

During training optimization

The INCTrainer class provides an API to train your model while combining different compression techniques such as knowledge distillation, pruning and quantization. The INCTrainer is very similar to the 🤗 Transformers Trainer, which it can replace with minimal changes to your code.

from transformers import TrainingArguments, default_data_collator
-from transformers import Trainer
+from optimum.intel.neural_compressor import INCTrainer
+from neural_compressor import QuantizationAwareTrainingConfig

# Create the configuration detailing the quantization aware training we wish to apply
+quantization_config = QuantizationAwareTrainingConfig()

# The directory where the quantized model will be saved
output_dir = "quantized_model"

-trainer = Trainer(
+trainer = INCTrainer(
    model=model,
+   quantization_config=quantization_config,
    args=TrainingArguments(output_dir, num_train_epochs=3.0),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

# Train the model while applying the quantization defined above
trainer.train()

# Save the PyTorch checkpoint
+trainer.save_model()
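
The diff above assumes that the model, datasets, metric function and tokenizer have already been prepared. A minimal sketch of such a setup (the SST-2 dataset, DistilBERT model and accuracy metric used here are illustrative assumptions, not part of the original example):

import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a small subset of SST-2 for illustration purposes
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda examples: tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True),
    batched=True,
)
train_dataset = dataset["train"].select(range(300))
eval_dataset = dataset["validation"].select(range(100))

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=np.argmax(predictions, axis=1), references=labels)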

For pruning, we support snip_momentum (default), snip_momentum_progressive, magnitude, magnitude_progressive, gradient, gradient_progressive, snip, snip_progressive and pattern_lock. You can refer to the pruning details for more information; a configuration sketch is shown after the note below.

Note: at present, neural_compressor only supports pruning linear and conv ops. This means that if we set a target sparsity of 0.9, the sparsity of the pruned ops will be 0.9, not the sparsity of the whole model. For example, embedding ops will not be pruned.
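
As a sketch of how pruning could be configured with the INCTrainer (assuming your neural_compressor version exposes WeightPruningConfig and that INCTrainer accepts a pruning_config argument; the hyperparameter values below are illustrative):

from neural_compressor import WeightPruningConfig
from optimum.intel.neural_compressor import INCTrainer

# Illustrative configuration targeting 90% sparsity with the default snip_momentum criterion
pruning_config = WeightPruningConfig(
    pruning_type="snip_momentum",
    start_step=0,
    end_step=15,
    target_sparsity=0.9,
)

trainer = INCTrainer(
    model=model,
    pruning_config=pruning_config,
    args=TrainingArguments(output_dir, num_train_epochs=3.0),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)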

For distillation, we support knowledge distillation, intermediate layer knowledge distillation and self distillation. You can refer to the distillation details for more information.
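
Similarly, a sketch of how knowledge distillation could be set up (assuming DistillationConfig accepts a teacher_model argument and that INCTrainer accepts a distillation_config argument; the teacher model below is an illustrative choice):

from neural_compressor import DistillationConfig
from optimum.intel.neural_compressor import INCTrainer
from transformers import AutoModelForSequenceClassification

# Illustrative teacher: a larger model fine-tuned on the same task as the student
teacher_model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")
distillation_config = DistillationConfig(teacher_model=teacher_model)

trainer = INCTrainer(
    model=model,
    distillation_config=distillation_config,
    args=TrainingArguments(output_dir, num_train_epochs=3.0),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)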

Loading a quantized model

To load a quantized model hosted locally or on the 🤗 hub, you must instantiate your model using our INCModelForXxx classes.

from optimum.intel.neural_compressor import INCModelForSequenceClassification

model_name = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
model = INCModelForSequenceClassification.from_pretrained(model_name)

You can load many more quantized models hosted on the hub under the Intel organization here.

Inference with Transformers pipeline

The quantized model can then easily be used to run inference with the Transformers pipelines.

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe_cls = pipeline("text-classification", model=model, tokenizer=tokenizer)
text = "He's a dreadful magician."
outputs = pipe_cls(text)

[{'label': 'NEGATIVE', 'score': 0.9880216121673584}]

Check out the examples directory for more sophisticated usage.