Quantization of Transformer Models with Neural Compressor

Community blog post
Published February 1, 2024



In the fast-moving field of natural language processing (NLP), Hugging Face* Transformers has become one of the most widely used open-source libraries. Named after the Transformer architecture, it has reshaped the way we approach language-based tasks. Intel, in collaboration with Hugging Face, provides optimization techniques such as quantization with Intel® Neural Compressor to improve model performance on Intel® platforms.



Before we embark on this transformative journey, let's understand key terms:

  1. INCTrainer and INCQuantizer: Classes provided by Optimum Intel. INCTrainer extends Transformers' Trainer and supports quantization-aware training, while INCQuantizer applies post-training quantization to an already trained model.

  2. Optimum Library: Hugging Face's extension of Transformers for hardware-specific optimization. Its Intel integration, Optimum Intel, connects Transformers with Intel's performance optimization tools such as Neural Compressor.

  3. Quantization: A technique to compress models by reducing the precision of weights and activations (for example, from 32-bit floats to 8-bit integers), which shrinks the model and speeds up inference, usually at the cost of only a small loss in accuracy.
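To make the idea of "reducing precision" concrete, here is a minimal sketch of 8-bit affine quantization in plain Python. It is an illustration only and not how Neural Compressor is implemented; the function names `quantize` and `dequantize` and the sample weights are made up for this example.

```python
def quantize(values, num_bits=8):
    """Map a list of floats to unsigned integers using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # made-up example values
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("restored: ", [round(r, 4) for r in restored])
print("max error:", max_error, "(one quantization step is", scale, ")")
```

Each value is stored in one byte instead of four, and the reconstruction error stays within one quantization step. That small rounding error is the source of the accuracy trade-off mentioned above.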

Benefits of Quantization with Neural Compressor:

Why should you consider quantization for your Hugging Face Transformer models using Intel's Neural Compressor? Let's explore the compelling advantages:

  1. Optimized Performance: Integration with Optimum Intel lets your models take advantage of hardware-specific optimizations on Intel® platforms.

  2. Seamless Deployment: After compression, deploy the quantized models with Intel runtimes and tools such as Intel® Extension for PyTorch*, Intel® Extension for Transformers*, and OpenVINO™ toolkit.

  3. Flexible Configuration: Tailor compression to your requirements. INCTrainer accepts quantization, pruning, and distillation configurations, while INCQuantizer covers post-training quantization.

  4. ONNX Export: Convert your PyTorch models into an Open Neural Network Exchange (ONNX*) format, expanding applicability across various frameworks.

  5. User-Friendly Interface: The Optimum library provides user-friendly Python command-line interfaces for compression examples, ensuring accessibility and ease of use.


Code Implementation

Let's walk through a practical example of quantization using Neural Compressor.


Step I: Install Libraries

!pip install transformers datasets evaluate  accelerate optimum[neural-compressor] -qU

Step II: Import and Load Dataset

## Import Libraries
import transformers
import evaluate
import numpy as np
import random

from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments
from datasets import load_dataset  # load_metric was unused here and has been removed from recent `datasets` releases
from transformers.utils import send_example_telemetry
from optimum.intel.version import __version__

send_example_telemetry("classification_notebook", framework="pytorch")


# Defining a constant SEED for reproducibility in random operations
SEED = 42

# Setting the seed for the random library to ensure consistent results
random.seed(SEED)

MODEL = 'distilbert-base-cased'

## Load the Dataset
# Importing the ClassLabel module to represent categorical class labels
from datasets import ClassLabel

# Loading the 'app_reviews' dataset's training split into the 'dataset' variable
dataset = load_dataset('app_reviews', split='train')

# Converting the 'star' column in our dataset to a ClassLabel type
# This allows for categorical representation and easier handling of classes
dataset = dataset.class_encode_column('star')

# Split the Dataset into Train-Test-Val
# Splitting the dataset into a training set and a test set.
# We reserve 20% of the data for testing and use stratification on the 'star' column
# to ensure both sets have an equal distribution of each star category.
dataset = dataset.train_test_split(test_size=0.2, seed=SEED, stratify_by_column='star')

# Now, we further split our training dataset to reserve 25% of it for validation.
# Again, we stratify by the 'star' column to keep the distribution consistent.
df = dataset['train'].train_test_split(test_size=.25, seed=SEED, stratify_by_column='star')

# Assigning the split datasets to their respective keys:
# - The remaining 75% of our initial training data becomes the new training dataset.
dataset['train'] = df['train']

# - The 25% split from our initial training data becomes the validation dataset.
dataset['val'] = df['test']

# Displaying the dataset to see the distribution across train, test, and validation sets.
print(dataset)
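Because the validation set is carved out of what remains after the test split, the 25% figure does not refer to the full dataset. The short sketch below works through the arithmetic; `split_shares` is a helper written for this illustration and is not part of the datasets library.

```python
from fractions import Fraction


def split_shares(test_size, val_size):
    """Return each split's share of the full dataset for a two-stage split."""
    remaining = 1 - test_size   # what is left after carving out the test set
    val = remaining * val_size  # validation is taken from the remainder
    train = remaining - val
    return {"train": train, "val": val, "test": test_size}


# Same proportions as the two train_test_split calls above: 20%, then 25%.
shares = split_shares(Fraction(20, 100), Fraction(25, 100))
for name, share in shares.items():
    print(f"{name}: {float(share):.0%}")
```

The two-stage split therefore yields a 60/20/20 train/validation/test partition of the original data.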

Step III: Processing Dataset

tokenizer = AutoTokenizer.from_pretrained(MODEL)

#### simple function to batch tokenize utterances with truncation
def preprocess_function(examples):  # each example is an element from the Dataset
    return tokenizer(examples["review"], truncation=True)

#### DataCollatorWithPadding creates batches of data. It also dynamically pads text to the
####  length of the longest element in each batch, making them all the same length.
####  It's possible to pad your text in the tokenizer function with padding=True, but dynamic padding is more efficient.

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

dataset = dataset.map(preprocess_function, batched=True)

dataset = dataset.rename_column("star", "label")
dataset = dataset.remove_columns(['package_name', 'review', 'date'])
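To see why dynamic padding is more efficient, the following sketch compares the number of padding tokens needed when each batch is padded to its own longest sequence versus a fixed global length. It is plain Python for illustration and does not call DataCollatorWithPadding; the token IDs, the `pad_batch` helper, and `MAX_LEN` are made up, and the sequences are assumed to be no longer than the target length.

```python
def pad_batch(batch, pad_id=0, length=None):
    """Pad every sequence to `length`, or to the longest sequence in the batch."""
    target = length if length is not None else max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


# Two made-up batches of already-tokenized reviews.
batches = [
    [[101, 7, 102], [101, 7, 8, 9, 102]],
    [[101, 5, 6, 102], [101, 102]],
]
MAX_LEN = 8  # hypothetical fixed length used for static padding

real_tokens = sum(len(seq) for batch in batches for seq in batch)
dynamic_slots = sum(len(row) for batch in batches for row in pad_batch(batch))
fixed_slots = sum(
    len(row) for batch in batches for row in pad_batch(batch, length=MAX_LEN)
)

dynamic_padding = dynamic_slots - real_tokens
fixed_padding = fixed_slots - real_tokens
print("padding tokens with dynamic padding:", dynamic_padding)
print("padding tokens with fixed padding:  ", fixed_padding)
```

Fewer padding tokens means less wasted computation per batch, which is the reason the collator pads per batch rather than up front.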

Step IV: Applying quantization on the model

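# NOTE: the original call was truncated after MODEL. Deriving num_labels from the
# encoded label column is an assumption made to complete it.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=dataset["train"].features["label"].num_classes,
)
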
#### To instantiate an INCTrainer, we need to define three more things. First, we create the
#### quantization configuration describing the quantization process we wish to apply.
#### Quantization will be applied to the embeddings, to the linear layers, and to their
#### corresponding input activations.
from neural_compressor import QuantizationAwareTrainingConfig

quantization_config = QuantizationAwareTrainingConfig()

Step V: Training Arguments and Compute Metrics

epochs = 2
save_directory = f"{MODEL.split('/')[-1]}-finetuned-task"
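training_args = TrainingArguments(
    # NOTE: the original call was truncated. output_dir and num_train_epochs are
    # filled in from the variables defined above; other arguments are unknown.
    output_dir=save_directory,
    num_train_epochs=epochs,
    # some deep learning parameters that the Trainer is able to take in
    weight_decay=0.05,
)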

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": (preds == p.label_ids).mean()}
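To make the metric concrete, the sketch below reproduces the same argmax-then-compare logic in plain Python on a handful of made-up logits for the five star classes. The `argmax` and `accuracy` helpers and all numbers are illustrative only.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews over five star classes.
logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0],  # highest score at class 1
    [1.5, 0.2, 0.1, 0.0, 0.3],   # highest score at class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],   # highest score at class 4
    [0.9, 0.8, 0.1, 0.0, 0.2],   # highest score at class 0
]
labels = [1, 0, 4, 2]

result = accuracy(logits, labels)
print(result)  # three of the four predictions match their labels
```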

Step VI: Training

import copy
from optimum.intel.neural_compressor import INCTrainer

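# NOTE: the original call was truncated. Arguments other than `task` are reconstructed
# from the objects defined in the earlier steps and should be treated as an assumption.
trainer = INCTrainer(
    model=model, args=training_args, data_collator=data_collator, tokenizer=tokenizer,
    train_dataset=dataset["train"], eval_dataset=dataset["val"],
    compute_metrics=compute_metrics, quantization_config=quantization_config,
    task="sequence-classification",  # optional: only needed to export the model to the ONNX format
)
fp_model = copy.deepcopy(model)  # keep a full-precision copy of the model before training

trainer.train()
trainer.save_model(save_directory)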

Step VII: Loading the Quantized Model

from optimum.intel.neural_compressor import INCModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification

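# Load the quantized PyTorch model from the directory it was saved to
pytorch_model = INCModelForSequenceClassification.from_pretrained(save_directory)
# Loading with ONNX Runtime assumes an ONNX export of the model was also saved to
# save_directory and that optimum[onnxruntime] is installed
onnx_model = ORTModelForSequenceClassification.from_pretrained(save_directory)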


If you are looking to optimize Hugging Face Transformer models, Intel's Neural Compressor is a practical place to start. Quantization reduces model size and speeds up inference, and the resulting models deploy readily on Intel® platforms. Combining the Hugging Face ecosystem with Intel's optimization tooling makes it possible to get more out of your NLP models without rewriting your training code.

Stay connected and support my work through various platforms:

Medium: You can read my latest articles and insights on Medium at https://medium.com/@andysingal

PayPal: Enjoyed my article? Buy me a coffee! https://paypal.me/alphasingal?country.x=US&locale.x=en_US

Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.