Scaling Transformer Model Performance with Intel AI

Intel and Hugging Face are building powerful optimization tools to accelerate training and inference with Transformers. Learn how the Intel AI Software Portfolio and Intel Xeon Scalable processors can help you get the best performance and productivity from your models.

Democratizing Machine Learning Acceleration

Intel and Hugging Face are collaborating to build state-of-the-art hardware and software acceleration to train, fine-tune, and predict with Transformers. The hardware acceleration is driven by the Intel Xeon Scalable CPU platform, and the software acceleration by a rich suite of optimized AI tools, frameworks, and libraries.

huggingface@intel:~
from transformers import AutoModelForQuestionAnswering
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer, INCModelForQuestionAnswering

model_name = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# The directory where the quantized model will be saved
save_dir = "quantized_model"
# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
# Apply dynamic quantization and save the resulting model
quantizer.quantize(quantization_config=quantization_config, save_directory=save_dir)

# Load the resulting quantized model, which can be hosted on the HF hub or locally
loaded_model = INCModelForQuestionAnswering.from_pretrained(save_dir)
huggingface@intel:~
from transformers import AutoFeatureExtractor, pipeline
from optimum.intel.openvino import OVModelForImageClassification

model_id = "google/vit-base-patch16-224"
# Load a model from the HF hub and convert it to the OpenVINO format
model = OVModelForImageClassification.from_pretrained(model_id, from_transformers=True)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
cls_pipeline = pipeline("image-classification", model=model, feature_extractor=feature_extractor)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
# Run inference with OpenVINO Runtime using Transformers pipelines
outputs = cls_pipeline(url)

Scale Transformer Workloads with Intel AI

Hardware performance and developer productivity at unmatched scale

Easily optimize models for production

Optimum Intel is the interface between Hugging Face's Transformers library and the tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures. Intel Neural Compressor is an open-source library that implements the most popular compression techniques, such as quantization, pruning, and knowledge distillation. OpenVINO is an open-source toolkit for optimizing and deploying models with high-performance inference on Intel devices. With Optimum Intel, you can apply state-of-the-art optimization techniques to your Transformer models with minimal effort.
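To make the "minimal effort" point concrete, here is a brief sketch (not one of the page's own snippets) that loads the dynamically quantized model produced in the first example above and serves it through a standard Transformers pipeline; the question and context strings are placeholders.

from transformers import AutoTokenizer, pipeline
from optimum.intel.neural_compressor import INCModelForQuestionAnswering

# Load the INT8 model saved to "quantized_model" by the INCQuantizer example above
model = INCModelForQuestionAnswering.from_pretrained("quantized_model")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")

# The quantized model slots into the usual Transformers pipeline API
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)
result = qa_pipeline(
    question="Which library provides compression techniques such as quantization?",
    context="Intel Neural Compressor is an open-source library that implements quantization, pruning and knowledge distillation.",
)
print(result["answer"], round(result["score"], 3))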

Learn more about Optimum Intel


Get high performance on CPU instances

3rd Generation Intel® Xeon® Scalable processors offer a balanced architecture that delivers built-in AI acceleration (Intel DL Boost with VNNI) and advanced security capabilities. This allows you to place your Transformer workloads where they perform best while minimizing costs.
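One common way to tap that built-in acceleration from software is Intel Extension for PyTorch; the sketch below is a minimal illustration, assuming a Xeon CPU with bfloat16 support, a placeholder Hugging Face model, and that intel-extension-for-pytorch is installed.

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder model; any Transformers model for CPU inference works the same way
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()

# Apply CPU-specific operator fusions and prepare the model for bfloat16 execution
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Intel Xeon makes CPU inference fast.", return_tensors="pt")
# Run inference under bfloat16 autocast on the CPU
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))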

Learn more about Intel AI Hardware


Quickly go from concept to scale

With hardware and software optimized for AI workloads, an open, familiar, standards-based software environment, and the hardware flexibility to create the deployment you want, Intel can help accelerate your time to production.

Explore Intel Developer Zone


Accelerating Transformer Performance with Intel AI

Learn more about accelerating Hugging Face models with Intel hardware and software