Scaling Transformer Model Performance with Intel AI
Intel and Hugging Face are building powerful optimization tools to accelerate training and inference with Transformers. Learn how the Intel AI Software Portfolio and Xeon Scalable processors can help you achieve the best performance and productivity on your models.
Intel and Hugging Face are collaborating to build state-of-the-art hardware and software acceleration for training, fine-tuning, and running inference with Transformers. The hardware acceleration is driven by the Intel Xeon Scalable CPU platform, and the software acceleration by a rich suite of optimized AI software tools, frameworks, and libraries.
from transformers import AutoModelForQuestionAnswering
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer, INCModelForQuestionAnswering
model_name = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# The directory where the quantized model will be saved
save_dir = "quantized_model"
# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
# Apply dynamic quantization and save the resulting model
quantizer.quantize(quantization_config=quantization_config, save_directory=save_dir)
# Load the resulting quantized model, which can be hosted on the HF hub or locally
loaded_model = INCModelForQuestionAnswering.from_pretrained(save_dir)
from transformers import AutoFeatureExtractor, pipeline
from optimum.intel.openvino import OVModelForImageClassification
model_id = "google/vit-base-patch16-224"
# Load a model from the HF hub and convert it to the OpenVINO format
# Recent Optimum Intel releases use export=True; older releases used from_transformers=True
model = OVModelForImageClassification.from_pretrained(model_id, export=True)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
cls_pipeline = pipeline("image-classification", model=model, feature_extractor=feature_extractor)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
# Run inference with OpenVINO Runtime using Transformers pipelines
outputs = cls_pipeline(url)
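The image-classification pipeline returns a list of dictionaries, each with a "label" and a "score". As a minimal sketch of how such a result might be consumed, the helper below picks the highest-scoring entry. The name best_prediction, the min_score parameter, and the sample values are hypothetical and only illustrate the shape of the data; they are not real model output.

```python
from typing import Dict, List, Optional, Union

Prediction = Dict[str, Union[str, float]]


def best_prediction(
    predictions: List[Prediction], min_score: float = 0.0
) -> Optional[Prediction]:
    """Return the highest-scoring prediction, or None if none reaches min_score."""
    candidates = [p for p in predictions if p["score"] >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p["score"])


# Placeholder values shaped like pipeline output; not real model results.
example_outputs = [
    {"label": "label_a", "score": 0.91},
    {"label": "label_b", "score": 0.06},
]
top = best_prediction(example_outputs, min_score=0.5)
```

In practice you would pass the pipeline's own return value instead of example_outputs, and choose a threshold that suits your application.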
Easily optimize models for production
Optimum Intel is the interface between Hugging Face's Transformers library and the tools and libraries Intel provides to accelerate end-to-end pipelines on Intel architectures. Intel Neural Compressor is an open-source library that supports the most popular compression techniques, such as quantization, pruning, and knowledge distillation. OpenVINO is an open-source toolkit for optimizing and deploying models with high-performance inference on Intel devices. With Optimum Intel, you can apply state-of-the-art optimization techniques to your Transformer models with minimal effort.
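To make the idea of quantization concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration only, not how Intel Neural Compressor is implemented: the function names and sample weights are hypothetical, and real libraries work on tensors, often with per-channel scales. In the "dynamic" approach, weights are typically quantized ahead of time while the scale for activations is computed at inference time.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to signed 8-bit codes."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; use 1.0 for an empty or all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from int8 codes and their scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -1.27, 0.635, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Round-to-nearest keeps each value within half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing the integer codes plus one scale takes roughly a quarter of the memory of 32-bit floats, at the cost of the small rounding error measured by max_error.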
Get high performance on CPU instances
3rd Generation Intel® Xeon® Scalable processors offer a balanced architecture that delivers built-in AI acceleration and advanced security capabilities. This allows you to place your transformer workloads where they perform best while minimizing costs.
Quickly go from concept to scale
With hardware and software optimized for AI workloads, an open, familiar, standards-based software environment, and the hardware flexibility to build the deployment you want, Intel can help you reach production faster.