Transformers

Efficient Inference on CPU

This guide focuses on inferencing large models efficiently on CPU.

PyTorch JIT-mode (TorchScript)

TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency. Comparing to default eager mode, jit mode in PyTorch normally yields better performance for model inference from optimization methodologies like operator fusion.

For a gentle introduction to TorchScript, see the Introduction to PyTorch TorchScript tutorial.

IPEX Graph Optimization with JIT-mode

Intel® Extension for PyTorch provides further optimizations in jit mode for Transformers series models. It is highly recommended for users to take advantage of Intel® Extension for PyTorch with jit mode. Some frequently used operator patterns from Transformers models are already supported in Intel® Extension for PyTorch with jit mode fusions. Those fusion patterns like Multi-head-attention fusion, Concat Linear, Linear+Add, Linear+Gelu, Add+LayerNorm fusion and etc. are enabled and perform well. The benefit of the fusion is delivered to users in a transparent fashion. According to the analysis, ~70% of most popular NLP tasks in question-answering, text-classification, and token-classification can get performance benefits with these fusion patterns for both Float32 precision and BFloat16 Mixed precision.

Check more detailed information for IPEX Graph Optimization.

IPEX installation:

IPEX release is following PyTorch, check the approaches for IPEX installation.

Usage of JIT-mode

To enable jit mode in Trainer, users should add `jit_mode_eval` in Trainer command arguments.

Take an example of the use cases on Transformers question-answering

Inference using jit mode on CPU:

python run_qa.py \
--model_name_or_path csarron/bert-base-uncased-squad-v1 \
--dataset_name squad \
--do_eval \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/ \
--no_cuda \
--jit_mode_eval

Inference with IPEX using jit mode on CPU:

python run_qa.py \
--model_name_or_path csarron/bert-base-uncased-squad-v1 \
--dataset_name squad \
--do_eval \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/ \
--no_cuda \
--use_ipex \
--jit_mode_eval