GPU
GPUs are the standard hardware for machine learning because they’re optimized for memory bandwidth and parallelism. As model sizes keep growing, it’s more important than ever to make sure GPUs can handle them efficiently and deliver the best possible performance.
This guide will demonstrate a few ways to optimize inference on a GPU. The optimization methods shown below can be combined with each other to achieve even better performance, and they also work for distributed GPUs.
bitsandbytes
bitsandbytes is a quantization library that supports 8-bit and 4-bit quantization. Quantization represents weights in a lower precision than the original full-precision format. This reduces memory requirements and makes it easier to fit large models into memory.
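As a rough back-of-the-envelope illustration of the savings (weights only; activations, the KV cache, and quantization overhead are not counted), the weight memory of an 8B-parameter model scales with the number of bytes per weight:

```python
# approximate weight memory for an 8B-parameter model at different precisions
params = 8e9
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")  # ~16, ~8, and ~4 GB
```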
Make sure bitsandbytes and Accelerate are installed first.
pip install bitsandbytes accelerate
For text generation with 8-bit quantization, use generate() instead of the high-level Pipeline API. The Pipeline is slower because it isn’t optimized for 8-bit models, and some sampling strategies (such as nucleus sampling) aren’t supported either.
Set up a BitsAndBytesConfig and set load_in_8bit=True to load a model in 8-bit precision. The BitsAndBytesConfig is passed to the quantization_config parameter in from_pretrained(). Allow Accelerate to automatically distribute the model across your available hardware by setting device_map="auto". Place all inputs on the same device as the model.
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", quantization_config=quantization_config)
prompt = "Hello, my llama is cute"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
For distributed setups, use the max_memory parameter to create a mapping of the amount of memory to allocate to each GPU. The example below distributes 16GB of memory to the first GPU and 16GB of memory to the second GPU.
max_memory_mapping = {0: "16GB", 1: "16GB"}
model_8bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B", device_map="auto", quantization_config=quantization_config, max_memory=max_memory_mapping
)
Learn more about the concepts underlying 8-bit quantization in the Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes blog post.
Optimum
Optimum is a Hugging Face library focused on optimizing model performance across various hardware. It supports ONNX Runtime (ORT), a model accelerator, for a wide range of hardware and frameworks including NVIDIA GPUs and AMD GPUs that use the ROCm stack.
ORT uses optimization techniques, such as fusing common operations into a single node and constant folding, to reduce the number of computations. ORT also intelligently distributes the workload between the GPU and CPU by placing the most computationally intensive operations on the GPU and the rest on the CPU.
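Optimum and ORTModel configure this for you, but as a point of reference, a minimal sketch of an equivalent raw ONNX Runtime session might look like the following (the model.onnx path is a placeholder for an exported model):

```python
import onnxruntime as ort

# enable graph optimizations such as constant folding and operator fusion
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# providers are tried in order: supported ops run on the GPU, the rest fall back to the CPU
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to an exported ONNX model
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```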
Optimum provides the ORTModel class for loading ONNX models. Set the provider parameter according to the table below.
| provider | hardware |
|---|---|
| CUDAExecutionProvider | CUDA-enabled GPUs |
| ROCMExecutionProvider | AMD Instinct, Radeon Pro, Radeon GPUs |
| TensorrtExecutionProvider | TensorRT |
For example, load the distilbert/distilbert-base-uncased-finetuned-sst-2-english checkpoint for sequence classification. This checkpoint contains a model.onnx file. If a checkpoint doesn’t have a model.onnx file, set export=True to convert the checkpoint to the ONNX format on the fly.
from optimum.onnxruntime import ORTModelForSequenceClassification
ort_model = ORTModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased-finetuned-sst-2-english",
#export=True,
provider="CUDAExecutionProvider",
)
Now you can use the model for inference in a Pipeline.
from optimum.pipelines import pipeline
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
pipeline = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
result = pipeline("Both the music and visual were astounding, not to mention the actors performance.")
Learn more about using ORT with Optimum in the Accelerated inference on NVIDIA GPUs and Accelerated inference on AMD GPUs guides.
BetterTransformer
BetterTransformer accelerates inference through fastpath execution, which runs specialized Transformers functions directly at the hardware level, such as on a GPU. There are two main components of the fastpath execution.
- fusing multiple operations into a single kernel for faster and more efficient execution
- skipping unnecessary computation of padding tokens with nested tensors (see the sketch after this list)
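As a conceptual sketch of the second point (the shapes below are made up, and torch.nested is a prototype PyTorch API), nested tensors store variable-length sequences without materializing any padding positions:

```python
import torch

# two sequences of different lengths, stored without padding
seqs = [torch.randn(3, 8), torch.randn(7, 8)]
nested = torch.nested.nested_tensor(seqs)

print(nested.is_nested)                    # True
print([t.shape for t in nested.unbind()])  # original lengths are preserved, no pad positions
```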
Some BetterTransformer features are being upstreamed to Transformers with default support for native torch.nn.functional.scaled_dot_product_attention (SDPA). BetterTransformer has wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to natively support SDPA in Transformers.
BetterTransformer is available through Optimum with to_bettertransformer().
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")
model = model.to_bettertransformer()
Call reverse_bettertransformer() to return the model to the original Transformers model before saving it.
model = model.reverse_bettertransformer()
model.save_pretrained("saved_model")
Refer to the benchmarks in Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0 for BetterTransformer and scaled dot product attention performance. The BetterTransformer blog post also discusses fastpath execution in greater detail if you’re interested in learning more.
Scaled dot product attention (SDPA)
PyTorch’s torch.nn.functional.scaled_dot_product_attention (SDPA) is a native implementation of the scaled dot product attention mechanism. SDPA is a more efficient and optimized version of the attention mechanism used in transformer models.
There are three supported implementations available.
- FlashAttention2 only supports models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate type first.
- xFormers or Memory-Efficient Attention is able to support models with the fp32 torch type.
- C++ implementation of scaled dot product attention
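As a minimal standalone sketch of the underlying PyTorch call (the tensor shapes and dtype are illustrative), SDPA takes query, key, and value tensors and dispatches to the best available backend:

```python
import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim) -- illustrative shapes
query = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
key = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
value = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# PyTorch dispatches to FlashAttention2, memory-efficient attention, or the C++ implementation
out = F.scaled_dot_product_attention(query, key, value, is_causal=True)
```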
SDPA is used by default for PyTorch v2.1.1 and greater when an implementation is available, but you can also explicitly enable it by setting attn_implementation="sdpa" in from_pretrained(). Certain attention parameters, such as head_mask and output_attentions=True, are unsupported and return a warning that Transformers will fall back to the (slower) eager implementation.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", attn_implementation="sdpa")
SDPA selects the most performant implementation available, but you can also explicitly select an implementation with torch.nn.attention.sdpa_kernel as a context manager. The example below shows how to enable the FlashAttention2 implementation with SDPBackend.FLASH_ATTENTION.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto").to("cuda")
input_text = "Hello, my llama is cute"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
If you encounter the following RuntimeError, try installing the nightly version of PyTorch, which has broader coverage for FlashAttention.
RuntimeError: No available kernel. Aborting execution.
pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
FlashAttention
FlashAttention is also available as a standalone package. It can significantly speed up inference by:
- additionally parallelizing the attention computation over sequence length
- partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them
Install FlashAttention first for the hardware you’re using.
pip install flash-attn --no-build-isolation
Enable FlashAttention2 by setting attn_implementation="flash_attention_2" in from_pretrained(). FlashAttention2 is only supported for models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate data type first.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")
Benchmarks
FlashAttention2 speeds up inference considerably, especially for inputs with long sequences. However, since FlashAttention2 doesn’t support computing attention scores with padding tokens, you must manually pad and unpad the attention scores for batched inference when a sequence contains padding tokens. The downside is that batched generation is slower with padding tokens.
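For reference, a minimal sketch of batched generation with left padding might look like the following (the prompts, pad-token choice, and max_new_tokens are illustrative assumptions, and a single GPU is assumed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# left padding keeps the prompt tokens adjacent to the newly generated tokens
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no dedicated pad token

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

prompts = ["Hello, my llama is cute", "The quick brown fox"]
# padding=True inserts pad tokens so the shorter prompt matches the longer one
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```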
With a relatively small sequence length, a single forward pass creates overhead leading to a small speed up. The graph below shows the expected speed up for a single forward pass with meta-llama/Llama-7b-hf with padding.

To avoid this slowdown, use FlashAttention2 without padding tokens in the sequence during training. Pack the dataset or concatenate sequences until reaching the maximum sequence length.
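A toy sketch of packing might look like the following (the tokenizer checkpoint, block size, example texts, and helper name are illustrative, not part of the guide):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
block_size = 4096  # illustrative maximum sequence length

def pack(texts):
    # tokenize every example, join them with the EOS token, then split into full blocks
    ids = []
    for text in texts:
        ids.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
    return [ids[i : i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

packed = pack(["first training document...", "second training document..."])
```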
The graph below shows the expected speed up for a single forward pass with tiiuae/falcon-7b with a sequence length of 4096 and various batch sizes without padding tokens.
