GPU

GPUs are the standard hardware for machine learning because they’re optimized for memory bandwidth and parallelism. With the increasing sizes of modern models, it’s more important than ever to make sure GPUs are capable of efficiently handling and delivering the best possible performance.

This guide will demonstrate a few ways to optimize inference on a GPU. The optimization methods shown below can be combined with each other to achieve even better performance, and they also work for distributed GPUs.

bitsandbytes

bitsandbytes is a quantization library that supports 8-bit and 4-bit quantization. Quantization represents weights in a lower precision compared to the original full precision format. It reduces memory requirements and makes it easier to fit large model into memory.

Make sure bitsandbytes and Accelerate are installed first.

pip install bitsandbytes accelerate

8-bit

4-bit

Optimum

Optimum is a Hugging Face library focused on optimizing model performance across various hardware. It supports ONNX Runtime (ORT), a model accelerator, for a wide range of hardware and frameworks including NVIDIA GPUs and AMD GPUs that use the ROCm stack.

ORT uses optimization techniques that fuse common operations into a single node and constant folding to reduce the number of computations. ORT also places the most computationally intensive operations on the GPU and the rest on the CPU to intelligently distribute the workload between the two devices.

Optimum provides the ORTModel class for loading ONNX models. Set the provider parameter according to the table below.

provider	hardware
CUDAExecutionProvider	CUDA-enabled GPUs
ROCMExecutionProvider	AMD Instinct, Radeon Pro, Radeon GPUs
TensorrtExecutionProvider	TensorRT

For example, load the distilbert/distilbert-base-uncased-finetuned-sst-2-english checkpoint for sequence classification. This checkpoint contains a model.onnx file. If a checkpoint doesn’t have a model.onnx file, set export=True to convert a checkpoint on the fly to the ONNX format.

from optimum.onnxruntime import ORTModelForSequenceClassification

ort_model = ORTModelForSequenceClassification.from_pretrained(
  "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
  #export=True,
  provider="CUDAExecutionProvider",
)

Now you can use the model for inference in a Pipeline.

from optimum.pipelines import pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
pipeline = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
result = pipeline("Both the music and visual were astounding, not to mention the actors performance.")

Learn more details about using ORT with Optimum in the Accelerated inference on NVIDIA GPUs and Accelerated inference on AMD GPUs guides.

Scaled dot product attention (SDPA)

PyTorch’s torch.nn.functional.scaled_dot_product_attention (SDPA) is a native implementation of the scaled dot product attention mechanism. SDPA is a more efficient and optimized version of the attention mechanism used in transformer models.

There are three supported implementations available.

FlashAttention2 only supports models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate type first.
xFormers or Memory-Efficient Attention is able to support models with the fp32 torch type.
C++ implementation of scaled dot product attention

SDPA is used by default for PyTorch v2.1.1. and greater when an implementation is available. You could explicitly enable SDPA by setting attn_implementation="sdpa" in from_pretrained() though. Certain attention parameters, such as output_attentions=True, are unsupported and returns a warning that Transformers will fall back to the (slower) eager implementation.

Refer to the AttentionInterface guide to learn how to change the attention implementation after loading a model.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", attn_implementation="sdpa")

# Change the model's attention dynamically after loading it
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto")
model.set_attention_implementation("sdpa")

SDPA selects the most performant implementation available, but you can also explicitly select an implementation with torch.nn.attention.sdpa_kernel as a context manager. The example below shows how to enable the FlashAttention2 implementation with enable_flash=True.

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto")

input_text = "Hello, my llama is cute"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If you encounter the following RuntimeError, try installing the nightly version of PyTorch which has broader coverage for FlashAttention.

RuntimeError: No available kernel. Aborting execution.

pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

FlashAttention

FlashAttention is also available as a standalone package. It can significantly speed up inference by:

additionally parallelizing the attention computation over sequence length
partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them

Install FlashAttention first for the hardware you’re using.

NVIDIA

AMD

Enable FlashAttention2 by setting attn_implementation="flash_attention_2" in from_pretrained() or by setting model.set_attention_implementation("flash_attention_2") to dynamically update the attention interface. FlashAttention2 is only supported for models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate data type first.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", dtype=torch.bfloat16, attn_implementation="flash_attention_2")

Benchmarks

FlashAttention2 speeds up inference considerably especially for inputs with long sequences. However, since FlashAttention2 doesn’t support computing attention scores with padding tokens, you must manually pad and unpad the attention scores for batched inference if a sequence contains padding tokens. The downside is batched generation is slower with padding tokens.

short sequence length

long sequence length

To avoid this slowdown, use FlashAttention2 without padding tokens in the sequence during training. Pack the dataset or concatenate sequences until reaching the maximum sequence length.

tiiuae/falcon-7b

meta-llama/Llama-7b-hf

Update on GitHub