GPU

GPUs are the standard hardware for machine learning because they’re optimized for memory bandwidth and parallelism. With the increasing sizes of modern models, it’s more important than ever to make sure GPUs can handle them efficiently and deliver the best possible performance.

This guide will demonstrate a few ways to optimize inference on a GPU. The optimization methods shown below can be combined with each other to achieve even better performance, and they also work for distributed GPUs.

bitsandbytes

bitsandbytes is a quantization library that supports 8-bit and 4-bit quantization. Quantization represents weights in a lower precision than the original full-precision format. It reduces memory requirements and makes it easier to fit large models into memory.
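
As a rough back-of-the-envelope example (weights only, ignoring activations and the KV cache), an 8B-parameter model needs about 16GB in fp16/bf16, 8GB in 8-bit, and 4GB in 4-bit:

num_params = 8e9  # parameter count of an 8B model such as Llama-3.1-8B

for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: ~{num_params * bytes_per_param / 1e9:.0f}GB for the weights alone")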

Make sure bitsandbytes and Accelerate are installed first.

pip install bitsandbytes accelerate

For text generation with 8-bit quantization, you should use generate() instead of the high-level Pipeline API. The Pipeline is slower because it isn’t optimized for 8-bit models, and some sampling strategies (like nucleus sampling) also aren’t supported.

Set up a BitsAndBytesConfig and set load_in_8bit=True to load a model in 8-bit precision. The BitsAndBytesConfig is passed to the quantization_config parameter in from_pretrained().

Allow Accelerate to automatically distribute the model across your available hardware by setting device_map="auto".

Place all inputs on the same device as the model.

from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", quantization_config=quantization_config)

prompt = "Hello, my llama is cute"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

For distributed setups, use the max_memory parameter to create a mapping of the amount of memory to allocate to each GPU. The example below distributes 16GB of memory to the first GPU and 16GB of memory to the second GPU.

max_memory_mapping = {0: "16GB", 1: "16GB"}
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", device_map="auto", quantization_config=quantization_config, max_memory=max_memory_mapping
)
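
Loading a model in 4-bit precision follows the same pattern; a minimal sketch with load_in_4bit=True (the nf4 quantization type and bfloat16 compute dtype shown here are common choices, not requirements):

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

quantization_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # normalized float 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matrix multiplications
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", device_map="auto", quantization_config=quantization_config_4bit
)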

Learn more about the concepts underlying 8-bit quantization in the Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes blog post.

Optimum

Optimum is a Hugging Face library focused on optimizing model performance across various hardware. It supports ONNX Runtime (ORT), a model accelerator, for a wide range of hardware and frameworks including NVIDIA GPUs and AMD GPUs that use the ROCm stack.

ORT uses optimization techniques such as fusing common operations into a single node and constant folding to reduce the number of computations. ORT also intelligently distributes the workload between the GPU and CPU by placing the most computationally intensive operations on the GPU and the rest on the CPU.
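
These graph optimizations happen inside onnxruntime. If you use onnxruntime directly, a SessionOptions object exposes the graph optimization level; the sketch below assumes an already exported model.onnx file and a CUDA-enabled GPU:

import onnxruntime as ort

sess_options = ort.SessionOptions()
# enable all graph optimizations, including operator fusion and constant folding
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# "model.onnx" is a placeholder path to an exported ONNX model
session = ort.InferenceSession(
    "model.onnx", sess_options, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)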

Optimum provides the ORTModel class for loading ONNX models. Set the provider parameter according to the table below.

provider                    hardware
CUDAExecutionProvider       CUDA-enabled GPUs
ROCMExecutionProvider       AMD Instinct, Radeon Pro, Radeon GPUs
TensorrtExecutionProvider   TensorRT

For example, load the distilbert/distilbert-base-uncased-finetuned-sst-2-english checkpoint for sequence classification. This checkpoint contains a model.onnx file. If a checkpoint doesn’t have a model.onnx file, set export=True to convert a checkpoint on the fly to the ONNX format.

from optimum.onnxruntime import ORTModelForSequenceClassification

ort_model = ORTModelForSequenceClassification.from_pretrained(
  "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
  #export=True,
  provider="CUDAExecutionProvider",
)

Now you can use the model for inference in a Pipeline.

from optimum.pipelines import pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
pipeline = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
result = pipeline("Both the music and visual were astounding, not to mention the actors performance.")
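
The result is a list with one dictionary per input containing the predicted label and its score (for this sentiment checkpoint, something like [{'label': 'POSITIVE', 'score': ...}]).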

Learn more details about using ORT with Optimum in the Accelerated inference on NVIDIA GPUs and Accelerated inference on AMD GPUs guides.

BetterTransformer

BetterTransformer accelerates inference with fastpath execution, which runs specialized Transformers functions directly at the hardware level, such as on a GPU. There are two main components of fastpath execution.

  • fusing multiple operations into a single kernel for faster and more efficient execution
  • skipping unnecessary computation of padding tokens with nested tensors

Some BetterTransformer features are being upstreamed to Transformers with default support for native torch.nn.functional.scaled_dot_product_attention (SDPA). BetterTransformer has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to natively support SDPA in Transformers.

BetterTransformer is available through Optimum with to_bettertransformer().

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")
model = model.to_bettertransformer()

To return to the original Transformers model, call reverse_bettertransformer() before saving it.

model = model.reverse_bettertransformer()
model.save_pretrained("saved_model")

Refer to the benchmarks in Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0 for BetterTransformer and scaled dot product attention performance. The BetterTransformer blog post also discusses fastpath execution in greater detail if you’re interested in learning more.

Scaled dot product attention (SDPA)

PyTorch’s torch.nn.functional.scaled_dot_product_attention (SDPA) is a native implementation of the scaled dot product attention mechanism. SDPA is a more efficient and optimized version of the attention mechanism used in transformer models.

There are three supported implementations available.

  • FlashAttention2 only supports models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate type first.
  • xFormers or Memory-Efficient Attention is able to support models with the fp32 torch type.
  • C++ implementation of scaled dot product attention
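
PyTorch exposes per-backend toggles that report whether each of these implementations is currently enabled; a minimal sketch, assuming a CUDA build of PyTorch 2.x:

import torch

# each flag reflects whether the corresponding SDPA backend is currently enabled
print("FlashAttention:", torch.backends.cuda.flash_sdp_enabled())
print("Memory-efficient attention:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("C++ math fallback:", torch.backends.cuda.math_sdp_enabled())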

SDPA is used by default for PyTorch v2.1.1 and greater when an implementation is available, but you can also explicitly enable it by setting attn_implementation="sdpa" in from_pretrained(). Certain attention parameters, such as head_mask and output_attentions=True, are unsupported and return a warning that Transformers will fall back to the (slower) eager implementation.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", attn_implementation="sdpa")

SDPA selects the most performant implementation available, but you can also explicitly select an implementation with torch.nn.attention.sdpa_kernel as a context manager. The example below shows how to enable the FlashAttention2 implementation with enable_flash=True.

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", torch_dtype=torch.bfloat16)

input_text = "Hello, my llama is cute"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If you encounter the following RuntimeError, try installing the nightly version of PyTorch, which has broader coverage for FlashAttention.

RuntimeError: No available kernel. Aborting execution.

pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

FlashAttention

FlashAttention is also available as a standalone package. It can significantly speed up inference by:

  1. additionally parallelizing the attention computation over sequence length
  2. partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them

Install FlashAttention first for the hardware you’re using.

For example, on NVIDIA GPUs:

pip install flash-attn --no-build-isolation

Enable FlashAttention2 by setting attn_implementation="flash_attention_2" in from_pretrained(). FlashAttention2 is only supported for models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate data type first.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")

Benchmarks

FlashAttention2 speeds up inference considerably, especially for inputs with long sequences. However, since FlashAttention2 doesn’t support computing attention scores with padding tokens, you must manually pad and unpad the attention scores for batched inference if a sequence contains padding tokens. The downside is that batched generation is slower with padding tokens.
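
For batched generation with a decoder-only model, padding is typically applied on the left so that generation continues from real tokens; a minimal sketch, assuming the tokenizer has no pad token of its own:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # reuse the end-of-sequence token as the padding token

inputs = tokenizer(
    ["Hello, my llama is cute", "A much longer prompt that does not need any padding at all"],
    return_tensors="pt",
    padding=True,
).to("cuda")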

With a relatively small sequence length, a single forward pass creates overhead, leading to only a small speedup. The graph below shows the expected speedup for a single forward pass with meta-llama/Llama-7b-hf with padding.

To avoid this slowdown, use FlashAttention2 without padding tokens in the sequence during training. Pack the dataset or concatenate sequences until reaching the maximum sequence length.
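
A simple way to pack an already tokenized dataset is to concatenate all token ids and split them into fixed-length chunks; a minimal sketch with placeholder token ids and a small max_length for illustration:

max_length = 8  # for illustration; in practice use the model's maximum sequence length

# placeholder token id lists from an already tokenized dataset (no padding)
tokenized_examples = [[1, 15043, 29892, 590, 11148], [1, 910, 338, 263, 1243, 29889]]

concatenated = [token_id for example in tokenized_examples for token_id in example]
# split into fixed-length chunks, dropping a trailing chunk shorter than max_length
packed = [
    concatenated[i : i + max_length]
    for i in range(0, len(concatenated) - max_length + 1, max_length)
]
print(packed)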

The graph below shows the expected speedup for a single forward pass with tiiuae/falcon-7b with a sequence length of 4096 and various batch sizes, without padding tokens.
