GPU

GPUs are commonly used to train deep learning models because of their high memory bandwidth and parallel processing capabilities. Depending on your GPU and model size, you can even train models with billions of parameters. The key is to balance GPU memory usage against training speed (data throughput and training time).

This guide will show you the features available in Transformers and PyTorch for efficiently training a model on GPUs. In many cases, you’ll want to use a combination of these features to optimize training.

Refer to the table below to quickly identify the features relevant to your training scenario.

Feature	Training speed	Memory usage
batch size	yes	yes
gradient accumulation	no	yes
gradient checkpointing	no	yes
mixed precision	yes	depends
optimizers	yes	yes
data preloading	yes	no
torch_empty_cache_steps	no	yes
torch.compile	yes	no
scaled dot product attention (SDPA)	yes	yes

Trainer

Trainer supports many useful training features that can be configured through TrainingArguments. This section highlights some of the more important features for optimizing training.

Batch size

Batch size is one of the most important hyperparameters for efficient GPU training because it affects memory usage and training speed. Larger batch sizes lead to faster training because they take advantage of a GPU’s parallel processing power. It is recommended to use batch sizes that are powers of 2, such as 8, 64, 128, 256, or 512. The batch size you can use depends on your GPU and the model’s data type.

Configure per_device_train_batch_size in TrainingArguments.

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
)

Refer to the NVIDIA Performance guide to learn more about how the number of input features, the number of output neurons, and the batch size affect performance. These parameters determine the dimensions of the General Matrix Multiplications (GEMMs) performed by the GPU. Larger values are better for parallelization and efficiency.

The Tensor Core Requirements section is also useful for selecting a batch size that maximizes the speed of tensor multiplication based on the data type and GPU. For example, multiples of 8 are recommended for fp16, unless it’s an A100 GPU, in which case use multiples of 64.

Finally, consider Dimension Quantization Effects for smaller parameters. Tile quantization occurs when matrix dimensions aren’t divisible by a GPU’s thread block tile size, causing the GPU to underutilize its resources. Selecting a batch size multiplier such that the matrix is divisible by the tile size can significantly speed up training.
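As a rough illustration of tile quantization, the sketch below rounds a dimension up to a chosen multiple and estimates how much of the launched tile space holds real data. The helper names and the tile size of 128 are assumptions made for this example; actual tile sizes depend on the GPU and the kernel that is selected.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple (for example, 8 for fp16)."""
    return (value + multiple - 1) // multiple * multiple


def tile_utilization(dim: int, tile: int) -> float:
    """Fraction of the tiles covering `dim` that holds real data.

    Whole tiles are launched, so a dimension of 257 with a tile of 128
    needs 3 tiles (384 slots) even though only 257 slots are used.
    """
    tiles = (dim + tile - 1) // tile
    return dim / (tiles * tile)


print(round_up_to_multiple(250, 8))  # 256
print(tile_utilization(256, 128))    # 1.0
print(tile_utilization(257, 128))    # 257 / 384, roughly 0.67
```

In this toy model, going from 256 to 257 drops utilization from 100% to about 67%, which is why a dimension just past a tile boundary can be disproportionately slow.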

Gradient accumulation

Gradient accumulation overcomes memory constraints, which is useful for training a very large model that otherwise wouldn’t fit on a single GPU, by accumulating gradients over multiple mini-batches before updating the parameters. Only a small mini-batch is processed at a time, which reduces memory usage, while the accumulated gradients give a larger effective batch size. Without gradient accumulation, the parameters are updated after every single batch of data. Training can slow down, though, because of the additional forward and backward passes introduced by gradient accumulation.

Configure gradient_accumulation_steps, together with per_device_train_batch_size, in TrainingArguments to enable gradient accumulation.

from transformers import TrainingArguments

# effective batch size of 64
args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
)

Avoid too many gradient accumulation steps because they can significantly slow down training. Consider the example below, where the maximum batch size that fits on your GPU is 4. Keep your batch size at 4 and use fewer accumulation steps to better utilize the GPU.

batch size	gradient accumulation steps	effective batch size
1	64	64	👎
4	16	64	👍
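To see why accumulation gives the same update as one large batch, the sketch below compares the gradient of a full batch with the gradient accumulated over equally sized mini-batches. It uses a one-parameter toy model in plain Python; the function name and data are made up for illustration and are not how Trainer implements accumulation.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

# Gradient computed from all 8 samples at once.
full_grad = mse_grad(w, xs, ys)

# Same gradient, accumulated over 4 mini-batches of 2 samples each.
micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    # Scale each mini-batch gradient so the sum is the mean over all samples.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

effective_batch_size = micro_batch_size * accumulation_steps
print(full_grad, accumulated, effective_batch_size)
```

Only 2 samples are processed at a time, yet the parameter update matches a batch of 8. The cost is 4 forward and backward passes per update instead of 1.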

Gradient checkpointing

Gradient checkpointing reduces memory usage by storing only some of the intermediate activations from the forward pass and recomputing the rest during the backward pass. This avoids storing all of the intermediate activations, which can require a lot of memory. However, it comes at the cost of slower training (~20%).

Configure gradient_checkpointing in TrainingArguments to enable gradient checkpointing.

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
)
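The trade-off can be illustrated with a toy chain of scalar layers in plain Python. This is a conceptual sketch only: the layer function, the checkpoint interval, and the way stored activations are counted are assumptions for this example, not the mechanism PyTorch or Transformers uses internally.

```python
import math


def forward_layer(w, h):
    return math.tanh(w * h)


def grad_full(weights, x):
    """Backward pass that keeps every activation from the forward pass."""
    acts = [x]
    for w in weights:
        acts.append(forward_layer(w, acts[-1]))
    grad = 1.0
    for i in reversed(range(len(weights))):
        # d tanh(w * h) / dh = w * (1 - tanh(w * h) ** 2)
        grad *= weights[i] * (1.0 - acts[i + 1] ** 2)
    return grad, len(acts)


def grad_checkpointed(weights, x, every):
    """Keep only every `every`-th activation and recompute the rest."""
    checkpoints = {0: x}
    h = x
    for i, w in enumerate(weights):
        h = forward_layer(w, h)
        if (i + 1) % every == 0 and (i + 1) < len(weights):
            checkpoints[i + 1] = h
    grad = 1.0
    peak_stored = len(checkpoints)
    end = len(weights)
    for start in sorted(checkpoints, reverse=True):
        # Recompute the activations of one segment from its checkpoint.
        seg = [checkpoints[start]]
        for i in range(start, end):
            seg.append(forward_layer(weights[i], seg[-1]))
        # Checkpoints plus the recomputed activations of this segment.
        peak_stored = max(peak_stored, len(checkpoints) + len(seg) - 1)
        for i in reversed(range(start, end)):
            grad *= weights[i] * (1.0 - seg[i - start + 1] ** 2)
        end = start
    return grad, peak_stored


weights = [0.9, -1.1, 0.7, 1.3, -0.8, 1.05, 0.95, -1.2, 0.6, 1.1, -0.9, 1.0]
g_full, stored_full = grad_full(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, every=4)
print(stored_full, stored_ckpt)  # 13 7
```

Both functions return the same gradient, but the checkpointed version holds at most 7 activations at once instead of 13, at the price of running each layer’s forward computation a second time.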

Mixed precision

Mixed precision accelerates training by performing some calculations in half-precision (fp16) and some in full-precision (fp32). The half-precision calculations boost training speed because they’re not as computationally expensive as full-precision calculations. Meanwhile, keeping some of the calculations in full-precision maintains accuracy.

There are several data types available for mixed precision training: fp16, bf16, and tf32.
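To illustrate why some calculations are kept in full precision, the sketch below emulates fp16 rounding with Python’s struct module (the "e" format is IEEE 754 half precision). The to_fp16 helper and the chosen values are assumptions for this example; it does not reflect how mixed precision is implemented in PyTorch.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weight = 1.0
update = 1e-4  # a small parameter update

# Near 1.0, fp16 values are spaced 2 ** -10 (about 0.00098) apart,
# so the small update is rounded away entirely.
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))

# With more precision, the update is preserved. Python floats are fp64,
# but fp32 has enough precision to keep this update as well.
full_result = weight + update

print(fp16_result)               # 1.0
print(full_result > weight)      # True
```

An update that vanishes in half precision still changes the weight when the addition is done in higher precision, which is the reason for keeping part of the computation in fp32.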

Optimizers

Transformers uses the AdamW (adamw_torch) optimizer from PyTorch by default. Because it stores weighted averages of past gradients, it requires additional memory proportional to the number of model parameters. This can be an issue when training very large models, and in such cases, you should consider choosing a different optimizer. For example, if you have Apex installed on either NVIDIA or AMD, then the adamw_apex_fused optimizer provides the fastest training of all the AdamW optimizers.

Configure optim in TrainingArguments to choose an optimizer.

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    bf16=True,
    optim="adamw_bnb_8bit"
)

There are many optimizers to choose from, depending on your training scenario (refer to OptimizerNames for the full list of supported optimizers). For example, Adafactor can significantly reduce memory requirements by storing a weighted average per row or column instead of for each element in the matrix, at the cost of slower convergence. Another example is the 8-bit AdamW optimizer from bitsandbytes, which quantizes the optimizer states. The optimizer state is stored in lower precision and dequantized before being used in the optimizer step.

Refer to the optimizer guide to learn more about specialized optimizers.
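A back-of-the-envelope estimate shows how much the optimizer choice matters. The sketch below assumes AdamW keeps two running averages per parameter and that an 8-bit optimizer stores each of them in 1 byte; it ignores the small overhead of quantization constants. The 8B parameter count and the helper names are assumptions for illustration.

```python
def adamw_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """AdamW keeps two running averages (first and second moment) per parameter."""
    return 2 * num_params * bytes_per_value


def adafactor_second_moment_values(rows: int, cols: int) -> int:
    """Adafactor keeps one statistic per row and one per column of a matrix."""
    return rows + cols


num_params = 8_000_000_000  # assumed 8B-parameter model

fp32_gib = adamw_state_bytes(num_params, 4) / 1024**3
int8_gib = adamw_state_bytes(num_params, 1) / 1024**3
print(f"AdamW fp32 states:  {fp32_gib:.1f} GiB")  # about 59.6 GiB
print(f"AdamW 8-bit states: {int8_gib:.1f} GiB")  # about 14.9 GiB

# Second-moment values for a single 4096 x 4096 weight matrix.
print(4096 * 4096, adafactor_second_moment_values(4096, 4096))  # 16777216 8192
```

Under these assumptions, the optimizer states alone shrink by a factor of 4 with 8-bit storage, and Adafactor’s factored statistics for a square matrix are smaller by several orders of magnitude.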

Data preloading

Data preloading loads and prepares batches of data in advance on the CPU to ensure the GPU is continuously working, reducing GPU idling and increasing utilization. There are two ways to preload data to ensure the GPU is always working.

Allocate pinned memory on the CPU to store the data and transfer it directly to the GPU.
Increase the number of CPU threads or workers to preload the data faster.

Configure dataloader_pin_memory and dataloader_num_workers in TrainingArguments to allocate pinned memory and increase the number of workers.

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    bf16=True,
    optim="adamw_bnb_8bit",
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
)
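The idea of preparing batches in advance can be sketched without PyTorch. The example below uses a background thread and a bounded queue so that the next batches are ready while the current one is being consumed. The prefetch and load_batches helpers are hypothetical; PyTorch’s DataLoader uses worker processes and pinned memory, which this sketch does not model.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def worker():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(sentinel)   # signal that the data is exhausted

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            break
        yield item


def load_batches(n):
    """Stand-in for CPU-side loading and preprocessing (hypothetical)."""
    for i in range(n):
        yield [i * 4 + j for j in range(4)]


consumed = [batch for batch in prefetch(load_batches(3))]
print(consumed)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

The bounded buffer plays a role loosely similar to the number of batches kept ready by the workers: the consumer rarely waits as long as loading keeps pace with consumption.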

PyTorch

PyTorch provides several features for reducing memory requirements and increasing training speed. These features can often be enabled in Transformers by adding only a few lines of code.

torch_empty_cache_steps

The torch.cuda.empty_cache function releases unused cached memory, which can help avoid out-of-memory (OOM) errors at the cost of ~10% slower training.

Set torch_empty_cache_steps in TrainingArguments to call it after a certain number of training steps.

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    bf16=True,
    optim="adamw_bnb_8bit",
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
    torch_empty_cache_steps=4,
)

torch.compile

torch.compile compiles PyTorch code into optimized kernels that significantly speed up training. This feature relies on TorchDynamo to capture PyTorch graphs with the Frame Evaluation API. The graph can be further compiled into optimized kernels for different backends.

Configure torch_compile in TrainingArguments to enable it, and configure torch_compile_backend to select a backend.

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    bf16=True,
    optim="adamw_bnb_8bit",
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
    torch_empty_cache_steps=4,
    torch_compile=True,
    torch_compile_backend="inductor"
)

Refer to the table below to help you choose the right backend for your training scenario.

backend	description	goal
eager	uses PyTorch to run extracted GraphModule	debugging
aot_eager	uses PyTorch eager mode for AOTAutograd’s extracted forward and backward graphs	debugging
inductor	uses TorchInductor with AOTAutograd and CUDA Graphs by leveraging Triton kernels	training and inference
nvfuser	uses nvFuser with TorchScript	training and inference
aot_nvfuser	uses nvFuser with AOTAutograd	training and inference
aot_cudagraphs	uses CUDA Graphs with AOTAutograd	training and inference
ofi	uses TorchScript’s optimize_for_inference	inference
fx2trt	uses Torch-TensorRT	inference
onnxrt	uses ONNX-RT for CPU and GPU inference	inference
ipex	uses IPEX for CPU inference	inference

Scaled dot product attention

torch.nn.functional.scaled_dot_product_attention (SDPA) is a native PyTorch implementation of the scaled dot product attention mechanism. SDPA is more efficient and optimized than the original attention mechanism in transformer models. It supports three types of scaled dot product attention.

FlashAttention2 is automatically enabled for models with the fp16 or bf16 torch type. Make sure to cast your model to the appropriate type first.
xFormers or Memory-Efficient Attention supports models with the fp32 torch type.
C++ implementation of scaled dot product attention.

SDPA is enabled by default for PyTorch 2.1.1+, but it can be explicitly enabled by setting attn_implementation="sdpa" in from_pretrained().

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto", attn_implementation="sdpa")
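For reference, the operation that SDPA computes is softmax(QK^T / sqrt(d))V. The sketch below is a plain-Python version for lists of vectors, written only to show the arithmetic. It has no masking, dropout, or batching, and it is not how the fused PyTorch kernels are implemented.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores.
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]    # identical keys, so equal attention weights
values = [[2.0, 0.0], [4.0, 2.0]]
print(scaled_dot_product_attention(queries, keys, values))  # [[3.0, 1.0]]
```

Because both keys are identical, each value receives a weight of 0.5 and the output is the average of the two value vectors.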
