Transformers documentation
DDP
Get started
Base classes
Models
Preprocessors
Inference
Pipeline API
Generate API
Optimization
Chat with models
Serving
Training
Get started
Customization
Performance
Distributed training
Accelerator selectionAccelerateDDPFSDP2DeepSpeed ZeROUlysses sequence parallelismTensor parallelismDebuggingParallelism methods
Hardware
Quantization
Ecosystem integrations
Resources
API
You are viewing main version, which requires installation from source. If you'd like
regular pip install, checkout the latest stable version (v5.12.0).
DDP
DistributedDataParallel (DDP) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs so every model copy stays identical. Use DDP when your model fits on a single GPU.
βββββββββββββββββββ
β training data β
ββββββββββ¬βββββββββ
ββββββββββββββββββββΌβββββββββββββββββββ
β shard 0 β shard 1 β shard 2
βΌ βΌ βΌ
βββββββββββββββ βββββββββββββββ βββββββββββββββ
β model β β model β β model β
β (copy 0) β β (copy 1) β β (copy 2) β
β GPU 0 β β GPU 1 β β GPU 2 β
ββββββββ¬βββββββ ββββββββ¬βββββββ ββββββββ¬βββββββ
β grads β grads β grads
ββββββββββββββββββββΌβββββββββββββββββββ
all-reduce
(average gradients)
ββββββββββββββββββββΌβββββββββββββββββββ
βΌ βΌ βΌ
βββββββββββββββ βββββββββββββββ βββββββββββββββ
β optimizer β β optimizer β β optimizer β
β step β β step β β step β
βββββββββββββββ βββββββββββββββ βββββββββββββββ
(identical) (identical) (identical)DDP activates automatically when you launch with a multi-process launcher like Accelerate.
# 4 GPUs on one machine
accelerate launch --num_processes 4 train.pyConfigure DDP
Pass these TrainingArguments to control DDP behavior.
gradient_accumulation_steps()determines when to perform the all-reduce. Trainer skips the all-reduce on intermediate accumulation steps and runs it only on the final micro-batch. For example, withgradient_accumulation_steps=4, the all-reduce runs every 4 backward passes.~TrainingArguments.ddp_find_unused_parameterstraverses the autograd graph at the end of the forward pass for parameters that wonβt receive a gradient and marks them as ready so they donβt block the all-reduce. Donβt use withgradient_checkpointing()because gradient checkpointing discards intermediate activations and recomputes them on the fly.~TrainingArguments.ddp_bucket_cap_mbis the bucket size for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.~TrainingArguments.ddp_broadcast_bufferssynchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Donβt use withgradient_checkpointing().~TrainingArguments.ddp_backendsets the communication backend. Use"nccl"for NVIDIA GPUs (default and fastest),"gloo"for CPU training or debugging, and"xccl","hccl", or"cncl"for other hardware.ddp_timeout()sets the time limit for all processes and operations (all-reduce, broadcast) to complete. If a process hangs, like when loading a large model slowly, the timeout raises an error instead of blocking indefinitely.
from transformers import TrainingArguments
args = TrainingArguments(
...,
gradient_accumulation_steps=4,
ddp_backend="nccl",
ddp_find_unused_parameters=False,
ddp_bucket_cap_mb=25,
ddp_broadcast_buffers=True,
ddp_timeout=1800,
)Next steps
- See FSDP for training models too large to fit on a single GPU.
- See DeepSpeed for ZeRO optimization and offloading.
- Read the Data Parallelism chapter from The Ultra-Scale Playbook for more information about how DDP works.