Multi-GPU Training

This guide shows you how to train policies on multiple GPUs using Hugging Face Accelerate.

Installation

First, ensure you have accelerate installed:

pip install accelerate
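
To confirm the installation and see what accelerate detects on your machine, you can optionally run (a quick check, not required for training):

accelerate env

This prints the installed accelerate and PyTorch versions along with basic system information, which is also useful when reporting issues.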

Training with Multiple GPUs

You can launch training in two ways:

Option 1: Without config (specify parameters directly)

You can specify all parameters directly in the command without running accelerate config:

accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true

Key accelerate parameters:

  • --multi_gpu: Enable multi-GPU training
  • --num_processes=2: Number of GPUs to use
  • --mixed_precision=fp16: Use fp16 mixed precision (or bf16 if supported); see the example below
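
For example, to run the same training with bf16 mixed precision (assuming your GPUs support bf16; otherwise substitute fp16):

accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  --mixed_precision=bf16 \
  $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu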

Option 2: Using accelerate config

If you prefer to save your configuration for reuse, configure accelerate for your hardware setup by running:

accelerate config

This interactive setup asks you questions about your training environment (number of GPUs, mixed precision settings, etc.) and saves the configuration for future use; you can inspect the saved file afterwards, as shown after these settings. For a simple multi-GPU setup on a single machine, you can use these recommended settings:

  • Compute environment: This machine
  • Number of machines: 1
  • Number of processes: (number of GPUs you want to use)
  • GPU ids to use: (leave empty to use all)
  • Mixed precision: fp16 or bf16 (recommended for faster training)
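
By default, accelerate writes this configuration to a default_config.yaml file in its cache directory (typically ~/.cache/huggingface/accelerate/, though the exact path can differ if HF_HOME is set), so you can review or edit it later:

cat ~/.cache/huggingface/accelerate/default_config.yaml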

Then launch training with:

accelerate launch $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true

How It Works

When you launch training with accelerate, the following happens (a quick way to confirm all GPUs are in use is shown after this list):

  1. Automatic detection: LeRobot automatically detects if it’s running under accelerate
  2. Data distribution: Each GPU processes its own batch of data, so the effective batch size is batch_size × num_gpus
  3. Gradient synchronization: Gradients are synchronized across GPUs during backpropagation
  4. Single process logging: Only the main process logs to wandb and saves checkpoints
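
As a quick sanity check (assuming NVIDIA GPUs with nvidia-smi available), you can watch GPU utilization from a second terminal while training runs; every GPU requested via --num_processes should show memory usage and activity:

nvidia-smi -l 1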

Learning Rate and Training Steps Scaling

Important: LeRobot does NOT automatically scale learning rates or training steps based on the number of GPUs. This gives you full control over your training hyperparameters.

Why No Automatic Scaling?

Many distributed training frameworks automatically scale the learning rate by the number of GPUs (e.g., lr = base_lr × num_gpus). However, LeRobot keeps the learning rate exactly as you specify it.

When and How to Scale

If you want to scale your hyperparameters when using multiple GPUs, you should do it manually:

Learning Rate Scaling:

# Example: 2 GPUs with linear LR scaling
# Base LR: 1e-4, with 2 GPUs -> 2e-4
accelerate launch --num_processes=2 $(which lerobot-train) \
  --optimizer.lr=2e-4 \
  --dataset.repo_id=lerobot/pusht \
  --policy=act

Training Steps Scaling:

Since the effective batch size increases with multiple GPUs (batch_size × num_gpus), you may want to reduce the number of training steps proportionally:

# Example: 2 GPUs with effective batch size 2x larger
# Original: batch_size=8, steps=100000
# With 2 GPUs: batch_size=8 (16 in total), steps=50000
accelerate launch --num_processes=2 $(which lerobot-train) \
  --batch_size=8 \
  --steps=50000 \
  --dataset.repo_id=lerobot/pusht \
  --policy=act

Notes

  • The --policy.use_amp flag in lerobot-train is only used when not running with accelerate. When using accelerate, mixed precision is controlled by accelerate’s configuration.
  • Training logs, checkpoints, and hub uploads are only done by the main process to avoid conflicts. Non-main processes have console logging disabled to prevent duplicate output.
  • The effective batch size is batch_size × num_gpus. If you use 4 GPUs with --batch_size=8, your effective batch size is 32 (see the combined example after these notes).
  • Learning rate scheduling is handled correctly across multiple processes—LeRobot sets step_scheduler_with_optimizer=False to prevent accelerate from adjusting scheduler steps based on the number of processes.
  • When saving or pushing models, LeRobot automatically unwraps the model from accelerate’s distributed wrapper to ensure compatibility.
  • WandB integration automatically initializes only on the main process, preventing multiple runs from being created.
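
Putting these notes together, a 4-GPU launch with bf16 mixed precision and a per-GPU batch size of 8 (effective batch size 32) could look like the following sketch; the dataset, policy, and output values are illustrative:

# Illustrative 4-GPU launch; adjust dataset, policy, and output paths to your setup.
accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  --mixed_precision=bf16 \
  $(which lerobot-train) \
  --dataset.repo_id=lerobot/pusht \
  --policy.type=act \
  --batch_size=8 \
  --output_dir=outputs/train/act_4gpu \
  --job_name=act_4gpu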

For more advanced configurations and troubleshooting, see the Accelerate documentation. If you want to learn more about training on a large number of GPUs, check out this awesome guide: Ultrascale Playbook.
