Multi-GPU Training
This guide shows you how to train policies on multiple GPUs using Hugging Face Accelerate.
Installation
First, ensure you have accelerate installed:
pip install accelerate
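Before choosing how many processes to launch, it can help to confirm that accelerate is importable and to see how many GPUs PyTorch detects. The following is a minimal sketch, not part of LeRobot: the helper names `is_installed` and `visible_gpu_count` are made up for illustration, and the GPU check assumes a CUDA build of PyTorch.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be imported in the current environment."""
    return importlib.util.find_spec(package) is not None


def visible_gpu_count() -> int:
    """Return the number of CUDA devices PyTorch can see, or 0 without PyTorch."""
    if not is_installed("torch"):
        return 0
    import torch  # imported lazily so the script also runs without PyTorch

    return torch.cuda.device_count()


if __name__ == "__main__":
    print("accelerate installed:", is_installed("accelerate"))
    print("visible GPUs:", visible_gpu_count())
```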
Training with Multiple GPUs
You can launch training in two ways:
Option 1: Without config (specify parameters directly)
You can specify all parameters directly in the command without running accelerate config:
accelerate launch \
  --multi_gpu \
  --num_processes=2 \
  $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true
Key accelerate parameters:
- --multi_gpu: Enable multi-GPU training
- --num_processes=2: Number of GPUs to use
- --mixed_precision=fp16: Use fp16 mixed precision, or bf16 if supported (optional, not shown in the command above)
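These flags can also be assembled programmatically, for example when sweeping over GPU counts. The sketch below only builds and prints the command, it does not run it. `build_launch_command` is a hypothetical helper, not a LeRobot or Accelerate API, and it uses only the flags listed in this guide.

```python
import shlex
import shutil


def build_launch_command(num_processes, train_args, mixed_precision=None):
    """Assemble an `accelerate launch` argument list for lerobot-train."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    cmd = ["accelerate", "launch"]
    if num_processes > 1:
        cmd.append("--multi_gpu")
    cmd.append(f"--num_processes={num_processes}")
    if mixed_precision is not None:
        cmd.append(f"--mixed_precision={mixed_precision}")
    # Accelerate flags come before the script; the rest is passed to lerobot-train.
    script = shutil.which("lerobot-train") or "lerobot-train"
    cmd.append(script)
    cmd.extend(train_args)
    return cmd


command = build_launch_command(
    num_processes=2,
    mixed_precision="bf16",
    train_args=["--policy.type=act", "--job_name=act_multi_gpu"],
)
print(shlex.join(command))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.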
Option 2: Using accelerate config
If you prefer to save your configuration for reuse, configure accelerate for your hardware setup by running:
accelerate config
This interactive setup asks about your training environment (number of GPUs, mixed precision settings, etc.) and saves the configuration for future use. For a simple multi-GPU setup on a single machine, you can use these recommended settings:
- Compute environment: This machine
- Number of machines: 1
- Number of processes: (number of GPUs you want to use)
- GPU ids to use: (leave empty to use all)
- Mixed precision: fp16 or bf16 (recommended for faster training)
Then launch training with:
accelerate launch $(which lerobot-train) \
  --dataset.repo_id=${HF_USER}/my_dataset \
  --policy.type=act \
  --policy.repo_id=${HF_USER}/my_trained_policy \
  --output_dir=outputs/train/act_multi_gpu \
  --job_name=act_multi_gpu \
  --wandb.enable=true
How It Works
When you launch training with accelerate:
- Automatic detection: LeRobot automatically detects if it’s running under accelerate
- Data distribution: Each GPU automatically receives its own batch of --batch_size samples at every step
- Gradient synchronization: Gradients are synchronized across GPUs during backpropagation
- Single process logging: Only the main process logs to wandb and saves checkpoints
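The data distribution and gradient synchronization steps above can be illustrated with a toy simulation that runs on CPU. This is a sketch of the general data-parallel idea, not LeRobot's or Accelerate's actual implementation: the one-parameter linear model, the sample data, and the helper names are all invented for the example, and it assumes every process receives a batch of the same size.

```python
# Toy model: prediction = w * x, loss = mean((w * x - y) ** 2).

def grad_mse(w, batch):
    """Gradient of the mean squared error with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(samples, num_processes, batch_size):
    """Give each simulated process its own batch of `batch_size` samples."""
    return [
        samples[rank * batch_size:(rank + 1) * batch_size]
        for rank in range(num_processes)
    ]


samples = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 0.5
num_processes, batch_size = 2, 2

per_rank_batches = shard(samples, num_processes, batch_size)
per_rank_grads = [grad_mse(w, batch) for batch in per_rank_batches]

# Gradient synchronization: average the per-process gradients (an all-reduce mean).
synced_grad = sum(per_rank_grads) / num_processes

# Reference: a single process computing the gradient over all four samples.
full_grad = grad_mse(w, samples)

print("per-process gradients:", per_rank_grads)
print("synchronized gradient:", synced_grad)
print("single-process gradient:", full_grad)
```

Because the batches are the same size, averaging the per-process gradients gives the same result as one process working on all samples at once, which is why adding GPUs behaves like training with a larger batch.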
Learning Rate and Training Steps Scaling
Important: LeRobot does NOT automatically scale learning rates or training steps based on the number of GPUs. This gives you full control over your training hyperparameters.
Why No Automatic Scaling?
Many distributed training frameworks automatically scale the learning rate by the number of GPUs (e.g., lr = base_lr × num_gpus).
However, LeRobot keeps the learning rate exactly as you specify it.
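As a worked example of the linear scaling rule mentioned above, the sketch below computes the learning rate such a framework would apply. With LeRobot you would compute this value yourself and pass it explicitly. `linear_scaled_lr` is a hypothetical helper, not part of LeRobot.

```python
def linear_scaled_lr(base_lr: float, num_gpus: int) -> float:
    """Linear scaling rule: multiply the single-GPU learning rate by the GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return base_lr * num_gpus


base_lr = 1e-4
for num_gpus in (1, 2, 4):
    print(f"{num_gpus} GPU(s): lr = {linear_scaled_lr(base_lr, num_gpus):g}")
```

Linear scaling is only one heuristic. Whether it helps depends on the policy and optimizer, which is one reason to keep the choice manual.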
When and How to Scale
If you want to scale your hyperparameters when using multiple GPUs, you should do it manually:
Learning Rate Scaling:
# Example: 2 GPUs with linear LR scaling
# Base LR: 1e-4, with 2 GPUs -> 2e-4
accelerate launch --num_processes=2 $(which lerobot-train) \
  --optimizer.lr=2e-4 \
  --dataset.repo_id=lerobot/pusht \
  --policy.type=act
Training Steps Scaling:
Since the effective batch size grows with the number of GPUs (batch_size × num_gpus), you may want to reduce the number of training steps proportionally:
# Example: 2 GPUs with effective batch size 2x larger
# Original: batch_size=8, steps=100000
# With 2 GPUs: batch_size=8 (16 in total), steps=50000
accelerate launch --num_processes=2 $(which lerobot-train) \
  --batch_size=8 \
  --steps=50000 \
  --dataset.repo_id=lerobot/pusht \
  --policy.type=act
Notes
- The --policy.use_amp flag in lerobot-train is only used when not running with accelerate. When using accelerate, mixed precision is controlled by accelerate’s configuration.
- Training logs, checkpoints, and hub uploads are only done by the main process to avoid conflicts. Non-main processes have console logging disabled to prevent duplicate output.
- The effective batch size is batch_size × num_gpus. If you use 4 GPUs with --batch_size=8, your effective batch size is 32.
- Learning rate scheduling is handled correctly across multiple processes: LeRobot sets step_scheduler_with_optimizer=False to prevent accelerate from adjusting scheduler steps based on the number of processes.
- When saving or pushing models, LeRobot automatically unwraps the model from accelerate’s distributed wrapper to ensure compatibility.
- WandB integration automatically initializes only on the main process, preventing multiple runs from being created.
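The batch-size and step arithmetic from this guide can be sketched in a few lines. This is an illustration only: `effective_batch_size` and `scaled_steps` are hypothetical helpers, and reducing the steps in proportion to the GPU count is the optional heuristic described earlier, not something LeRobot does for you.

```python
def effective_batch_size(batch_size: int, num_gpus: int) -> int:
    """Each GPU processes `batch_size` samples per step."""
    return batch_size * num_gpus


def scaled_steps(base_steps: int, num_gpus: int) -> int:
    """Keep the total number of samples seen roughly constant across GPU counts."""
    # Ceiling division, so the multi-GPU run never sees fewer samples than the baseline.
    return -(-base_steps // num_gpus)


batch_size, base_steps = 8, 100_000
for num_gpus in (1, 2, 4):
    total = effective_batch_size(batch_size, num_gpus)
    steps = scaled_steps(base_steps, num_gpus)
    print(f"{num_gpus} GPU(s): effective batch size {total}, --steps={steps}")
```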
For more advanced configurations and troubleshooting, see the Accelerate documentation. To learn more about training on a large number of GPUs, check out the Ultrascale Playbook.