Accelerate

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Comparing performance across distributed setups

Evaluating and comparing the performance from different setups can be quite tricky if you don’t know what to look for. For example, you cannot run the same script with the same batch size across TPU, multi-GPU, and single-GPU with Accelerate and expect your results to line up.

But why?

There are three reasons for this that this tutorial will cover:

Setting the right seeds
Observed Batch Sizes
Learning Rates

Setting the Seed

While this issue has not come up as much, make sure to use utils.set_seed() to fully set the seed in all distributed cases so training will be reproducible:

from accelerate.utils import set_seed

set_seed(42)

Why is this important? Under the hood this will set 5 different seed settings:

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) # or torch.xpu.manual_seed_all, etc
    # ^^ safe to call this function even if cuda is not available
    if is_torch_xla_available():
        xm.set_rng_state(seed)

The random state, numpy’s state, torch, torch’s device state, and if TPUs are available torch_xla’s cuda state.

Observed Batch Sizes

When training with Accelerate, the batch size passed to the dataloader is the batch size per GPU. What this entails is a batch size of 64 on two GPUs is truly a batch size of 128. As a result, when testing on a single GPU this needs to be accounted for, as well as similarly for TPUs.

The below table can be used as a quick reference to try out different batch sizes:

In this example, there are two GPUs for “Multi-GPU” and a TPU pod with 8 workers

Single GPU Batch Size	Multi-GPU Equivalent Batch Size	TPU Equivalent Batch Size
256	128	32
128	64	16
64	32	8
32	16	4

Learning Rates

As noted in multiple sources[1][2], the learning rate should be scaled linearly based on the number of devices present. The below snippet shows doing so with Accelerate:

Since users can have their own learning rate schedulers defined, we leave this up to the user to decide if they wish to scale their learning rate or not.

learning_rate = 1e-3
accelerator = Accelerator()
learning_rate *= accelerator.num_processes

optimizer = AdamW(params=model.parameters(), lr=learning_rate)

You will also find that accelerate will step the learning rate based on the number of processes being trained on. This is because of the observed batch size noted earlier. So in the case of 2 GPUs, the learning rate will be stepped twice as often as a single GPU to account for the batch size being twice as large (if no changes to the batch size on the single GPU instance are made).

Gradient Accumulation and Mixed Precision

When using gradient accumulation and mixed precision, due to how gradient averaging works (accumulation) and the precision loss (mixed precision), some degradation in performance is expected. This will be explicitly seen when comparing the batch-wise loss between different compute setups. However, the overall loss, metric, and general performance at the end of training should be roughly the same.

Update on GitHub

←Loading big models into memory Executing and deferring jobs→