Accelerate

You are viewing version v0.23.0. A newer version, v1.10.1, is available.

Debugging Distributed Operations

When running scripts in a distributed fashion, functions such as Accelerator.gather() and Accelerator.reduce() (and others) are often necessary to collect tensors across devices and perform operations on them. However, if the tensors being collected do not have compatible shapes, your code will hang. The only sign that this has happened is a timeout exception from torch.distributed, which is costly because the timeout is usually 10 minutes.

Accelerate now has a debug mode that adds a negligible amount of time to each operation but verifies that the inputs you pass in can actually perform the requested operation, so you avoid hitting this timeout.

Visualizing the problem

To have a tangible example of this issue, let’s take the following setup (on 2 GPUs):

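import torch

from accelerate import PartialState
from accelerate.utils import broadcast

state = PartialState()
# Deliberately build tensors with different shapes on each process
if state.process_index == 0:
    tensor = torch.tensor([[0.0, 1, 2, 3, 4]]).to(state.device)
else:
    tensor = torch.tensor([[[0.0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]]).to(state.device)

# Without debug mode, this call hangs because the shapes do not match
broadcast_tensor = broadcast(tensor)
print(broadcast_tensor)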

We’ve created a single tensor on each device, with two different shapes: [1, 5] on process 0 and [1, 2, 5] on process 1. With this setup, an operation such as utils.broadcast() will hang until it hits the timeout, because torch.distributed requires these operations to receive tensors of exactly the same shape on every process.

If you run this yourself, you will find that broadcast_tensor is printed on the main process, although its contents won’t quite be right, and the script then hangs without ever printing it on the other processes:

>>> tensor([[0, 1, 2, 3, 4]], device='cuda:0')

The solution

By enabling Accelerate’s operational debug mode, errors such as this are caught and reported immediately with a clear traceback:

Traceback (most recent call last):
  File "/home/zach_mueller_huggingface_co/test.py", line 18, in <module>
    main()
  File "/home/zach_mueller_huggingface_co/test.py", line 15, in main
    broadcast_tensor = broadcast(tensor)
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/utils/operations.py", line 303, in wrapper
    broadcast_tensor = broadcast(tensor)
accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.

Operation: `accelerate.utils.operations.broadcast`
Input shapes:
  - Process 0: [1, 5]
  - Process 1: [1, 2, 5]

This explains that the shapes across our devices were not the same, and that we should ensure that they match properly to be compatible. Typically this means that there is either an extra dimension, or certain dimensions are incompatible with the operation.
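
To make the idea behind this check concrete, here is a minimal pure-Python sketch of a shape comparison that fails fast instead of waiting for a timeout. It is not Accelerate's implementation: `ShapeMismatchError` and `verify_shapes` are hypothetical names, and the per-process shapes are passed in as a plain list, whereas a real run would first have to collect them from every process.

```python
from typing import Sequence


class ShapeMismatchError(RuntimeError):
    """Hypothetical stand-in for a distributed-operation error."""


def verify_shapes(operation: str, shapes: Sequence[Sequence[int]]) -> None:
    """Raise if the per-process shapes are not all identical.

    `shapes[i]` is the tensor shape reported by process `i`.
    """
    shapes = [list(s) for s in shapes]
    if all(s == shapes[0] for s in shapes):
        return  # every process agrees, so the operation can proceed
    report = "\n".join(f"  - Process {i}: {s}" for i, s in enumerate(shapes))
    raise ShapeMismatchError(
        "Cannot apply desired operation due to shape mismatches.\n\n"
        f"Operation: `{operation}`\nInput shapes:\n{report}"
    )


# Shapes taken from the two-GPU example in this guide
try:
    verify_shapes("broadcast", [[1, 5], [1, 2, 5]])
    message = ""
except ShapeMismatchError as err:
    message = str(err)

print(message)
```

Comparing shapes costs far less than the operation itself, which is consistent with the small overhead described earlier.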

To enable this please do one of the following:

Enable it through the questionnaire during accelerate config (recommended)

From the CLI:

accelerate launch --debug {my_script.py} --arg1 --arg2

As an environment variable (which avoids the need for accelerate launch):

ACCELERATE_DEBUG_MODE="1" accelerate launch {my_script.py} --arg1 --arg2

Manually changing the config.yaml file:

 compute_environment: LOCAL_MACHINE
+debug: true
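
The environment-variable option can also be applied from inside a script, which may be handy when the launcher is not under your control. The sketch below rests on one assumption: that Accelerate reads `ACCELERATE_DEBUG_MODE` when `PartialState` or `Accelerator` is first constructed, so the variable has to be set before that point. The helper name `enable_debug_mode` is hypothetical.

```python
import os


def enable_debug_mode() -> None:
    """Set the environment variable named in the option above.

    Assumption: call this before constructing `PartialState` or
    `Accelerator`, since the flag is expected to be read at that point.
    """
    os.environ["ACCELERATE_DEBUG_MODE"] = "1"


enable_debug_mode()
print(os.environ["ACCELERATE_DEBUG_MODE"])
```

Every process needs the variable set, so the call belongs at the top of the script that each process runs.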
