Accelerate

Run your raw PyTorch training script on any kind of device

Features

  • πŸ€— Accelerate provides an easy API to make your scripts run with mixed precision and in any kind of distributed setting (multi-GPU, TPU, etc.) while still letting you write your own training loop. The same code can then run seamlessly on your local machine for debugging or in your training environment (see the mixed precision sketch after this list).

  • πŸ€— Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment, then launch your scripts.
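
For example, mixed precision can be requested when creating the Accelerator object. This is a minimal sketch, assuming the mixed_precision argument of Accelerator; check the Accelerator reference for the exact options accepted by your version:

from accelerate import Accelerator

# Run the forward and backward passes in fp16 mixed precision.
# The argument name is assumed from the Accelerator reference; verify it for your version.
accelerator = Accelerator(mixed_precision="fp16")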

Easy to integrate

A traditional training loop in PyTorch looks like this:

my_model.to(device)

for batch in my_training_dataloader:
    my_optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = my_model(inputs)
    loss = my_loss_function(outputs, targets)
    loss.backward()
    my_optimizer.step()

Changing it to work with πŸ€— Accelerate is really easy and only adds a few lines of code:

+ from accelerate import Accelerator

+ accelerator = Accelerator()
  # Use the device given by the *accelerator* object.
+ device = accelerator.device
  my_model.to(device)
  # Pass every important object (model, optimizer, dataloader) to *accelerator.prepare*
+ my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
+     my_model, my_optimizer, my_training_dataloader
+ )

  for batch in my_training_dataloader:
      my_optimizer.zero_grad()
      inputs, targets = batch
      inputs = inputs.to(device)
      targets = targets.to(device)
      outputs = my_model(inputs)
      loss = my_loss_function(outputs, targets)
      # Just a small change for the backward instruction
-     loss.backward()
+     accelerator.backward(loss)
      my_optimizer.step()

and with this, your script can now run in a distributed environment (multi-GPU, TPU).

You can even simplify your script a bit by letting πŸ€— Accelerate handle the device placement for you (which is safer, especially for TPU training):

+ from accelerate import Accelerator

+ accelerator = Accelerator()
- my_model.to(device)
  # Pass every important object (model, optimizer, dataloader) to *accelerator.prepare*
+ my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
+     my_model, my_optimizer, my_training_dataloader
+ )

  for batch in my_training_dataloader:
      my_optimizer.zero_grad()
      inputs, targets = batch
-     inputs = inputs.to(device)
-     targets = targets.to(device)
      outputs = my_model(inputs)
      loss = my_loss_function(outputs, targets)
      # Just a small change for the backward instruction
-     loss.backward()
+     accelerator.backward(loss)
      my_optimizer.step()
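
Putting the pieces together, a minimal self-contained sketch of the fully simplified loop could look like the following (the toy model, data, and hyperparameters are placeholders for illustration and are not part of πŸ€— Accelerate):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Toy data, model, optimizer, and loss, purely for illustration.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
my_training_dataloader = DataLoader(dataset, batch_size=8)
my_model = torch.nn.Linear(10, 1)
my_optimizer = torch.optim.SGD(my_model.parameters(), lr=1e-3)
my_loss_function = torch.nn.MSELoss()

# Let πŸ€— Accelerate handle device placement and distributed wrapping.
my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
    my_model, my_optimizer, my_training_dataloader
)

for batch in my_training_dataloader:
    my_optimizer.zero_grad()
    inputs, targets = batch
    outputs = my_model(inputs)
    loss = my_loss_function(outputs, targets)
    accelerator.backward(loss)
    my_optimizer.step()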

Script launcher

No need to remember how to use torch.distributed.launch or to write a specific launcher for TPU training! πŸ€— Accelerate comes with a CLI tool that will make your life easier when launching distributed scripts.

On your machine(s) just run:

accelerate config

and answer the questions asked. This will generate a config file that is used automatically to set the proper default options when you run

accelerate launch my_script.py --args_to_my_script

For instance, here is how you would run the NLP example (from the root of the repo):

accelerate launch examples/nlp_example.py
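
You can also override parts of the saved configuration on the command line. As a rough example (flag names such as --multi_gpu and --num_processes come from accelerate launch --help; double-check the ones available in your version), launching the same example on two GPUs could look like:

accelerate launch --multi_gpu --num_processes 2 examples/nlp_example.py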

Supported integrations

  • CPU only
  • single GPU
  • multi-GPU on one node (machine)
  • multi-GPU on several nodes (machines)
  • TPU
  • FP16 with native AMP (apex on the roadmap)
  • DeepSpeed (experimental support)