Accelerate
Run your raw PyTorch training script on any kind of device.
Features
🤗 Accelerate provides an easy API to make your scripts run with mixed precision and in any kind of distributed setting (multi-GPU, TPU, etc.) while still letting you write your own training loop. The same code can then run seamlessly on your local machine for debugging or in your training environment.
🤗 Accelerate also provides a CLI tool that allows you to quickly configure and test your training environment, then launch your scripts.
Easy to integrate
A traditional training loop in PyTorch looks like this:
my_model.to(device)

for batch in my_training_dataloader:
    my_optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = my_model(inputs)
    loss = my_loss_function(outputs, targets)
    loss.backward()
    my_optimizer.step()
Changing it to work with 🤗 Accelerate is really easy and only adds a few lines of code:
+ from accelerate import Accelerator
+ accelerator = Accelerator()
  # Use the device given by the *accelerator* object.
+ device = accelerator.device
  my_model.to(device)
  # Pass every important object (model, optimizer, dataloader) to *accelerator.prepare*
+ my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
+     my_model, my_optimizer, my_training_dataloader
+ )

  for batch in my_training_dataloader:
      my_optimizer.zero_grad()
      inputs, targets = batch
      inputs = inputs.to(device)
      targets = targets.to(device)
      outputs = my_model(inputs)
      loss = my_loss_function(outputs, targets)
      # Just a small change for the backward instruction
-     loss.backward()
+     accelerator.backward(loss)
      my_optimizer.step()
With this, your script can now run in a distributed environment (multi-GPU, TPU).
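For reference, here is a self-contained sketch of what the adapted script can look like end to end. The tiny model, optimizer, loss function and random data below are illustrative placeholders (not part of 🤗 Accelerate); only the Accelerator calls come from the library.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device

# Placeholder model, optimizer, loss and data, for illustration only.
my_model = nn.Linear(4, 1).to(device)
my_optimizer = torch.optim.SGD(my_model.parameters(), lr=1e-2)
my_loss_function = nn.MSELoss()
my_training_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8
)

# Let Accelerate wrap the model, optimizer and dataloader for the current setup.
my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
    my_model, my_optimizer, my_training_dataloader
)

for batch in my_training_dataloader:
    my_optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = my_model(inputs)
    loss = my_loss_function(outputs, targets)
    # accelerator.backward replaces loss.backward and handles any scaling
    # needed for mixed precision or distributed training.
    accelerator.backward(loss)
    my_optimizer.step()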
You can even simplify your script a bit by letting 🤗 Accelerate handle the device placement for you (which is safer, especially for TPU training):
+ from accelerate import Accelerator
+ accelerator = Accelerator()
- my_model.to(device)
  # Pass every important object (model, optimizer, dataloader) to *accelerator.prepare*
+ my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
+     my_model, my_optimizer, my_training_dataloader
+ )

  for batch in my_training_dataloader:
      my_optimizer.zero_grad()
      inputs, targets = batch
-     inputs = inputs.to(device)
-     targets = targets.to(device)
      outputs = my_model(inputs)
      loss = my_loss_function(outputs, targets)
      # Just a small change for the backward instruction
-     loss.backward()
+     accelerator.backward(loss)
      my_optimizer.step()
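The reason the .to(device) calls can be dropped is that, with Accelerator's default device placement, a dataloader returned by accelerator.prepare yields batches that are already on accelerator.device. Here is a minimal check you can run to convince yourself; the tensors and names are made up for illustration.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Made-up data, just to inspect where the batches end up.
dataset = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=2))

for inputs, targets in dataloader:
    # Both tensors already live on the device chosen by Accelerate.
    print(inputs.device, targets.device, accelerator.device)
    break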
Script launcher
No need to remember how to use torch.distributed.launch or to write a specific launcher for TPU training! 🤗 Accelerate comes with a CLI tool that will make your life easier when launching distributed scripts.
On your machine(s) just run:
accelerate config
and answer the questions asked. This will generate a config file that will be used automatically to properly set the default options when doing
accelerate launch my_script.py --args_to_my_script
For instance, here is how you would run the NLP example (from the root of the repo):
accelerate launch examples/nlp_example.py
Supported integrations
- CPU only
- single GPU
- multi-GPU on one node (machine)
- multi-GPU on several nodes (machines)
- TPU
- FP16 with native AMP (apex on the roadmap); see the sketch after this list
- DeepSpeed (experimental support)
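As a rough sketch of the FP16 entry above: mixed precision can be requested either through accelerate config or directly in the Accelerator constructor. This assumes a recent version of 🤗 Accelerate whose constructor takes a mixed_precision argument (older releases used an fp16=True flag instead); the training loop itself stays unchanged, since prepare and accelerator.backward take care of autocasting and gradient scaling.

from accelerate import Accelerator

# Assumes a version whose constructor accepts mixed_precision
# ("no", "fp16" or "bf16"); the rest of the loop above is unchanged.
accelerator = Accelerator(mixed_precision="fp16")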