Add Accelerate to your code
Each distributed training framework has its own way of doing things, which can require writing a lot of custom code to adapt it to your PyTorch training code and training environment. Accelerate offers a friendly way to interface with these distributed training frameworks without having to learn the specific details of each one. Accelerate takes care of those details for you, so you can focus on the training code and scale it to any distributed training environment.
In this tutorial, you’ll learn how to adapt your existing PyTorch code with Accelerate so you can train on distributed systems with ease. You’ll start with a basic PyTorch training loop (it assumes all the training objects like model and optimizer have been set up already) and progressively integrate Accelerate into it.
device = "cuda"
model.to(device)

for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
Accelerator
The Accelerator is the main class for adapting your code to work with Accelerate. It knows about the distributed setup you’re using such as the number of different processes and your hardware type. This class also provides access to many of the necessary methods for enabling your PyTorch code to work in any distributed training environment and for managing and executing processes across devices.
That’s why you should always start by importing and creating an Accelerator instance in your script.
from accelerate import Accelerator
accelerator = Accelerator()
The Accelerator also knows which device to move your PyTorch objects to, so it is recommended to let Accelerate handle this for you.
- device = "cuda"
+ device = accelerator.device
model.to(device)
Prepare PyTorch objects
Next, you need to prepare your PyTorch objects (model, optimizer, scheduler, etc.) for distributed training. The prepare() method takes care of placing your model in the appropriate container (like single GPU or multi-GPU) for your training setup, adapting the optimizer and scheduler to use Accelerate’s AcceleratedOptimizer and AcceleratedScheduler, and creating a new dataloader that can be sharded across processes.
Accelerate only prepares objects that inherit from their respective PyTorch classes, such as torch.optim.Optimizer.
The PyTorch objects are returned in the same order they’re sent.
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)
Training loop
Finally, remove the to(device) calls to the inputs and targets in the training loop because Accelerate’s DataLoader classes automatically place them on the right device. You should also replace the usual backward() pass with Accelerate’s backward() method, which scales the gradients for you and uses the appropriate backward() method depending on your distributed setup (for example, DeepSpeed or Megatron).
- inputs = inputs.to(device)
- targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
- loss.backward()
+ accelerator.backward(loss)
Put everything together and your new Accelerate training loop should now look like this!
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device

model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
Training features
Accelerate offers additional features - like gradient accumulation, gradient clipping, mixed precision training and more - you can add to your script to improve your training run. Let’s explore these three features.
Gradient accumulation
Gradient accumulation enables you to train on larger batch sizes by accumulating the gradients over multiple batches before updating the weights. This can be useful for getting around memory limitations. To enable this feature in Accelerate, specify the gradient_accumulation_steps parameter in the Accelerator class and add the accumulate() context manager to your script.
+ accelerator = Accelerator(gradient_accumulation_steps=2)
  model, optimizer, training_dataloader = accelerator.prepare(model, optimizer, training_dataloader)

  for input, label in training_dataloader:
+     with accelerator.accumulate(model):
          predictions = model(input)
          loss = loss_function(predictions, label)
          accelerator.backward(loss)
          optimizer.step()
          scheduler.step()
          optimizer.zero_grad()
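To make the idea concrete, the sketch below shows the arithmetic behind gradient accumulation in plain PyTorch, without Accelerate. It compares the gradient of one batch of 8 samples with the gradient accumulated over two micro-batches of 4. The helper `fresh_model` and the division of each loss by the number of steps are assumptions of this sketch, not a description of Accelerate’s internals.

```python
import torch
from torch import nn

torch.manual_seed(0)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)
loss_function = nn.MSELoss()
accumulation_steps = 2


def fresh_model():
    """Hypothetical helper: build a model with identical initial weights each time."""
    torch.manual_seed(1)
    return nn.Linear(4, 1)


# Reference: gradient of one large batch of 8 samples.
big_batch_model = fresh_model()
loss_function(big_batch_model(inputs), targets).backward()

# Accumulation: two micro-batches of 4 samples each.
accumulated_model = fresh_model()
micro_batches = zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps))
for micro_inputs, micro_targets in micro_batches:
    loss = loss_function(accumulated_model(micro_inputs), micro_targets)
    # Dividing by the number of steps makes the summed gradients an average.
    (loss / accumulation_steps).backward()  # .grad adds up across backward() calls

gradients_match = torch.allclose(
    big_batch_model.weight.grad, accumulated_model.weight.grad, atol=1e-6
)
print("gradients match:", gradients_match)
```

Only the micro-batch has to fit in memory at any one time, which is why accumulation helps with memory limits.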
Gradient clipping
Gradient clipping is a technique to prevent “exploding gradients”, and Accelerate offers:
- clip_grad_value_() to clip gradients to a minimum and maximum value
- clip_grad_norm_() to rescale gradients so that their norm does not exceed a maximum value
Mixed precision
Mixed precision accelerates training by using a lower precision data type like fp16 (half-precision) to calculate the gradients. For the best performance with Accelerate, the loss should be computed inside your model (like in Transformers models) because computations outside of the model are computed in full precision.
Set the mixed precision type to use in the Accelerator, and then use the autocast() context manager to automatically cast the values to the specified data type.
Accelerate enables automatic mixed precision, so autocast() is only needed if there are other mixed precision operations besides those performed on loss by backward() which already handles the scaling.
+ accelerator = Accelerator(mixed_precision="fp16")
+ with accelerator.autocast():
      loss = complex_loss_function(outputs, target)
Save and load
Accelerate can save and load a model once training is complete. You can also save the model and optimizer state, which is useful for resuming training.
Model
Once all processes are complete, unwrap the model with the unwrap_model() method before saving it because the prepare() method wrapped your model into the proper interface for distributed training. If you don’t unwrap the model, saving the model state dictionary also saves any potential extra layers from the larger model and you won’t be able to load the weights back into your base model.
You should use the save_model() method to unwrap and save the model state dictionary. This method can also save a model into sharded checkpoints or into the safetensors format.
accelerator.wait_for_everyone()
accelerator.save_model(model, save_directory)
For models from the Transformers library, save the model with the save_pretrained method so that it can be reloaded with the from_pretrained method.
from transformers import AutoModel
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "path/to/my_model_directory",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
)
model = AutoModel.from_pretrained("path/to/my_model_directory")
To load your weights, first unwrap the model with the unwrap_model() method. All model parameters are references to tensors, so this loads your weights inside model.
import os
import torch

unwrapped_model = accelerator.unwrap_model(model)
path_to_checkpoint = os.path.join(save_directory, "pytorch_model.bin")
unwrapped_model.load_state_dict(torch.load(path_to_checkpoint))
State
During training, you may want to save the current state of the model, optimizer, random generators, and potentially learning rate schedulers so they can be restored in the same script. You should add the save_state() and load_state() methods to your script to save and load states.
To further customize where and how states are saved through save_state(), use the ProjectConfiguration class. For example, if automatic_checkpoint_naming is enabled, each saved checkpoint is stored at Accelerator.project_dir/checkpoints/checkpoint_{checkpoint_number}.
Any other stateful items to be stored should be registered with the register_for_checkpointing() method so they can be saved and loaded. Every object passed to this method must have a load_state_dict and a state_dict method.
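As an illustration of that interface, here is a minimal custom object that implements both methods. The `StepCounter` class is hypothetical and written in plain Python, so it runs without Accelerate. It only demonstrates the state_dict / load_state_dict round trip that such an object needs to support.

```python
class StepCounter:
    """Hypothetical stateful object that counts completed training steps."""

    def __init__(self):
        self.steps = 0

    def increment(self):
        self.steps += 1

    def state_dict(self):
        # Return everything needed to rebuild this object later.
        return {"steps": self.steps}

    def load_state_dict(self, state):
        # Restore the object from a previously saved state.
        self.steps = state["steps"]


counter = StepCounter()
for _ in range(5):
    counter.increment()

snapshot = counter.state_dict()

# A fresh object picks up where the first one stopped.
restored_counter = StepCounter()
restored_counter.load_state_dict(snapshot)
print(restored_counter.steps)  # 5
```

In a training script, an object like this would be passed to register_for_checkpointing() on your Accelerator instance before calling save_state().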
If you have torchdata>=0.8.0 installed, you can additionally pass use_stateful_dataloader=True into your DataLoaderConfiguration. This extends Accelerate’s DataLoader classes with a load_state_dict and a state_dict method, and makes Accelerator.save_state and Accelerator.load_state also track how far into the training dataset the dataloader has read when persisting the model.