Memory Utilities

One of the most frustrating errors when running training scripts is hitting “CUDA Out-of-Memory”: the entire script has to be restarted and progress is lost, when a developer typically just wants to start the script and let it run.

Accelerate provides a utility, heavily based on toma, that gives you this capability.

find_executable_batch_size

This algorithm operates with exponential decay, halving the batch size after each failed run of the training script. To use it, restructure your training function to include an inner function decorated with this wrapper, and build your dataloaders inside that inner function. At a minimum, this takes 4 new lines of code.

Note: The inner function must take the batch size as its first parameter, but we do not pass one when calling it. The wrapper handles this for us.

def training_function(args):
    accelerator = Accelerator()
    model = get_model()
    model.to(accelerator.device)
    optimizer = get_optimizer()

+   @find_executable_batch_size(starting_batch_size=args.batch_size)
+   def inner_training_loop(batch_size):
+       nonlocal model, optimizer # Ensure they can be used in our context
        train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
        lr_scheduler = get_scheduler(
            optimizer, 
            num_training_steps=len(train_dataloader)*num_epochs
        )
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
            model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
        )
        train(model, optimizer, train_dataloader, lr_scheduler)
        validate(model, eval_dataloader)
+   inner_training_loop()
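
To make the halving behaviour concrete, here is a minimal standard-library sketch of the retry idea. It is not Accelerate's implementation: `halve_on_oom` and `fake_training_loop` are hypothetical names, and Python's built-in `MemoryError` stands in for a CUDA out-of-memory failure.

```python
import functools


def halve_on_oom(starting_batch_size=128):
    """Hypothetical stand-in for the retry idea; not Accelerate's implementation."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size > 0:
                try:
                    # The wrapper supplies batch_size; callers never pass it.
                    return function(batch_size, *args, **kwargs)
                except MemoryError:
                    # MemoryError stands in for a CUDA out-of-memory failure.
                    batch_size //= 2
            raise RuntimeError("No batch size fit in memory.")

        return wrapper

    return decorator


attempts = []


@halve_on_oom(starting_batch_size=128)
def fake_training_loop(batch_size):
    attempts.append(batch_size)
    if batch_size > 16:  # pretend anything above 16 exhausts memory
        raise MemoryError(f"batch size {batch_size} does not fit")
    return batch_size


final_batch_size = fake_training_loop()  # called with no batch size
print(attempts)          # [128, 64, 32, 16]
print(final_batch_size)  # 16
```

The decorated function is called with no arguments, yet it receives a batch size on every attempt. This is the same calling convention the note above describes for the inner training loop.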

accelerate.find_executable_batch_size

( function: callable = None, starting_batch_size: int = 128 )

Parameters

function (callable, optional) — A function to wrap
starting_batch_size (int, optional) — The first batch size to try to fit into memory

A basic decorator that will try to execute function. If it fails with an exception related to out-of-memory or CUDNN, the batch size is cut in half and passed to function again.

function must take in a batch_size parameter as its first argument.