Accelerate documentation

Internals

This page documents Accelerate v0.11.0.

Gradient Accumulation states

class accelerate.state.GradientState()

This is a variation of a singleton class in the sense that all instances of GradientState share the same state, which is initialized on the first instantiation.

This specific state tracks whether gradients should be synced and whether the end of a prepared dataloader has been reached.

Attributes:

  • sync_gradients (bool) — Whether the gradients should be synced
  • end_of_dataloader (bool) — Whether we have reached the end of the current dataloader
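The shared-state singleton described above (sometimes called the Borg pattern) can be sketched in a few lines. The class SharedGradientState is hypothetical and only mirrors the two attributes listed here; it is an illustration of the pattern, not Accelerate's own code.

```python
class SharedGradientState:
    """Hypothetical stand-in illustrating the shared-state pattern; not accelerate's class."""

    # One dictionary shared by every instance of the class.
    _shared_state: dict = {}

    def __init__(self):
        # Every instance stores its attributes in the same dictionary.
        self.__dict__ = self._shared_state
        # Initialize only on the first instantiation (while the shared dict is still empty).
        if not self._shared_state:
            self.sync_gradients = True
            self.end_of_dataloader = False


first = SharedGradientState()
second = SharedGradientState()
second.sync_gradients = False
print(first.sync_gradients)  # False: the change is visible through every instance
```

The instances remain distinct objects; only their attribute storage is shared, so a flag flipped in one part of a training loop is seen everywhere else.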

Optimizer

class accelerate.optimizer.AcceleratedOptimizer(optimizer, device_placement=True, scaler=None)

Parameters

  • optimizer (torch.optim.optimizer.Optimizer) — The optimizer to wrap.
  • device_placement (bool, optional, defaults to True) — Whether or not the optimizer should handle device placement. If so, it will place the optimizer's state dictionary on the right device.
  • scaler (torch.cuda.amp.grad_scaler.GradScaler, optional) — The scaler to use in the step function if training with mixed precision.

Internal wrapper around a torch optimizer.

When performing gradient accumulation, step and zero_grad are only performed if the gradients should be synchronized.
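A minimal sketch of this conditional behavior follows, assuming a shared flag equivalent to GradientState.sync_gradients. ToyGradientState, ToyOptimizer and ConditionalOptimizer are hypothetical names; no real torch optimizer is involved, the toy one only counts calls.

```python
from dataclasses import dataclass


@dataclass
class ToyGradientState:
    """Hypothetical shared flag, standing in for GradientState.sync_gradients."""
    sync_gradients: bool = True


class ToyOptimizer:
    """Hypothetical optimizer that only counts how often it is called."""

    def __init__(self):
        self.steps = 0
        self.zero_grads = 0

    def step(self):
        self.steps += 1

    def zero_grad(self):
        self.zero_grads += 1


class ConditionalOptimizer:
    """Sketch: forward step/zero_grad only when gradients should be synced."""

    def __init__(self, optimizer, state):
        self.optimizer = optimizer
        self.state = state

    def step(self):
        if self.state.sync_gradients:
            self.optimizer.step()

    def zero_grad(self):
        if self.state.sync_gradients:
            self.optimizer.zero_grad()


state = ToyGradientState()
inner = ToyOptimizer()
optimizer = ConditionalOptimizer(inner, state)
accumulation_steps = 4
for micro_batch in range(8):
    # Sync only on the last micro-batch of each accumulation window.
    state.sync_gradients = (micro_batch + 1) % accumulation_steps == 0
    optimizer.step()
    optimizer.zero_grad()
print(inner.steps, inner.zero_grads)  # 2 2
```

Because the training loop can call step and zero_grad unconditionally, the accumulation logic lives entirely in the wrapper and the shared flag.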

DataLoader

The main work on your PyTorch DataLoader is done by the following function:

accelerate.data_loader.prepare_data_loader(dataloader: DataLoader, device: typing.Optional[torch.device] = None, num_processes: typing.Optional[int] = None, process_index: typing.Optional[int] = None, split_batches: bool = False, put_on_device: bool = False, rng_types: typing.Optional[typing.List[typing.Union[str, accelerate.utils.dataclasses.RNGType]]] = None, dispatch_batches: typing.Optional[bool] = None) -> torch.utils.data.dataloader.DataLoader

Parameters

  • dataloader (torch.utils.data.dataloader.DataLoader) — The data loader to split across several devices.
  • device (torch.device) — The target device for the returned DataLoader.
  • num_processes (int, optional) — The number of processes running concurrently. Will default to the value given by AcceleratorState.
  • process_index (int, optional) — The index of the current process. Will default to the value given by AcceleratorState.
  • split_batches (bool, optional, defaults to False) — Whether the resulting DataLoader should split the batches of the original data loader across devices or yield full batches (in which case it will yield batches starting at the process_index-th and advancing by num_processes batches at each iteration).

    Another way to see this is that the observed batch size will be the same as that of the initial dataloader if this option is set to True, and the batch size of the initial dataloader multiplied by num_processes otherwise.

    Setting this option to True requires that the batch size of the dataloader is a round multiple of num_processes.

  • put_on_device (bool, optional, defaults to False) — Whether or not to put the batches on device (only works if the batches are nested lists, tuples or dictionaries of tensors).
  • rng_types (list of str or RNGType) — The list of random number generators to synchronize at the beginning of each iteration. Should be one or several of:

    • "torch": the base torch random number generator
    • "cuda": the CUDA random number generator (GPU only)
    • "xla": the XLA random number generator (TPU only)
    • "generator": the torch.Generator of the sampler (or batch sampler if there is no sampler in your dataloader) or of the iterable dataset (if it exists) if the underlying dataset is of that type.
  • dispatch_batches (bool, optional) — If set to True, the prepared dataloader is only iterated through on the main process, and the batches are then split and broadcast to each process. Will default to True when the underlying dataset is an IterableDataset, False otherwise.
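The batch-size arithmetic behind split_batches can be written out explicitly. The two helpers below are hypothetical and only encode the rules stated above: with split_batches=True each process gets an equal slice of the original batch (so the original batch size must be divisible by num_processes), otherwise each process gets a full batch.

```python
def per_process_batch_size(original_batch_size: int, num_processes: int, split_batches: bool) -> int:
    """Batch size seen by each process (hypothetical helper)."""
    if split_batches:
        if original_batch_size % num_processes != 0:
            raise ValueError(
                f"batch size {original_batch_size} is not a multiple of num_processes={num_processes}"
            )
        # Each process receives an equal slice of every original batch.
        return original_batch_size // num_processes
    # Each process receives whole batches of the original size.
    return original_batch_size


def observed_batch_size(original_batch_size: int, num_processes: int, split_batches: bool) -> int:
    """Total samples consumed per iteration across all processes (hypothetical helper)."""
    return per_process_batch_size(original_batch_size, num_processes, split_batches) * num_processes


print(observed_batch_size(8, 4, split_batches=True))   # 8: same as the original dataloader
print(observed_batch_size(8, 4, split_batches=False))  # 32: 8 x num_processes
```

The second value is what DataLoaderShard exposes as total_batch_size when split_batches=False.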

Returns

torch.utils.data.dataloader.DataLoader

A new data loader that yields the portion of the batches assigned to the current process.

Wraps a PyTorch DataLoader to generate batches for one of the processes only.

Depending on the value of the drop_last attribute of the dataloader passed, it will either stop the iteration at the first batch that would be too small / not present on all processes or loop with indices from the beginning.

This does not support BatchSampler with varying batch size yet.
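The drop_last behavior can be illustrated with a toy model. The helper shard_indices is hypothetical and works on plain index lists rather than a real DataLoader: it hands out full batches round-robin (the split_batches=False case) and then either truncates to complete rounds or pads the last round with indices from the beginning. It is a sketch under these assumptions, not the library's exact algorithm.

```python
from typing import List


def shard_indices(num_samples: int, batch_size: int, num_processes: int,
                  process_index: int, drop_last: bool) -> List[List[int]]:
    """Toy round-robin sharding of full batches (split_batches=False)."""
    indices = list(range(num_samples))
    step = batch_size * num_processes  # samples consumed per round across all processes
    if drop_last:
        # Keep only complete rounds so every process gets the same number of full batches.
        usable = (num_samples // step) * step
        indices = indices[:usable]
    else:
        # Pad the last round by looping back to the start of the indices.
        remainder = num_samples % step
        if remainder:
            needed = step - remainder
            indices += [indices[i % num_samples] for i in range(needed)]
    batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
    # Process k takes batches k, k + num_processes, k + 2 * num_processes, ...
    return batches[process_index::num_processes]


print(shard_indices(10, 2, 2, 1, drop_last=True))   # [[2, 3], [6, 7]]
print(shard_indices(10, 2, 2, 1, drop_last=False))  # [[2, 3], [6, 7], [0, 1]]
```

With 10 samples, a batch size of 2 and 2 processes, the last round only has samples 8 and 9: drop_last=True discards it, while drop_last=False completes it with samples 0 and 1.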

DataLoaderShard

class accelerate.data_loader.DataLoaderShard(*args, **kwds)

Parameters

  • dataset (torch.utils.data.dataset.Dataset) — The dataset to use to build this dataloader.
  • device (torch.device, optional) — If passed, the device to put all batches on.
  • rng_types (list of str or RNGType) — The list of random number generators to synchronize at the beginning of each iteration. Should be one or several of:

    • "torch": the base torch random number generator
    • "cuda": the CUDA random number generator (GPU only)
    • "xla": the XLA random number generator (TPU only)
    • "generator": an optional torch.Generator
  • generator (torch.Generator, optional) — A random number generator to keep synchronized across processes.
  • kwargs — All other keyword arguments to pass to the regular DataLoader initialization.

Subclass of a PyTorch DataLoader that handles device placement and the current distributed setup.

Available attributes:

  • total_batch_size (int) — Total batch size of the dataloader across all processes. Equal to the original batch size when split_batches=True; otherwise the original batch size * the total number of processes
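Why rng_types matters can be shown with a toy simulation: if every process resets its generator to the same seed at the start of an iteration, all processes compute the same shuffle and can then each take their own shard of it. The function synchronized_permutations is hypothetical, processes are simulated in a loop with Python's random module, and the broadcast of the seed from the main process is simulated by passing one value to every simulated process.

```python
import random


def synchronized_permutations(num_processes: int, dataset_size: int, shared_seed: int):
    """Toy model: every process seeds its generator with the same value before iterating."""
    permutations = []
    for _ in range(num_processes):
        generator = random.Random()   # each simulated process has its own generator...
        generator.seed(shared_seed)   # ...but resets it to a seed shared by all processes
        order = list(range(dataset_size))
        generator.shuffle(order)
        permutations.append(order)
    return permutations


# The main process draws the seed; a real setup would broadcast it to the others.
seed = random.Random(0).randint(0, 2**32 - 1)
orders = synchronized_permutations(num_processes=2, dataset_size=8, shared_seed=seed)
print(orders[0] == orders[1])  # True
```

Without this synchronization, each process would shuffle differently, and the per-process shards could overlap or miss samples.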

BatchSamplerShard

class accelerate.data_loader.BatchSamplerShard(*args, **kwds)

Parameters

  • batch_sampler (torch.utils.data.sampler.BatchSampler) — The batch sampler to split into several shards.
  • num_processes (int, optional, defaults to 1) — The number of processes running concurrently.
  • process_index (int, optional, defaults to 0) — The index of the current process.
  • split_batches (bool, optional, defaults to False) — Whether the shards should be created by splitting a batch to give a piece of it to each process, or by yielding different full batches on each process.

    On two processes with a sampler of [[0, 1, 2, 3], [4, 5, 6, 7]], this will result in:

    • the sampler on process 0 to yield [0, 1, 2, 3] and the sampler on process 1 to yield [4, 5, 6, 7] if this argument is set to False.
    • the sampler on process 0 to yield [0, 1] then [4, 5] and the sampler on process 1 to yield [2, 3] then [6, 7] if this argument is set to True.

Wraps a PyTorch BatchSampler to generate batches for one of the processes only. Instances of this class will always yield a number of batches that is a round multiple of num_processes and that all have the same size. Depending on the value of the drop_last attribute of the batch sampler passed, it will either stop the iteration at the first batch that would be too small / not present on all processes or loop with indices from the beginning.

This does not support BatchSampler with varying batch size yet.
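The two sharding modes in the split_batches example above can be reproduced with plain lists. The helper shard_batches is hypothetical and assumes equal-size batches whose size is divisible by num_processes; it ignores drop_last and padding.

```python
from typing import List


def shard_batches(batches: List[List[int]], num_processes: int,
                  process_index: int, split_batches: bool) -> List[List[int]]:
    """Toy version of the two sharding modes (equal-size batches, no leftovers)."""
    if split_batches:
        # Give each process one contiguous slice of every batch.
        out = []
        for batch in batches:
            chunk = len(batch) // num_processes
            out.append(batch[process_index * chunk:(process_index + 1) * chunk])
        return out
    # Otherwise hand out whole batches in round-robin order.
    return batches[process_index::num_processes]


batches = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(shard_batches(batches, 2, 0, split_batches=False))  # [[0, 1, 2, 3]]
print(shard_batches(batches, 2, 0, split_batches=True))   # [[0, 1], [4, 5]]
```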

IterableDatasetShard

class accelerate.data_loader.IterableDatasetShard(*args, **kwds)

Parameters

  • dataset (torch.utils.data.dataset.IterableDataset) — The iterable dataset to split into several shards.
  • batch_size (int, optional, defaults to 1) — The size of the batches per shard (if split_batches=False) or the size of the batches (if split_batches=True).
  • drop_last (bool, optional, defaults to False) — Whether or not to drop the last incomplete batch or complete the last batches by using the samples from the beginning.
  • num_processes (int, optional, defaults to 1) — The number of processes running concurrently.
  • process_index (int, optional, defaults to 0) — The index of the current process.
  • split_batches (bool, optional, defaults to False) — Whether the shards should be created by splitting a batch to give a piece of it to each process, or by yielding different full batches on each process.

    On two processes with batch_size=4 and an iterable dataset yielding [0, 1, 2, 3, 4, 5, 6, 7], this will result in:

    • the shard on process 0 to yield [0, 1, 2, 3] and the shard on process 1 to yield [4, 5, 6, 7] if this argument is set to False.
    • the shard on process 0 to yield [0, 1, 4, 5] and the shard on process 1 to yield [2, 3, 6, 7] if this argument is set to True.

Wraps a PyTorch IterableDataset to generate samples for one of the processes only. Instances of this class will always yield a number of samples that is a round multiple of the actual batch size (depending on the value of split_batches, this is either batch_size or batch_size x num_processes). Depending on the value of drop_last, it will either stop the iteration at the first batch that would be too small or loop with samples from the beginning.
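A generator-based sketch of this sharding reproduces the [0, ..., 7] example with batch_size=4. The function iterable_shard is hypothetical; it models only the drop_last=True case (an incomplete final round is discarded) and assumes batch_size is divisible by num_processes when split_batches=True.

```python
from typing import Iterable, Iterator, List


def iterable_shard(samples: Iterable[int], batch_size: int, num_processes: int,
                   process_index: int, split_batches: bool) -> Iterator[int]:
    """Toy sharding of a stream; an incomplete final round is discarded."""
    # Number of samples consumed from the stream per round, across all processes.
    real_batch_size = batch_size if split_batches else batch_size * num_processes
    per_process = real_batch_size // num_processes
    start = process_index * per_process
    buffer: List[int] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == real_batch_size:
            # Yield only this process's slice of the round, then start a new round.
            yield from buffer[start:start + per_process]
            buffer = []


print(list(iterable_shard(range(8), 4, 2, 0, split_batches=False)))  # [0, 1, 2, 3]
print(list(iterable_shard(range(8), 4, 2, 0, split_batches=True)))   # [0, 1, 4, 5]
```

Every process reads the whole stream and keeps only its slice, which is why each round must be buffered before anything is yielded.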

Scheduler

class accelerate.scheduler.AcceleratedScheduler(scheduler, optimizers, step_with_optimizer: bool = True, split_batches: bool = False)

Parameters

  • scheduler (torch.optim.lr_scheduler._LRScheduler) — The scheduler to wrap.
  • optimizers (one or a list of torch.optim.Optimizer) — The optimizers used.
  • step_with_optimizer (bool, optional, defaults to True) — Whether or not the scheduler should be stepped at each optimizer step.
  • split_batches (bool, optional, defaults to False) — Whether or not the dataloaders split one batch across the different processes (so batch size is the same regardless of the number of processes) or create batches on each process (so batch size is the original batch size multiplied by the number of processes).

A wrapper around a learning rate scheduler that will only step when the optimizer(s) have performed a training step. This avoids stepping the scheduler too fast when the gradients overflowed and no training step was taken (in mixed precision training).

When performing gradient accumulation, do not change the scheduler's length to account for it; Accelerate will always step the scheduler in a way that accounts for gradient accumulation.
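The skip-on-overflow idea can be sketched with toy classes. ToyScheduler and SkippingScheduler are hypothetical, and the optimizer_stepped argument stands in for however the wrapper learns that an update was skipped. The choice to advance the scheduler num_processes times when split_batches=False is an assumption based on the split_batches description (each iteration then consumes num_processes original batches); check it against the Accelerate source before relying on it.

```python
class ToyScheduler:
    """Hypothetical scheduler that only counts its steps."""

    def __init__(self):
        self.num_steps = 0

    def step(self):
        self.num_steps += 1


class SkippingScheduler:
    """Sketch: step the wrapped scheduler only after a real optimizer step."""

    def __init__(self, scheduler, num_processes: int, split_batches: bool):
        self.scheduler = scheduler
        self.num_processes = num_processes
        self.split_batches = split_batches

    def step(self, optimizer_stepped: bool):
        if not optimizer_stepped:
            # e.g. the update was skipped because of a mixed precision overflow
            return
        # Assumption: without split_batches, one iteration covers num_processes
        # original batches, so the scheduler advances num_processes times.
        repeats = 1 if self.split_batches else self.num_processes
        for _ in range(repeats):
            self.scheduler.step()


sched = ToyScheduler()
wrapped = SkippingScheduler(sched, num_processes=2, split_batches=False)
for stepped in [True, False, True]:
    wrapped.step(stepped)
print(sched.num_steps)  # 4: two real optimizer steps, each counted twice
```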

Distributed Config

AcceleratorState

class accelerate.state.AcceleratorState(mixed_precision: str = None, cpu: bool = False, deepspeed_plugin=None, fsdp_plugin=None, _from_accelerator: bool = False, **kwargs)

Attributes:

  • device (torch.device) — The device to use.
  • sync_gradients (bool) — Whether to sync the gradients or not.
  • distributed_type (accelerate.state.DistributedType) — The type of distributed environment currently in use.
  • num_processes (int) — The number of processes currently launched in parallel.
  • process_index (int) — The index of the current process.
  • local_process_index (int) — The index of the current process on the current server.
  • mixed_precision (str) — Whether or not the current script will use mixed precision and, if so, whether FP16 or BF16 (bfloat16) is used as the floating-point format.

This is a variation of a singleton class in the sense that all instances of AcceleratorState share the same state, which is initialized on the first instantiation.

Tracking

class accelerate.tracking.GeneralTracker()

A base Tracker class to be used for all logging integration implementations.

finish()

Should run any finalizing functions within the tracking API. If the API does not have any, do not override this method.

log(values: dict, step: typing.Optional[int])

Parameters

  • values (Dictionary str to str, float, or int) — Values to be logged as key-value pairs. The values need to have type str, float, or int.
  • step (int, optional) — The run step. If included, the log will be affiliated with this step.

Logs values to the current run. Base log implementations of a tracking API should go here, along with any special behavior for the step parameter.

store_init_configuration(values: dict)

Parameters

  • values (Dictionary str to bool, str, float, int, or None) — Values to be stored as initial hyperparameters as key-value pairs. The values need to have type bool, str, float, int, or None.

Logs values as hyperparameters for the run. Implementations should use the experiment configuration functionality of a tracking API.
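The three hooks above can be illustrated with a small in-memory tracker. InMemoryTracker is hypothetical and duck-typed: a real integration would subclass accelerate.tracking.GeneralTracker, but the sketch does not, so that it runs without Accelerate installed. The default step=None is an assumption made for convenience.

```python
from typing import Dict, List, Optional, Tuple, Union

Scalar = Union[str, float, int]


class InMemoryTracker:
    """Hypothetical tracker exposing the same three hooks as GeneralTracker."""

    def __init__(self):
        self.config: Dict[str, Union[bool, str, float, int, None]] = {}
        self.history: List[Tuple[Optional[int], Dict[str, Scalar]]] = []
        self.finished = False

    def store_init_configuration(self, values: dict):
        # Record hyperparameters once, at the start of the run.
        self.config.update(values)

    def log(self, values: dict, step: Optional[int] = None):
        # Store a copy of the values, tagged with the optional step.
        self.history.append((step, dict(values)))

    def finish(self):
        # Nothing to flush for an in-memory store; just mark the run as closed.
        self.finished = True


tracker = InMemoryTracker()
tracker.store_init_configuration({"learning_rate": 3e-4, "use_fp16": True})
tracker.log({"loss": 0.5}, step=1)
tracker.finish()
print(tracker.history)  # [(1, {'loss': 0.5})]
```

A tracker backed by an external service would replace the dictionary and list with calls to that service's configuration and logging APIs, keeping the same method names.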