Accelerate documentation

Kwargs handlers

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v1.1.0).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Kwargs handlers

The following objects can be passed to the main Accelerator to customize how some PyTorch objects related to distributed training or mixed precision are created.

AutocastKwargs

class accelerate.AutocastKwargs

< >

( enabled: bool = True cache_enabled: bool = None )

Use this object in your Accelerator to customize how torch.autocast behaves. Please refer to the documentation of this context manager for more information on each argument.

Example:

from accelerate import Accelerator
from accelerate.utils import AutocastKwargs

kwargs = AutocastKwargs(cache_enabled=True)
accelerator = Accelerator(kwargs_handlers=[kwargs])

DistributedDataParallelKwargs

class accelerate.DistributedDataParallelKwargs

< >

( dim: int = 0 broadcast_buffers: bool = True bucket_cap_mb: int = 25 find_unused_parameters: bool = False check_reduction: bool = False gradient_as_bucket_view: bool = False static_graph: bool = False comm_hook: DDPCommunicationHookType = <DDPCommunicationHookType.NO: 'no'> comm_wrapper: typing.Literal[<DDPCommunicationHookType.NO: 'no'>, <DDPCommunicationHookType.FP16: 'fp16'>, <DDPCommunicationHookType.BF16: 'bf16'>] = <DDPCommunicationHookType.NO: 'no'> comm_state_option: dict = <factory> )

Use this object in your Accelerator to customize how your model is wrapped in a torch.nn.parallel.DistributedDataParallel. Please refer to the documentation of this wrapper for more information on each argument.

gradient_as_bucket_view is only available in PyTorch 1.7.0 and later versions.

static_graph is only available in PyTorch 1.11.0 and later versions.

Example:

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[kwargs])

FP8RecipeKwargs

class accelerate.utils.FP8RecipeKwargs

< >

( backend: typing.Literal['MSAMP', 'TE'] = None use_autocast_during_eval: bool = None opt_level: typing.Literal['O1', 'O2'] = None margin: int = None interval: int = None fp8_format: typing.Literal['E4M3', 'HYBRID'] = None amax_history_len: int = None amax_compute_algo: typing.Literal['max', 'most_recent'] = None override_linear_precision: typing.Tuple[bool, bool, bool] = None )

Parameters

  • backend (str, optional) — Which FP8 engine to use. Must be one of "msamp" (MS-AMP) or "te" (TransformerEngine). If not passed, will use whichever is available in the environment, prioritizing MS-AMP.
  • use_autocast_during_eval (bool, optional, default to False) — Whether to use FP8 autocast during eval mode. Generally better metrics are found when this is False.
  • margin (int, optional, default to 0) — The margin to use for the gradient scaling.
  • interval (int, optional, default to 1) — The interval to use for how often the scaling factor is recomputed.
  • fp8_format (str, optional, default to “HYBRID”) — The format to use for the FP8 recipe. Must be one of HYBRID or E4M3. (Generally HYBRID for training, E4M3 for evaluation)
  • amax_history_len (int, optional, default to 1024) — The length of the history to use for the scaling factor computation
  • amax_compute_algo (str, optional, default to “most_recent”) — The algorithm to use for the scaling factor computation. Must be one of max or most_recent.
  • override_linear_precision (tuple of three bool, optional, default to (False, False, False)) — Whether or not to execute fprop, dgrad, and wgrad GEMMS in higher precision.
  • optimization_level (str), one of O1, O2. (default is O2) — What level of 8-bit collective communication should be used with MS-AMP. In general:
    • O1: Weight gradients and all_reduce communications are done in fp8, reducing GPU memory usage and communication bandwidth
    • O2: First-order optimizer states are in 8-bit, and second order states are in FP16. Only available when using Adam or AdamW. This maintains accuracy and can potentially save the highest memory.
    • 03: Specifically for DeepSpeed, implements capabilities so weights and master weights of models are stored in FP8. If fp8 is selected and deepspeed is enabled, will be used by default. (Not available currently).

Use this object in your Accelerator to customize the initialization of the recipe for FP8 mixed precision training with transformer-engine or ms-amp.

For more information on transformer-engine args, please refer to the API documentation.

For more information on the ms-amp args, please refer to the Optimization Level documentation.

from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

kwargs = FP8RecipeKwargs(backend="te", fp8_format="HYBRID")
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[kwargs])

To use MS-AMP as an engine, pass backend="msamp" and the optimization_level:

kwargs = FP8RecipeKwargs(backend="msamp", optimization_level="02")

ProfileKwargs

class accelerate.ProfileKwargs

< >

( activities: typing.Optional[typing.List[typing.Literal['cpu', 'xpu', 'mtia', 'cuda']]] = None schedule_option: typing.Optional[typing.Dict[str, int]] = None on_trace_ready: typing.Optional[typing.Callable] = None record_shapes: bool = False profile_memory: bool = False with_stack: bool = False with_flops: bool = False with_modules: bool = False output_trace_dir: typing.Optional[str] = None )

Parameters

  • activities (List[str], optional, default to None) — The list of activity groups to use in profiling. Must be one of "cpu", "xpu", "mtia", or "cuda".
  • schedule_option (Dict[str, int], optional, default to None) — The schedule option to use for the profiler. Available keys are wait, warmup, active, repeat and skip_first. The profiler will skip the first skip_first steps, then wait for wait steps, then do the warmup for the next warmup steps, then do the active recording for the next active steps and then repeat the cycle starting with wait steps. The optional number of cycles is specified with the repeat parameter, the zero value means that the cycles will continue until the profiling is finished.
  • on_trace_ready (Callable, optional, default to None) — Callable that is called at each step when schedule returns ProfilerAction.RECORD_AND_SAVE during the profiling.
  • record_shapes (bool, optional, default to False) — Save information about operator’s input shapes.
  • profile_memory (bool, optional, default to False) — Track tensor memory allocation/deallocation
  • with_stack (bool, optional, default to False) — Record source information (file and line number) for the ops.
  • with_flops (bool, optional, default to False) — Use formula to estimate the FLOPS of specific operators
  • with_modules (bool, optional, default to False) — Record module hierarchy (including function names) corresponding to the callstack of the op.
  • output_trace_dir (str, optional, default to None) — Exports the collected trace in Chrome JSON format. Chrome use ‘chrome://tracing’ view json file. Defaults to None, which means profiling does not store json files.

Use this object in your Accelerator to customize the initialization of the profiler. Please refer to the documentation of this context manager for more information on each argument.

torch.profiler is only available in PyTorch 1.8.1 and later versions.

Example:

from accelerate import Accelerator
from accelerate.utils import ProfileKwargs

kwargs = ProfileKwargs(activities=["cpu", "cuda"])
accelerator = Accelerator(kwargs_handlers=[kwargs])

build

< >

( ) torch.profiler.profile

Returns

torch.profiler.profile

The profiler object.

Build a profiler object with the current configuration.

GradScalerKwargs

class accelerate.GradScalerKwargs

< >

( init_scale: float = 65536.0 growth_factor: float = 2.0 backoff_factor: float = 0.5 growth_interval: int = 2000 enabled: bool = True )

Use this object in your Accelerator to customize the behavior of mixed precision, specifically how the torch.cuda.amp.GradScaler used is created. Please refer to the documentation of this scaler for more information on each argument.

GradScaler is only available in PyTorch 1.5.0 and later versions.

Example:

from accelerate import Accelerator
from accelerate.utils import GradScalerKwargs

kwargs = GradScalerKwargs(backoff_filter=0.25)
accelerator = Accelerator(kwargs_handlers=[kwargs])

InitProcessGroupKwargs

class accelerate.InitProcessGroupKwargs

< >

( backend: typing.Optional[str] = 'nccl' init_method: typing.Optional[str] = None timeout: typing.Optional[datetime.timedelta] = None )

Use this object in your Accelerator to customize the initialization of the distributed processes. Please refer to the documentation of this method for more information on each argument.

Note: If timeout is set to None, the default will be based upon how backend is set.

from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=800))
accelerator = Accelerator(kwargs_handlers=[kwargs])

KwargsHandler

class accelerate.utils.KwargsHandler

< >

( )

Internal mixin that implements a to_kwargs() method for a dataclass.

to_kwargs

< >

( )

Returns a dictionary containing the attributes with values different from the default of this class.

< > Update on GitHub