Kwargs Handlers
The following objects can be passed to the main Accelerator to customize how some PyTorch objects related to distributed training or mixed precision are created.
AutocastKwargs
Use this object in your Accelerator to customize how torch.autocast behaves. Please refer to the documentation of this context manager for more information on each argument.
DistributedDataParallelKwargs
class accelerate.DistributedDataParallelKwargs
< source >( dim: int = 0 broadcast_buffers: bool = True bucket_cap_mb: int = 25 find_unused_parameters: bool = False check_reduction: bool = False gradient_as_bucket_view: bool = False static_graph: bool = False comm_hook: DDPCommunicationHookType = <DDPCommunicationHookType.NO: 'no'> comm_wrapper: Literal = <DDPCommunicationHookType.NO: 'no'> comm_state_option: dict = <factory> )
Use this object in your Accelerator to customize how your model is wrapped in a torch.nn.parallel.DistributedDataParallel. Please refer to the documentation of this wrapper for more information on each argument.
gradient_as_bucket_view is only available in PyTorch 1.7.0 and later versions.
static_graph is only available in PyTorch 1.11.0 and later versions.
FP8RecipeKwargs
class accelerate.utils.FP8RecipeKwargs
< source >( backend: Literal = 'MSAMP' opt_level: Literal = 'O2' margin: int = 0 interval: int = 1 fp8_format: Literal = 'E4M3' amax_history_len: int = 1 amax_compute_algo: Literal = 'most_recent' override_linear_precision: Tuple = (False, False, False) )
Parameters
- backend (str, optional, defaults to "msamp") — Which FP8 engine to use. Must be one of "msamp" (MS-AMP) or "te" (TransformerEngine).
- margin (int, optional, defaults to 0) — The margin to use for the gradient scaling.
- interval (int, optional, defaults to 1) — How often the scaling factor is recomputed.
- fp8_format (str, optional, defaults to "E4M3") — The format to use for the FP8 recipe. Must be one of E4M3 or HYBRID.
- amax_history_len (int, optional, defaults to 1024) — The length of the history to use for the scaling factor computation.
- amax_compute_algo (str, optional, defaults to "most_recent") — The algorithm to use for the scaling factor computation. Must be one of max or most_recent.
- override_linear_precision (tuple of three bool, optional, defaults to (False, False, False)) — Whether or not to execute the fprop, dgrad, and wgrad GEMMs in higher precision.
- opt_level (str, optional, defaults to "O2") — What level of 8-bit collective communication should be used with MS-AMP. Must be one of O1 or O2. In general:
  - O1: Weight gradients and all_reduce communications are done in FP8, reducing GPU memory usage and communication bandwidth.
  - O2: First-order optimizer states are in 8-bit, and second-order states are in FP16. Only available when using Adam or AdamW. This maintains accuracy and can potentially save the most memory.
  - O3: Specifically for DeepSpeed, implements capabilities so weights and master weights of models are stored in FP8. If fp8 is selected and DeepSpeed is enabled, this level will be used by default (not currently available).
Use this object in your Accelerator to customize the initialization of the recipe for FP8 mixed precision training with transformer-engine or ms-amp.
For more information on transformer-engine args, please refer to the API documentation.
For more information on the ms-amp args, please refer to the Optimization Level documentation.
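To make the roles of margin, amax_history_len and amax_compute_algo more concrete, here is a simplified, standard-library-only sketch of how a delayed-scaling recipe might turn recorded absolute-maximum (amax) values into a scaling factor. The helper compute_scale and its arguments are hypothetical names chosen for illustration, and the formula is an assumption about the general technique; this is not the transformer-engine implementation.

```python
from collections import deque

# Largest finite value representable in the E4M3 FP8 format.
E4M3_MAX = 448.0


def compute_scale(amax_history, fp8_max, margin=0,
                  amax_compute_algo="most_recent", previous_scale=1.0):
    """Hypothetical helper: derive a scaling factor from recorded amax values.

    `amax_history` is ordered from oldest to newest.
    """
    if amax_compute_algo == "max":
        amax = max(amax_history)       # largest value seen in the window
    elif amax_compute_algo == "most_recent":
        amax = amax_history[-1]        # only the newest value counts
    else:
        raise ValueError(f"unknown amax_compute_algo: {amax_compute_algo!r}")
    # Keep the previous scale when amax is unusable (zero, inf or NaN).
    if amax <= 0.0 or amax == float("inf") or amax != amax:
        return previous_scale
    # A larger margin leaves more headroom by shrinking the scale.
    return fp8_max / amax / (2 ** margin)


# A bounded window plays the role of amax_history_len=4 in this sketch.
history = deque(maxlen=4)
for amax in [2.0, 8.0, 4.0, 1.0, 0.5]:
    history.append(amax)

recent = compute_scale(history, E4M3_MAX, amax_compute_algo="most_recent")
largest = compute_scale(history, E4M3_MAX, amax_compute_algo="max")
```

In this sketch "most_recent" reacts immediately to the newest value, while "max" is more conservative because a single large value keeps the scale low until it leaves the history window.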
ProfileKwargs
class accelerate.ProfileKwargs
< source >( activities: Optional = None schedule_option: Optional = None on_trace_ready: Optional = None record_shapes: bool = False profile_memory: bool = False with_stack: bool = False with_flops: bool = False with_modules: bool = False output_trace_dir: Optional = None )
Parameters
- activities (List[str], optional, defaults to None) — The list of activity groups to use in profiling. Each entry must be one of "cpu", "xpu", "mtia", or "cuda".
- schedule_option (Dict[str, int], optional, defaults to None) — The schedule option to use for the profiler. Available keys are wait, warmup, active, repeat and skip_first. The profiler skips the first skip_first steps, then waits for wait steps, then warms up for the next warmup steps, then actively records for the next active steps, and then repeats the cycle starting with the wait steps. The optional repeat parameter sets the number of cycles; a value of zero means the cycles continue until profiling finishes.
- on_trace_ready (Callable, optional, defaults to None) — Callable that is called at each step when the schedule returns ProfilerAction.RECORD_AND_SAVE during the profiling.
- record_shapes (bool, optional, defaults to False) — Save information about operators' input shapes.
- profile_memory (bool, optional, defaults to False) — Track tensor memory allocation/deallocation.
- with_stack (bool, optional, defaults to False) — Record source information (file and line number) for the ops.
- with_flops (bool, optional, defaults to False) — Use formulas to estimate the FLOPS of specific operators.
- with_modules (bool, optional, defaults to False) — Record module hierarchy (including function names) corresponding to the callstack of the op.
- output_trace_dir (str, optional, defaults to None) — Directory to which the collected trace is exported in Chrome JSON format. The resulting JSON files can be viewed in Chrome at chrome://tracing. Defaults to None, which means profiling does not store JSON files.
Use this object in your Accelerator to customize the initialization of the profiler. Please refer to the documentation of this context manager for more information on each argument.
torch.profiler is only available in PyTorch 1.8.1 and later versions.
Example:
from accelerate import Accelerator
from accelerate.utils import ProfileKwargs
kwargs = ProfileKwargs(activities=["cpu", "cuda"])
accelerator = Accelerator(kwargs_handlers=[kwargs])
The build() method builds a profiler object with the current configuration.
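The cycle described for schedule_option (skip_first, then wait, warmup and active, optionally repeated) can be traced with a small pure-Python sketch. The function profiler_phase and its string return values are hypothetical and only mirror the behavior described above; the real profiler returns ProfilerAction values, and this sketch does not call torch.profiler.

```python
def profiler_phase(step, wait=0, warmup=0, active=1, repeat=0, skip_first=0):
    """Hypothetical helper: name the schedule phase at a 0-based `step`."""
    if step < skip_first:
        return "none"                  # initial steps are skipped entirely
    step -= skip_first
    cycle_len = wait + warmup + active
    if repeat > 0 and step >= repeat * cycle_len:
        return "none"                  # all requested cycles are done
    pos = step % cycle_len             # position inside the current cycle
    if pos < wait:
        return "none"                  # waiting, nothing is recorded
    if pos < wait + warmup:
        return "warmup"                # profiler warms up, results discarded
    if pos < cycle_len - 1:
        return "record"                # active recording
    return "record_and_save"           # last active step: trace is handed off


schedule_option = {"wait": 1, "warmup": 1, "active": 2, "repeat": 1, "skip_first": 1}
phases = [profiler_phase(step, **schedule_option) for step in range(7)]
```

With these values, step 0 is skipped, step 1 waits, step 2 warms up, steps 3 and 4 record, and the last recording step is the one at which on_trace_ready would be called. Because repeat is 1, every later step is idle.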
GradScalerKwargs
class accelerate.GradScalerKwargs
< source >( init_scale: float = 65536.0 growth_factor: float = 2.0 backoff_factor: float = 0.5 growth_interval: int = 2000 enabled: bool = True )
Use this object in your Accelerator to customize the behavior of mixed precision, specifically how the torch.cuda.amp.GradScaler used is created. Please refer to the documentation of this scaler for more information on each argument.
GradScaler is only available in PyTorch 1.5.0 and later versions.
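As a rough illustration of what init_scale, growth_factor, backoff_factor and growth_interval control, the sketch below models a dynamic loss scale in plain Python. The class LossScaleSketch is hypothetical and assumes the usual dynamic loss-scaling rule (back off on overflow, grow after a run of clean steps); it is not the GradScaler implementation and does not touch any tensors.

```python
from dataclasses import dataclass


@dataclass
class LossScaleSketch:
    """Hypothetical stand-in that mimics how a dynamic loss scale evolves."""

    init_scale: float = 65536.0
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000

    def __post_init__(self):
        self.scale = self.init_scale
        self.clean_steps = 0           # consecutive steps without inf/NaN

    def update(self, found_inf):
        if found_inf:
            # Overflow detected: shrink the scale and restart the count.
            self.scale *= self.backoff_factor
            self.clean_steps = 0
        else:
            self.clean_steps += 1
            if self.clean_steps == self.growth_interval:
                # Enough clean steps in a row: try a larger scale.
                self.scale *= self.growth_factor
                self.clean_steps = 0
        return self.scale


# A short growth_interval keeps the trace easy to follow.
scaler = LossScaleSketch(growth_interval=3)
trace = [scaler.update(found_inf) for found_inf in [True, False, False, False]]
```

Here one overflow halves the scale from 65536.0 to 32768.0, and three clean steps in a row double it back.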
InitProcessGroupKwargs
class accelerate.InitProcessGroupKwargs
< source >( backend: Optional = 'nccl' init_method: Optional = None timeout: Optional = None )
Use this object in your Accelerator to customize the initialization of the distributed processes. Please refer to the documentation of this method for more information on each argument.
Note: If timeout is set to None, the default will be based upon how backend is set.
KwargsHandler
Internal mixin that implements a to_kwargs() method for a dataclass.
to_kwargs() returns a dictionary containing the attributes whose values differ from the defaults of this class.
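The idea behind to_kwargs() can be sketched with the standard library alone: compare an instance against a freshly constructed default instance and keep only the fields that differ. The names DefaultDiffMixin and ScalerOptions below are hypothetical, and the code is a minimal sketch of the technique rather than the library's own implementation.

```python
import copy
from dataclasses import dataclass


class DefaultDiffMixin:
    """Hypothetical mixin: report only the fields that differ from defaults."""

    def to_dict(self):
        return copy.deepcopy(self.__dict__)

    def to_kwargs(self):
        # A default-constructed instance supplies the baseline values.
        defaults = self.__class__().to_dict()
        return {k: v for k, v in self.to_dict().items() if defaults[k] != v}


@dataclass
class ScalerOptions(DefaultDiffMixin):
    # Field names and defaults borrowed from the GradScalerKwargs signature.
    init_scale: float = 65536.0
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000
    enabled: bool = True


overrides = ScalerOptions(init_scale=1024.0, growth_interval=100).to_kwargs()
```

Only the two overridden fields end up in the dictionary, so the wrapped PyTorch object keeps its own defaults for everything else.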