# Moving between FSDP and DeepSpeed

🤗 Accelerate offers flexibility in the choice of training framework by integrating two extremely powerful tools for distributed training: PyTorch FSDP and Microsoft DeepSpeed. The aim of this tutorial is to draw parallels, as well as to outline potential differences, to empower the user to switch seamlessly between these two frameworks.
To switch between the frameworks, we recommend launching code with 🤗 `accelerate launch`, passing in the correct config file with `--config_file`, or passing in the respective arguments directly for FSDP and DeepSpeed.

Example 🤗 Accelerate configurations can be found here for DeepSpeed and FSDP, or in the example zoo under "Launch Configurations".

This tutorial is for single-node, multi-GPU scenarios only.
## Configuring Functionalities

Model tensors are split across different GPUs in an attempt to scale up model sizes; this is termed *sharding* in FSDP, and *partitioning* in DeepSpeed. FSDP sharding and DeepSpeed ZeRO (partitioning) stages are configured by `--fsdp_sharding_strategy` and `--zero_stage`, respectively. In particular, FSDP `FULL_SHARD` maps to DeepSpeed ZeRO stage 3; see this comprehensive mapping between FSDP sharding and DeepSpeed ZeRO settings. The table below summarizes and groups similar settings:
Group | Framework | Configuration | Example | Restrictions (if any) |
---|---|---|---|---|
sharding / partitioning | FSDP<br>DeepSpeed | `--fsdp_sharding_strategy`<br>`--zero_stage` | `1` (`FULL_SHARD`)<br>`3` | |
offload | FSDP<br>DeepSpeed | `--fsdp_offload_params`<br>`--offload_param_device`<br>`--offload_optimizer_device` | `true`<br>`cpu`<br>`cpu` | all or nothing<br> |
model loading | FSDP<br>DeepSpeed | `--fsdp_cpu_ram_efficient_loading`<br>`--zero3_init_flag` | `true`<br>`true` | <br>only ZeRO 3 |
efficient checkpointing | FSDP<br>DeepSpeed | `--fsdp_state_dict_type`<br>`--zero3_save_16bit_model` | `SHARDED_STATE_DICT`<br>`true` | <br>only ZeRO 3 |
weights prefetching | FSDP<br>DeepSpeed | `--fsdp_forward_prefetch`<br>`--fsdp_backward_prefetch`<br>None | `true`<br>`BACKWARD_PRE` | |
model | FSDP<br>DeepSpeed | `--fsdp_auto_wrap_policy`<br>`--fsdp_transformer_layer_cls_to_wrap`<br>None | `TRANSFORMER_BASED_WRAP`<br>`<Layer Class>` | Usually not needed<br>Transparent to user. |
parameters summoning | FSDP<br>DeepSpeed | `--fsdp_use_orig_params`<br>None | `true` | required for `torch.compile`<br>Transparent to user |
parameters syncing | FSDP<br>DeepSpeed | `--fsdp_sync_module_states`<br>None | `true` | |
training | FSDP<br>DeepSpeed | None<br>`--gradient_accumulation_steps`<br>`--gradient_clipping` | <br>`auto`<br>`auto` | Transparent to user<br> |
For detailed descriptions of the above, refer to the 🤗 Accelerate launch documentation.
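
To make the mapping above concrete, below is a minimal, illustrative sketch of what the two config files could look like, assuming the YAML keys mirror the launch flags in the table (file names and values are only examples):

```yaml
# fsdp_example.yaml (illustrative sketch)
distributed_type: FSDP
mixed_precision: bf16
num_processes: 8                       # GPUs on the single node
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD   # maps to DeepSpeed ZeRO stage 3
```

```yaml
# deepspeed_example.yaml (illustrative sketch)
distributed_type: DEEPSPEED
mixed_precision: bf16
num_processes: 8
deepspeed_config:
  zero_stage: 3                        # maps to FSDP FULL_SHARD
```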
To access other DeepSpeed configurations, such as mixed precision settings, you need to pass in a `--deepspeed_config_file`; see the documentation.
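
For instance, a hedged sketch of pointing 🤗 Accelerate at a separate DeepSpeed config file could look like the following; the file name is hypothetical, and the file itself follows DeepSpeed's own configuration schema:

```yaml
# Excerpt from a DeepSpeed accelerate config (illustrative)
deepspeed_config:
  deepspeed_config_file: ds_config.json   # hypothetical path; holds mixed precision, ZeRO tuning, etc.
  zero3_init_flag: true
```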
DeepSpeed can also be configured via `DeepSpeedPlugin`, e.g., `DeepSpeedPlugin.zero_stage` is equivalent to `--zero_stage`, and `DeepSpeedPlugin.hf_ds_config` can be used to pass `--deepspeed_config_file`.

FSDP can also be configured via `FullyShardedDataParallelPlugin`, e.g., `FullyShardedDataParallelPlugin.sharding_strategy` is equivalent to `--fsdp_sharding_strategy`.
## Checkpointing

Do note that FSDP can be configured via `--fsdp_state_dict_type` to save either full or sharded checkpoints.

For DeepSpeed ZeRO-3, one could pass a `--zero3_save_16bit_model true`, which conveniently consolidates the model to a single rank and saves; this is the FSDP equivalent of `fsdp_state_dict_type: FULL_STATE_DICT`.

For large models, consolidating the model to a single rank can be very slow.

For quicker checkpointing, for FSDP use `fsdp_state_dict_type: SHARDED_STATE_DICT`, and for DeepSpeed ZeRO-3 use the `zero_to_fp32.py` script to post-convert sharded checkpoints.
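
As an illustrative sketch (assuming the config-file keys mirror the flags above), the two checkpointing choices could be expressed as:

```yaml
# Excerpt from an FSDP accelerate config (illustrative)
fsdp_config:
  fsdp_state_dict_type: SHARDED_STATE_DICT   # or FULL_STATE_DICT to consolidate on a single rank

# Excerpt from a DeepSpeed ZeRO-3 accelerate config (illustrative)
deepspeed_config:
  zero_stage: 3
  zero3_save_16bit_model: true               # consolidates a 16-bit model on one rank when saving
```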
## Offloading

FSDP only allows all-or-nothing offload (i.e., either offload parameters, gradients, and optimizer state, or keep them all on the GPU), whereas DeepSpeed can offload parameters and optimizer state independently. Furthermore, DeepSpeed also supports offloading to NVMe.
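
A hedged sketch of the corresponding config excerpts, using the keys listed in the table above:

```yaml
# Excerpt from an FSDP accelerate config (illustrative): all-or-nothing offload
fsdp_config:
  fsdp_offload_params: true        # offloads parameters, gradients, and optimizer state together

# Excerpt from a DeepSpeed accelerate config (illustrative): independent offload targets
deepspeed_config:
  zero_stage: 3
  offload_param_device: cpu        # cpu, nvme, or none
  offload_optimizer_device: cpu    # may differ from the parameter offload device
```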
## Prefetching

FSDP allows two prefetching configurations, `--fsdp_forward_prefetch` and `--fsdp_backward_prefetch`, to improve the overlap of communication and computation at the cost of extra memory; see the FSDP documentation.

For DeepSpeed, prefetching is turned on when needed, depending on certain hyper-parameters like `stage3_param_persistence_threshold`, `stage3_max_reuse_distance`, etc., that can be configured for ZeRO-3; 🤗 `accelerate` may set these hyper-parameters automatically if you don't set them explicitly in the DeepSpeed config file.

For FSDP, set `fsdp_backward_prefetch: BACKWARD_PRE` for improved throughput if memory allows.
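
A minimal, illustrative FSDP excerpt (the DeepSpeed hyper-parameters mentioned above would instead live in the DeepSpeed config file):

```yaml
# Excerpt from an FSDP accelerate config (illustrative)
fsdp_config:
  fsdp_forward_prefetch: true           # prefetch the next all-gather during the forward pass
  fsdp_backward_prefetch: BACKWARD_PRE  # better comms/compute overlap, at the cost of extra memory
```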
## Model Loading

While FSDP requires an explicit `--fsdp_cpu_ram_efficient_loading true` to activate efficient model loading, 🤗 `transformers` will activate a similar feature whenever DeepSpeed ZeRO-3 is used.

For FSDP, whenever you set `--fsdp_cpu_ram_efficient_loading true`, 🤗 `accelerate` will automatically set `sync_module_states` to true. With RAM-efficient loading the weights are loaded only on a single rank, which therefore requires `sync_module_states` to broadcast the weights to the other ranks.
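
The corresponding config excerpts might look like the following sketch (keys as in the table above):

```yaml
# Excerpt from an FSDP accelerate config (illustrative)
fsdp_config:
  fsdp_cpu_ram_efficient_loading: true   # weights are loaded on a single rank only
  fsdp_sync_module_states: true          # set automatically by accelerate; broadcasts weights to other ranks

# Excerpt from a DeepSpeed accelerate config (illustrative)
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: true                  # memory-efficient initialization of large models under ZeRO-3
```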
## Model

FSDP requires an explicit `--fsdp_auto_wrap_policy` for the algorithm to decide how to schedule the all-gather and reduce-scatter operations. For DeepSpeed this is transparent to the user.

For FSDP, simply set `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP`. With the latest `transformers` versions, we try our best to figure out the suitable `fsdp_transformer_layer_cls_to_wrap` for HF transformers models. However, if you get an error regarding it, please specify this.
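
An illustrative FSDP excerpt; `LlamaDecoderLayer` is only an example layer class, to be replaced by the transformer block of your model if auto-detection fails:

```yaml
# Excerpt from an FSDP accelerate config (illustrative)
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  # Usually inferred automatically for HF transformers models;
  # specify explicitly only if you hit an error (example class shown).
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```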
## Parameters Summoning

FSDP requires an explicit `--fsdp_use_orig_params` flag if using `torch.compile`; see the PyTorch documentation. For DeepSpeed this is transparent to the user.

For FSDP, when using `torch.compile` please set `fsdp_use_orig_params: True`.
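
A minimal, illustrative excerpt:

```yaml
# Excerpt from an FSDP accelerate config (illustrative)
fsdp_config:
  fsdp_use_orig_params: true   # required when compiling the model with torch.compile
```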
## Training

DeepSpeed requires explicit `--gradient_accumulation_steps` and `--gradient_clipping` flags. For FSDP this is transparent to the user.

When using DeepSpeed, set `gradient_accumulation_steps: "auto"` and `gradient_clipping: "auto"` to automatically pick up the values set in the `Accelerator` or `TrainingArguments` (if using `transformers`).
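
An illustrative DeepSpeed excerpt, assuming your setup resolves the `"auto"` values from the `Accelerator` or `TrainingArguments` as described above:

```yaml
# Excerpt from a DeepSpeed accelerate config (illustrative)
deepspeed_config:
  zero_stage: 3
  gradient_accumulation_steps: "auto"   # resolved from the Accelerator / TrainingArguments
  gradient_clipping: "auto"
```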
## On Differences in Data Precision Handling

To discuss how data precision is handled in both FSDP and DeepSpeed, it is instructive to first give an overview of how model parameters are handled in these frameworks. Before the model / optimizer parameters are distributed across GPUs, parameter preparation first "flattens" them into one-dimensional `torch.Tensor`s. The implementations of FSDP and DeepSpeed differ in the `dtype` in which these "flattened" parameters are stored, and there are ramifications with regard to how `torch.Optimizer`s allocate their `dtype`s. The table below outlines the processes for both frameworks; the "Local" column indicates the process occurring at a per-GPU level, therefore any memory overhead from upcasting should be understood as amortized by the number of GPUs used.

As a rule of thumb, for stable training with automatic mixed precision, all the trainable parameters have to be in `torch.float32`.
Process | Local | Framework | Details |
---|---|---|---|
Loading, i.e., `AutoModel.from_pretrained(..., torch_dtype=torch_dtype)` | | | |
Preparation, i.e., creation of "flat params" | ✅ | FSDP<br>DeepSpeed | created in `torch_dtype`.<br>disregards `torch_dtype`, created in `float32`. |
Optimizer initialization | ✅ | FSDP<br>DeepSpeed | creates parameters in `torch_dtype`<br>creates parameters in `float32` |
Training Step, i.e., forward, backward, reduction | | FSDP<br>DeepSpeed | follows `MixedPrecision`<br>follows `deepspeed_config_file` mixed precision settings. |
Optimizer (Pre-Step) | ✅ | FSDP<br>DeepSpeed | upcasting (if any) to `torch_dtype`<br>upcasted to `float32` |
Optimizer (Actual Step) | ✅ | FSDP<br>DeepSpeed | occurs in `torch_dtype`<br>occurs in `float32`. |
Therefore, when using DeepSpeed with a small number of GPUs, be aware of potentially significant memory overheads due to the upcasting during preparation.

With FSDP, in the absence of mixed precision, it is possible to operate the `torch.Optimizer` in low precision `torch_dtype`, which may be helpful when using a small number of GPUs.

With mixed precision, FSDP and DeepSpeed will upcast in the model preparation step (c.f. the table above). But do note that FSDP will then save checkpoints in the upcasted precision; DeepSpeed may still save low precision checkpoints if `--zero3_save_16bit_model` is specified.

To clarify the above table, consider the concrete examples below; the optimizer pre-step and actual step are combined for brevity. With FSDP it is possible to operate in the two modes shown below, but DeepSpeed can only operate in one.
Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local) |
---|---|---|---|---|---|
FSDP | bf16 | default (none) | bf16 | bf16 | bf16 |
FSDP | bf16 | bf16 | fp32 | bf16 | fp32 |
DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32 |
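
As an illustrative sketch, the difference between the first and second FSDP rows above comes down to whether mixed precision is enabled in the accelerate config (the model itself being loaded with `torch_dtype=torch.bfloat16` in both cases):

```yaml
# Mode 1 (first row): no mixed precision; flat params and optimizer stay in bf16
mixed_precision: "no"

# Mode 2 (second row): bf16 mixed precision; flat params / optimizer are upcast to fp32 locally
# mixed_precision: bf16
```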