ZeRO-3 vs FSDP 1 vs FSDP 2
I realize that the blog post equates ZeRO 3 and FSDP. But what are the big differences in the DeepSpeed ZeRO 3 (https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py) implementation and the Meta FSDP implementation (https://github.com/pytorch/pytorch/blob/v2.6.0/torch/distributed/fsdp/fully_sharded_data_parallel.py#L127)?
Additionally, it's well documented that the Meta FSDP implementation is less stable for longer training runs than the DeepSpeed ZeRO 3 implementation. Why is this the case? Is there anything inherent about the differences between the two implementations?
Finally, do you think the stability issues could be solved in FSDP 2 (https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md)?
I realize that the blog post equates ZeRO 3 and FSDP. But what are the big differences in the DeepSpeed ZeRO 3
You can think of FSDP as the PyTorch-native way to do ZeRO-3, which was initially implemented in DeepSpeed. The idea of ZeRO-3 stays the same, but each library implements it in its own way to make it compatible with its checkpointing logic, with other parallelisms, etc. You can find some of these differences in Accelerate's docs, but keep in mind that all of these libraries are still evolving to maximize efficiency, so in a month or two the two implementations may well converge.
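To make the shared idea concrete, here is a toy single-process sketch of the ZeRO-3 / FSDP sharding pattern. This is a hypothetical illustration, not DeepSpeed's or PyTorch's actual code: each of N "ranks" keeps only a 1/N slice of the flattened parameters, the full vector is all-gathered right before it is needed for forward/backward, and freed again afterwards.

```python
import numpy as np

def shard(flat_params: np.ndarray, world_size: int) -> list:
    """Split a flat parameter vector into one contiguous shard per rank
    (padding the tail so every rank holds the same number of elements)."""
    padded_len = ((len(flat_params) + world_size - 1) // world_size) * world_size
    padded = np.zeros(padded_len, dtype=flat_params.dtype)
    padded[: len(flat_params)] = flat_params
    return np.array_split(padded, world_size)

def all_gather(shards: list, numel: int) -> np.ndarray:
    """Reassemble the full parameter vector from all shards, dropping padding.
    In a real run this is the collective issued just before forward/backward."""
    return np.concatenate(shards)[:numel]

params = np.arange(10, dtype=np.float32)       # pretend flattened model weights
shards = shard(params, world_size=4)           # each rank stores ~1/4 of the memory
full = all_gather(shards, numel=len(params))   # materialized only around compute
assert np.array_equal(full, params)
```

Both libraries implement this same gather/free cycle; the differences are in how the flat buffers are built, when the collectives are scheduled, and how the shards map back to named parameters for checkpointing.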
Additionally, it's well documented that the Meta FSDP implementation is less stable for longer training runs than the DeepSpeed ZeRO 3 implementation. Why is this the case? Is there anything inherent about the differences between the two implementations?
source for this claim?
Finally, do you think the stability issues could be solved in FSDP 2 (https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md)?
In the blog post we linked to a nice write-up that explains some of the advantages of FSDP2 over FSDP (search for "FSDP2").
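One advantage that write-up highlights is the change in sharding layout, which can be sketched with toy tensors. This is an illustrative simplification, not either library's real code: FSDP1 flattens a group of parameters into one 1-D buffer and slices that buffer per rank (so parameter boundaries can fall mid-slice), while FSDP2 shards each parameter individually along dim 0, so every shard is itself a well-formed smaller tensor, which simplifies per-parameter state such as checkpoints and optimizer states.

```python
import numpy as np

params = {"w1": np.ones((4, 2)), "w2": np.full((2, 2), 2.0)}
world_size = 2

# FSDP1-style: one flat buffer per wrapped module, sliced per rank.
# A slice has no tensor structure of its own.
flat = np.concatenate([p.ravel() for p in params.values()])
fsdp1_shards = np.array_split(flat, world_size)

# FSDP2-style: each parameter sharded on dim 0, so each shard keeps the
# parameter's dtype and trailing shape (what DTensor tracks for real).
fsdp2_shards = {name: np.array_split(p, world_size, axis=0)
                for name, p in params.items()}

assert [s.shape for s in fsdp1_shards] == [(6,), (6,)]
assert [s.shape for s in fsdp2_shards["w1"]] == [(2, 2), (2, 2)]
assert [s.shape for s in fsdp2_shards["w2"]] == [(1, 2), (1, 2)]
```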
Hope that answers your questions! :)