Toolmaker. Software creator, optimizer and harmonizer.
Makes things work and fly at Contextual.AI
Training LLM/RAG/Generative AI/Machine Learning/Scalability
If you remember the BigScience BLOOM-176B training, Tunji Ruwase and I co-invented this technology for Megatron-DeepSpeed to enable quickly scaling the node topology up and down while training continued.
Since then the DeepSpeed team has kept improving on it, and it is now fully integrated into DeepSpeed.
A combined effort from the IBM and PyTorch teams achieved incredible training performance with ZeRO/FSDP, on par with 3D parallelism on H100s, despite having only an 800Gbps inter-node interconnect.
This is because they achieved a nearly complete overlap between communication and compute, and introduced a novel selective activation recomputation method that recomputes only large but inexpensive activations.
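The selection idea can be sketched as a small greedy policy. This is an illustration only, not the IBM/PyTorch teams' actual algorithm: the `ops` records and the bytes-per-FLOP ranking are assumptions made up for the example, standing in for real per-layer activation sizes and recompute costs.

```python
def select_for_recompute(ops, mem_budget_bytes):
    """Pick which activations to discard and recompute in the backward pass.

    Greedy heuristic (illustrative, not the production policy): prefer
    activations that free the most memory per FLOP of recompute cost --
    i.e. the "large but inexpensive" ones -- until the activations we
    still stash fit within the memory budget.

    Each op is a dict: {"name": str, "act_bytes": int, "recompute_flops": int}.
    """
    total = sum(op["act_bytes"] for op in ops)
    # Rank by bytes freed per recompute FLOP, best candidates first.
    ranked = sorted(
        ops,
        key=lambda op: op["act_bytes"] / op["recompute_flops"],
        reverse=True,
    )
    chosen = []
    for op in ranked:
        if total <= mem_budget_bytes:
            break  # remaining stashed activations already fit
        chosen.append(op["name"])
        total -= op["act_bytes"]
    return chosen


# Hypothetical per-op numbers: a GELU produces a big activation that is
# trivial to recompute, while a matmul output is costly to regenerate.
ops = [
    {"name": "gelu", "act_bytes": 400, "recompute_flops": 1},
    {"name": "matmul", "act_bytes": 300, "recompute_flops": 100},
    {"name": "layernorm", "act_bytes": 200, "recompute_flops": 2},
]
print(select_for_recompute(ops, mem_budget_bytes=500))   # -> ['gelu']
print(select_for_recompute(ops, mem_budget_bytes=300))   # -> ['gelu', 'layernorm']
```

With a tight budget the policy drops the cheap-to-recompute GELU and LayerNorm outputs first and keeps the expensive matmul result stashed, which is exactly the "large but inexpensive" trade-off described above.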