ZeRO Optimization Strategies for Large-Scale Model Training - A Brief Performance Analysis

Community Article Published September 3, 2025

A Technical Deep-Dive into DeepSpeed ZeRO Performance Characteristics for Training AI Systems


Summary

As AI systems scale to handle complex tasks, the choice of memory optimization strategy becomes critical for both training efficiency and economic viability. This short analysis examines the performance characteristics of Microsoft DeepSpeed's Zero Redundancy Optimizer (ZeRO) stages across different hardware configurations, with a specific focus on large-scale continued pretraining for production AI models. Through empirical testing on NVIDIA H100s, we characterize the trade-offs between memory efficiency, training throughput, and communication overhead across ZeRO-0, ZeRO-2, and ZeRO-3 configurations. Your results may vary with your setup.

Introduction

The development of robust small language models (SLMs) still requires training on large datasets across diverse domains. Modern deep learning workloads demand efficient utilization of GPU infrastructure while maintaining cost-effectiveness and training stability.

The Zero Redundancy Optimizer (ZeRO) family of techniques, introduced by Microsoft, addresses memory bottlenecks in large model training by partitioning optimizer states, gradients, and model parameters across distributed devices. However, the optimal ZeRO configuration varies significantly with model size, hardware topology, and dataset characteristics.

Important: ZeRO is for Multi-GPU Training Only

Clarification: ZeRO optimization strategies are designed exclusively for multi-GPU distributed training scenarios. If you're training on a single GPU, you would use standard PyTorch training without any ZeRO configuration.

When to Consider ZeRO:

  • 2+ GPUs in your training setup
  • Model size approaching GPU memory limits
  • Need to scale beyond single-GPU memory constraints

When NOT to Use ZeRO:

  • Single GPU training
  • Models that easily fit in single GPU memory with room to spare

The analysis in this article assumes a multi-GPU training scenario; a quick way to gate on this is sketched below.
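As a minimal sketch (not part of the original experiments; the config filename is a placeholder), a launcher can decide whether to reach for a ZeRO config based on the number of visible GPUs:

```python
import torch

# Only reach for a ZeRO config when more than one GPU is visible;
# on a single device, standard PyTorch training is the right tool.
num_gpus = torch.cuda.device_count()

if num_gpus > 1:
    deepspeed_config = "ds_zero2_config.json"  # placeholder path to a ZeRO config
else:
    deepspeed_config = None  # single GPU: no ZeRO, plain training

print(f"GPUs visible: {num_gpus}, DeepSpeed config: {deepspeed_config}")
```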

Methodology

Experimental Setup

Hardware Configuration:

  • 8x NVIDIA H100 (80GB HBM3)
  • NVLink 4.0 with NVSwitch fabric
  • NVIDIA Fabric Manager enabled
  • InfiniBand HDR networking

Model Specifications:

  • Base Model: Gemma2-9B architecture (our test model)
  • Training Type: Continued pretraining (CPT)
  • Dataset: ~4.5M examples
  • Sequence Length: 2048 tokens
  • Precision: bfloat16

Framework:

  • DeepSpeed with ZeRO optimization
  • Transformers 4.55.0+
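As an illustration of how this stack is usually wired together (a sketch with assumed hyperparameters, not our exact launch script), the Transformers TrainingArguments can point at a DeepSpeed JSON config; the micro-batch/accumulation split shown here is just one way to reach the effective batch of 144 used later in this article:

```python
from transformers import TrainingArguments

# Sketch: attaching a DeepSpeed ZeRO config to a Transformers training run.
# Assumes DeepSpeed is installed and ds_zero2_config.json exists
# (a minimal version is sketched in the next section).
training_args = TrainingArguments(
    output_dir="./cpt-output",
    per_device_train_batch_size=2,     # micro-batch per GPU (illustrative)
    gradient_accumulation_steps=9,     # 2 x 9 x 8 GPUs = 144 effective batch
    bf16=True,                         # bfloat16 mixed precision
    deepspeed="ds_zero2_config.json",  # the ZeRO stage lives in this config
)
```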

ZeRO Configuration Matrix

We evaluated three primary ZeRO configurations; minimal configuration sketches follow the list:

  1. ZeRO-0 (Baseline DDP): No parameter sharding
  2. ZeRO-2: Optimizer state + gradient sharding
  3. ZeRO-3: Full parameter + optimizer + gradient sharding
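For reference, DeepSpeed config sketches for these three stages could look like the following. The values are illustrative, not our exact configs; "auto" lets the Transformers integration fill in batch-size and bucket settings.

```python
import json

# Illustrative ZeRO configs; "auto" defers sizing to the Trainer integration.
zero0 = {
    "train_batch_size": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 0},          # plain DDP behaviour
}

zero2 = {
    "train_batch_size": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                             # shard optimizer states + gradients
        "overlap_comm": True,                   # overlap reduction with the backward pass
        "contiguous_gradients": True,
    },
}

zero3 = {
    "train_batch_size": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                             # shard parameters as well
        "overlap_comm": True,
        "stage3_prefetch_bucket_size": "auto",  # prefetch parameters ahead of use
        "stage3_param_persistence_threshold": "auto",
    },
}

with open("ds_zero2_config.json", "w") as f:    # filename referenced earlier in this article
    json.dump(zero2, f, indent=2)
```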

Results and Analysis

Memory Utilization

| ZeRO Stage | Peak GPU Memory | Memory per GPU | Model Capacity |
| --- | --- | --- | --- |
| ZeRO-0 | 76GB | 76GB | ~7B params max |
| ZeRO-2 | 45GB | 45GB | ~13B params |
| ZeRO-3 | 28GB | 28GB | ~30B params |

Key Finding: ZeRO-3 enables training of significantly larger models on the same hardware, with a 2.7x reduction in memory requirements compared to standard DDP.
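To see why the measured 2.7x reduction is smaller than the naive 1/N sharding factor, it helps to separate model states (which ZeRO shards) from activations and buffers (which it does not). The sketch below assumes the original ZeRO paper's mixed-precision Adam accounting of 16 bytes per parameter, so its absolute numbers will not reproduce the measured table above; the point is how sharding scales with stage and GPU count.

```python
def model_state_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU *model state* memory in GB for mixed-precision Adam.

    Assumes 2 bytes (half-precision weights) + 2 bytes (grads) + 12 bytes
    (fp32 master weights and Adam moments) per parameter, as in the ZeRO paper.
    Activations, buffers, and fragmentation are deliberately excluded.
    """
    weights, grads, optim = 2 * params_billion, 2 * params_billion, 12 * params_billion
    if stage == 0:       # everything replicated on every GPU
        return weights + grads + optim
    if stage == 2:       # gradients + optimizer states sharded
        return weights + (grads + optim) / num_gpus
    if stage == 3:       # weights sharded as well
        return (weights + grads + optim) / num_gpus
    raise ValueError("this sketch only covers stages 0, 2, and 3")

baseline = model_state_gb(9, 8, 0)
for stage in (2, 3):
    gb = model_state_gb(9, 8, stage)
    print(f"ZeRO-{stage}: ~{gb:.0f} GB of model states per GPU "
          f"({baseline / gb:.1f}x less than ZeRO-0)")
```

The measured reductions (roughly 1.7x for ZeRO-2 and 2.7x for ZeRO-3) are smaller than these model-state ratios because activations and temporary buffers are replicated on every GPU regardless of ZeRO stage.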

Training Throughput

| Configuration | Tokens/sec/GPU | Effective Batch | Steps/Hour | Relative Performance |
| --- | --- | --- | --- | --- |
| ZeRO-0 | 2,847 | 144 | 385 | 100% (baseline) |
| ZeRO-2 | 2,698 | 144 | 365 | 94.7% |
| ZeRO-3 | 2,234 | 144 | 302 | 78.5% |

Communication Overhead Analysis

The performance degradation in ZeRO-3 primarily stems from increased all-gather operations for parameter reconstruction:

ZeRO-0: All-reduce of gradients each step (parameters are broadcast only at initialization)
ZeRO-2: Reduce-scatter of gradients plus all-gather of updated parameters (roughly the same per-step volume as ZeRO-0, split over more collectives)
ZeRO-3: Parameter all-gathers in the forward and backward passes plus gradient reduce-scatter (~1.5x the per-step volume, spread over many more, finer-grained collectives)

However, on H100 systems with NVSwitch, the high-bandwidth interconnect (~900 GB/s) significantly mitigates this overhead compared to PCIe-only or network-attached systems, so keep your interconnect in mind.
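As a rough sanity check (a sketch that assumes the ZeRO paper's per-GPU communication volumes: about 2 elements per parameter per step for ZeRO-0/ZeRO-2 and about 3 for ZeRO-3, ignoring gradient accumulation), the raw transfer time on a ~900 GB/s fabric stays in the tens of milliseconds:

```python
PARAMS = 9e9          # Gemma2-9B class model
BYTES = 2             # bf16
BANDWIDTH_GBS = 900   # approximate NVSwitch per-GPU bandwidth (GB/s)

# Per-GPU traffic per forward/backward/update cycle, following the ZeRO
# paper's volume analysis (an assumption for this sketch):
volumes = {"ZeRO-0": 2 * PARAMS, "ZeRO-2": 2 * PARAMS, "ZeRO-3": 3 * PARAMS}

for stage, elems in volumes.items():
    gb = elems * BYTES / 1e9
    print(f"{stage}: ~{gb:.0f} GB moved per GPU per step, "
          f"~{1000 * gb / BANDWIDTH_GBS:.0f} ms at {BANDWIDTH_GBS} GB/s")
```

Raw volume alone does not explain the full measured gap; launch overhead, blocking on many fine-grained collectives, and imperfectly achieved bandwidth contribute as well, and on a slower PCIe or network fabric the same volumes stretch into hundreds of milliseconds or more per step.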

Batch Size Impact on ZeRO Performance

Note: Batch size heavily affects ZeRO stage performance characteristics, with larger batches favoring ZeRO-3 and smaller batches favoring ZeRO-0/ZeRO-2. Lock in your batch size before you start experimenting with ZeRO stages, then tune from there.
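For concreteness, the effective batch size that drives the table below is just the product of micro-batch, gradient-accumulation steps, and GPU count; the splits shown here are illustrative (only the effective batch of 144 comes from our runs):

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    """Sequences contributing to each optimizer step."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus

print(effective_batch_size(2, 9, 8))   # -> 144, the setting used in this article
print(effective_batch_size(8, 8, 8))   # -> 512, where ZeRO-3 overhead amortizes best
```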

Communication Amortization Effects

| Effective Batch Size | ZeRO-0 Performance | ZeRO-2 Performance | ZeRO-3 Performance | Optimal Choice |
| --- | --- | --- | --- | --- |
| 32-64 | 100% (baseline) | 92% | 72% | ZeRO-0 |
| 96-144 | 100% (baseline) | 95% | 78% | ZeRO-2 |
| 192-384 | 100% (baseline) | 96% | 85% | ZeRO-2/ZeRO-3 |
| 512+ | 100% (baseline) | 97% | 89% | ZeRO-3 viable |

Why Batch Size Matters:

  1. Communication Frequency: Larger batches reduce the frequency of parameter all-gather operations in ZeRO-3, amortizing communication overhead across more computation.

  2. Memory Pressure: Larger batches increase activation memory, making ZeRO-3's parameter sharding more beneficial even with communication costs.

  3. Pipeline Efficiency: Higher batch sizes improve GPU utilization, making the relative cost of ZeRO-3's communication overhead less significant, provided you have enough memory.

Cost-Performance Trade-offs

For production AI workloads, the optimal ZeRO stage depends on the specific use case:

ZeRO-0: Best for models ≤7B parameters where memory permits

  • Highest throughput
  • Minimal implementation complexity
  • Limited scalability

ZeRO-2: Optimal sweet spot for 9-13B models

  • Good throughput retention (95%)
  • 40% memory reduction
  • Moderate complexity

ZeRO-3: Essential for models >13B parameters

  • Enables training of models impossible with ZeRO-0/2
  • Significant memory savings (65% reduction)
  • Acceptable throughput once amortized over the much larger models it enables

Implications for Large-Scale AI Systems

Model Architecture Considerations

Large-scale models present unique challenges for ZeRO optimization:

  1. Parameter Distribution: Models with large embedding layers or dense feed-forward networks benefit more from ZeRO-3's parameter sharding capabilities.

  2. Sequence Length Scaling: Our analysis shows that ZeRO-3 becomes increasingly advantageous as sequence lengths exceed 1024 tokens, common in document processing, code generation, and long-context applications.

  3. Model Capacity vs Performance: The parameter efficiency gains from ZeRO-3 enable training larger models that capture more complex patterns, improving performance on downstream tasks.

Recommendations

Based on this analysis, we recommend the following ZeRO selection criteria:

For Research and Development:

  • Models ≤7B: ZeRO-0 for maximum iteration speed
  • Models 7-13B: ZeRO-2 for balanced performance
  • Models >13B: ZeRO-3 (only viable option)

For Production Training:

  • Established Architectures: ZeRO-2 for reliability and performance
  • Large-Scale Experiments: ZeRO-3 for maximum model capacity
  • Multi-Node Setups: ZeRO-3 with careful communication optimization

Hardware-Specific Guidance:

  • H100/A100 with NVSwitch: ZeRO-3 communication overhead is manageable
  • PCIe-only Systems: Prefer ZeRO-2 unless memory-constrained
  • Mixed Precision: Always enable bfloat16 on modern hardware

Decision Tree: A Systematic Approach

1. Can your model fit in memory with standard DDP?
   ├── YES: Consider ZeRO-0 for maximum speed
   └── NO: → Go to 2

2. Can your model fit with optimizer/gradient sharding (ZeRO-2)?
   ├── YES: Consider ZeRO-2 for balanced performance
   └── NO: → Must use ZeRO-3

3. Is this a production deployment or research exploration?
   ├── PRODUCTION: Prefer ZeRO-2 for reliability
   └── RESEARCH: Consider ZeRO-3 for larger model experiments

4. How critical is training speed vs model capacity?
   ├── SPEED CRITICAL: ZeRO-0 > ZeRO-2 > ZeRO-3
   └── CAPACITY CRITICAL: ZeRO-3 > ZeRO-2 > ZeRO-0

5. What's your communication bandwidth?
   ├── HIGH (NVSwitch/InfiniBand): ZeRO-3 viable
   └── LIMITED (PCIe): Avoid ZeRO-3 unless necessary
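
The same decision tree can be written down as a small helper; the inputs are judgment calls you make about your model, hardware, and goals, and the mapping is a sketch of the guidance above rather than a hard rule:

```python
def choose_zero_stage(fits_with_ddp: bool,
                      fits_with_zero2: bool,
                      production: bool,
                      high_bandwidth_fabric: bool) -> int:
    """Encode the decision tree above. Returns a suggested ZeRO stage (0, 2, or 3)."""
    if fits_with_ddp:
        return 0                          # maximum speed, no sharding needed
    if fits_with_zero2:
        if production or not high_bandwidth_fabric:
            return 2                      # reliable, well-understood middle ground
        return 3                          # research on a fast fabric: buy capacity headroom
    return 3                              # only viable option once ZeRO-2 runs out of memory

# Example: 9B continued pretraining on 8x H100 with NVSwitch, production run
print(choose_zero_stage(fits_with_ddp=False, fits_with_zero2=True,
                        production=True, high_bandwidth_fabric=True))   # -> 2
```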

Conclusion

ZeRO optimization represents a shift in how we approach large-scale training of our SLMs. The choice of ZeRO stage significantly impacts both technical feasibility and economic efficiency.

Our analysis demonstrates that while ZeRO-3 introduces communication overhead, its memory efficiency gains enable training complex models that would otherwise be impossible. For production SLMs, this trade-off often favors ZeRO-3, particularly when combined with modern high-bandwidth interconnects like NVSwitch, whether from your cloud provider or on local hardware.


References

  1. Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). ZeRO: Memory optimizations for distributed deep learning training. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). Microsoft Research | arXiv:1910.02054

  2. Ren, J., Rajbhandari, S., Aminabadi, R. Z., Ruwase, O., Yang, S., Zhang, M., ... & He, Y. (2021). ZeRO-offload: Democratizing billion-scale model training. 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX | arXiv:2101.06840

  3. Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Z., Awan, A. A., ... & He, Y. (2022). DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. Proceedings of the 39th International Conference on Machine Learning (ICML 2022). Microsoft Research | arXiv:2201.05596

  4. DeepSpeed Documentation. https://www.deepspeed.ai/

  5. NVIDIA H100 Tensor Core GPU Architecture. NVIDIA Technical Brief

