# llm_memory_visualizer / details.py
DETAILS = """
### Motivation
Existing tools like the [Hugging Face Model Memory Estimator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage), [DeepSpeed Calculator](https://huggingface.co/spaces/andstor/deepspeed-model-memory-usage), and [DeepSpeed Native Utility](https://deepspeed.readthedocs.io/en/latest/memory.html) are valuable but don't support the full range of modern training configurations.
This tool adds:
- Arbitrary model configurations beyond preset architectures
- FSDP and 5D parallelism support
- Interactive memory breakdowns by category to inform configuration decisions
### References
Helpful resources used while building this:
- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)
- [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198)
- [Transformer Math - Michael Wornow](https://michaelwornow.net/2024/01/18/counting-params-in-transformer)
- [Transformer Math 101](https://blog.eleuther.ai/transformer-math/)
"""
INSTRUCTIONS = """
This calculator estimates a coarse upper bound on the memory used per GPU during training (excluding intermediates).
## How to Use
1. Use a preset, or adjust the parallelism, model, and training panels to match your run.
2. Press **Calculate** to refresh the memory breakdown chart.
3. Review the details and references below for context on the estimates.
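As a rough intuition for what the chart reports, the model-state part of the estimate boils down to a per-parameter byte count divided across the parallelism degrees. The sketch below is illustrative only (the function name, arguments, and sharding behaviour are assumptions, not this app's actual code); it assumes BF16 mixed precision with an FP32 master copy and Adam, as listed in the Limitations panel.
```python
# Illustrative only: assumes BF16 weights/grads, FP32 master weights, Adam
# moments in FP32, and that weight/gradient/optimizer states are fully
# sharded across the listed parallelism degrees (as with FSDP / ZeRO-3).
def rough_model_state_bytes_per_gpu(n_params: float, dp: int = 1, tp: int = 1, pp: int = 1) -> float:
    bytes_per_param = 2 + 2 + 4 + 8  # BF16 weights + BF16 grads + FP32 master + Adam moments
    return bytes_per_param * n_params / (dp * tp * pp)

# Example: a 7B-parameter model fully sharded over 8 GPUs
# rough_model_state_bytes_per_gpu(7e9, dp=8) -> ~14 GB of model states per GPU
```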
"""
LIMITATIONS = """
### Key Assumptions:
- Standard transformer architecture with homogeneous layers
- Adam optimizer
- Mixed precision keeps a full-precision master copy of the weights
- Tensor parallelism includes sequence parallelism
- Pipeline parallelism maintains roughly constant activation memory per GPU due to its schedule
### Not Currently Supported:
- Non-standard architectures (alternating dense/sparse layers, custom attention)
- Multi-modal models with vision layers
- Non-homogeneous parameter dtypes (e.g. BF16 & MXFP4 in GPT-OSS); mixed precision itself is supported
- Kernel/framework overhead and intermediate memory
For advanced configurations, results should be validated against profiling.
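As a rough illustration of the activation accounting implied by the tensor + sequence parallelism assumption, the sketch below follows the per-layer estimate from [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198); the function and variable names are illustrative, and this app's implementation may differ.
```python
# Illustrative only: per-layer activation memory with no recomputation,
# following Korthikanti et al. (2022), with 16-bit activations.
# s: sequence length, b: micro-batch size, h: hidden size,
# a: attention heads, t: tensor-parallel size (with sequence parallelism).
def activation_bytes_per_layer(s: int, b: int, h: int, a: int, t: int = 1) -> float:
    return s * b * h * (34 + 5 * a * s / h) / t

# Example: s=4096, b=1, h=4096, a=32, t=8 -> ~0.4 GB of activations per layer
```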
"""