# llm_memory_visualizer / details.py
DETAILS = """
### Motivation
Existing tools like the [Hugging Face Model Memory Estimator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage), [DeepSpeed Calculator](https://huggingface.co/spaces/andstor/deepspeed-model-memory-usage), and [DeepSpeed Native Utility](https://deepspeed.readthedocs.io/en/latest/memory.html) are valuable but don't support the full range of modern training configurations.
This tool adds:
- Arbitrary model configurations beyond preset architectures
- FSDP and 5D parallelism support
- Interactive memory breakdowns by category to inform configuration decisions
### References
Helpful resources used while building this:
- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)
- [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198)
- [Transformer Math - Michael Wornow](https://michaelwornow.net/2024/01/18/counting-params-in-transformer)
- [Transformer Math 101](https://blog.eleuther.ai/transformer-math/)
"""
INSTRUCTIONS = """
This calculator estimates a coarse upper bound on the memory used per GPU during training (excluding intermediates).
## How to Use
1. Use a preset, or adjust the parallelism, model, and training panels to match your run.
2. Press **Calculate** to refresh the memory breakdown chart.
3. Review the details and references below for context on the estimates.
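As a rough intuition for what the chart reports, the model-state part of the estimate boils down to a per-parameter byte count divided across the parallelism degrees. The sketch below is illustrative only (the function name, arguments, and sharding behaviour are assumptions, not this app's actual code); it assumes BF16 mixed precision with an FP32 master copy and Adam, as listed in the Limitations panel.
```python
# Illustrative only: assumes BF16 weights/grads, FP32 master weights, Adam
# moments in FP32, and that weight/gradient/optimizer states are fully
# sharded across the listed parallelism degrees (as with FSDP / ZeRO-3).
def rough_model_state_bytes_per_gpu(n_params: float, dp: int = 1, tp: int = 1, pp: int = 1) -> float:
    bytes_per_param = 2 + 2 + 4 + 8  # BF16 weights + BF16 grads + FP32 master + Adam moments
    return bytes_per_param * n_params / (dp * tp * pp)

# Example: a 7B-parameter model fully sharded over 8 GPUs
# rough_model_state_bytes_per_gpu(7e9, dp=8) -> ~14 GB of model states per GPU
```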
"""
LIMITATIONS = """
### Key Assumptions:
- Standard transformer architecture with homogeneous layers
- Adam optimizer
- Mixed precision keeps a full-precision master copy of the weights
- Tensor parallelism includes sequence parallelism
- Pipeline parallelism maintains roughly constant activation memory per GPU due to its schedule
### Not Currently Supported:
- Non-standard architectures (alternating dense/sparse layers, custom attention)
- Multi-modal models with vision layers
- Non-homogeneous parameter dtypes (e.g. BF16 & MXFP4 in GPT-OSS); mixed precision itself is supported
- Kernel/framework overhead and intermediate memory
For advanced configurations, results should be validated against profiling.
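As a rough illustration of the activation accounting implied by the tensor + sequence parallelism assumption, the sketch below follows the per-layer estimate from [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198); the function and variable names are illustrative, and this app's implementation may differ.
```python
# Illustrative only: per-layer activation memory with no recomputation,
# following Korthikanti et al. (2022), with 16-bit activations.
# s: sequence length, b: micro-batch size, h: hidden size,
# a: attention heads, t: tensor-parallel size (with sequence parallelism).
def activation_bytes_per_layer(s: int, b: int, h: int, a: int, t: int = 1) -> float:
    return s * b * h * (34 + 5 * a * s / h) / t

# Example: s=4096, b=1, h=4096, a=32, t=8 -> ~0.4 GB of activations per layer
```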
"""