How is activation memory calculated?
Hi, thank you so much for releasing this tool. I've already found it extremely useful!
I'm trying to wrap my head around this activation memory calculation.
I have the following setup:
micro_batch_size (b): 1
sequence_length (s): 2048
hidden_size (h): 8192
num_layers (l): 80
num_attention_heads (a): 64
According to https://huggingface.co/spaces/nanotron/ultrascale-playbook, the total activation memory is:
l × s × b × h × (34 + 5 × (a × s) / h)
Multiplying by 2 bytes per value gives us:

2 × l × s × b × h × (34 + 5 × (a × s) / h) ≈ 306 GB
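For reference, here's the arithmetic as a quick Python sketch (variable names are mine, taken from the setup above):

```python
# Sanity check of the activation memory formula with my setup.
b = 1        # micro_batch_size
s = 2048     # sequence_length
h = 8192     # hidden_size
l = 80       # num_layers
a = 64       # num_attention_heads

# Activation element count from the Ultrascale Playbook formula
elements = l * s * b * h * (34 + 5 * a * s / h)

bytes_total = 2 * elements            # 2 bytes per value (bf16/fp16)
print(f"{bytes_total / 1e9:.0f} GB")  # -> 306 GB
```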
But the calculation in the tool shows 58,964 MB, which is ~59 GB.

What am I doing wrong here?
You have ZeRO stage 3 enabled, which shards the parameters across the devices, and you have 16 GPUs; hence, the per-GPU activation allocation will be smaller. And I think the activations are also stored in fp32.
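To put rough numbers on that reasoning (just a sketch, assuming the activations are split evenly across the 16 GPUs and kept in fp32 at 4 bytes per value; the tool may count additional factors):

```python
# Rough sketch of the adjustment described above (assumptions: even
# split across 16 GPUs, fp32 storage at 4 bytes per value).
elements = 80 * 2048 * 1 * 8192 * (34 + 5 * 64 * 2048 / 8192)

per_gpu_fp32 = elements * 4 / 16       # 4 bytes per value, sharded over 16 GPUs
print(f"{per_gpu_fp32 / 1e9:.0f} GB")  # -> ~38 GB per GPU (illustrative only)
```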