How is activation memory calculated?

by trungtvu

Hi, thank you so much for releasing this tool. I've already found it extremely useful!

I'm trying to wrap my head around this activation memory calculation.

I have the following setup:

micro_batch_size (b): 1
sequence_length (s): 2048
hidden_size (h): 8192
num_layers (l): 80
num_attention_heads (a): 64

According to https://huggingface.co/spaces/nanotron/ultrascale-playbook, the total number of activation values is:

l × s × b × h × (34 + 5 × a × s / h)

Multiplying by 2 bytes per value gives us:

2 × l × s × b × h × (34 + 5 × a × s / h) ≈ 306 GB
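For reference, here's a quick Python check of that arithmetic (a sketch using the setup above; GB here means 10^9 bytes):

```python
# Back-of-the-envelope activation estimate from the ultrascale-playbook
# formula: l * s * b * h * (34 + 5 * a * s / h)
b, s, h, l, a = 1, 2048, 8192, 80, 64

values = l * s * b * h * (34 + 5 * a * s / h)  # number of activation values
bytes_total = 2 * values                       # 2 bytes per value (bf16/fp16)
print(f"{values:.3e} values -> {bytes_total / 1e9:.0f} GB")
# ~1.530e+11 values -> 306 GB
```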

But the calculation in the tool shows 58,964 MB, which is ~59 GB.


What am I doing wrong here?

You have ZeRO stage 3, which shards the parameters across the devices, and you have 16 GPUs; hence, the per-GPU activation allocation will be less. And I think the activations are also stored in fp32.
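To make that concrete, here's a rough sketch under those two assumptions (activations split evenly across the 16 GPUs and stored in fp32); the tool's actual accounting may differ:

```python
# Hypothetical per-GPU estimate under the reply's assumptions:
# activations evenly sharded across GPUs and stored in fp32 (4 bytes).
values = 153_008_209_920           # activation count from the formula above
num_gpus = 16
per_gpu_gb = values * 4 / num_gpus / 1e9
print(f"~{per_gpu_gb:.0f} GB per GPU")  # ~38 GB; a rough estimate only
```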
