How is activation memory calculated?

by trungtvu

Hi, thank you so much for releasing this tool. I've already found it extremely useful!

I'm trying to wrap my head around this activation memory calculation.

I have the following setup:

micro_batch_size (b): 1
sequence_length (s): 2048
hidden_size (h): 8192
num_layers (l): 80
num_attention_heads (a): 64

According to https://huggingface.co/spaces/nanotron/ultrascale-playbook, the total number of activation values is:

l × s × b × h × (34 + 5 × a × s / h)

Multiplying by 2 bytes per value gives us:

2 × l × s × b × h × (34 + 5 × a × s / h) ≈ 306 GB
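For reference, here's a quick Python check of that arithmetic (a sketch using the setup above; GB here means 10^9 bytes):

```python
# Back-of-the-envelope activation estimate from the ultrascale-playbook
# formula: l * s * b * h * (34 + 5 * a * s / h)
b, s, h, l, a = 1, 2048, 8192, 80, 64

values = l * s * b * h * (34 + 5 * a * s / h)  # number of activation values
bytes_total = 2 * values                       # 2 bytes per value (bf16/fp16)
print(f"{values:.3e} values -> {bytes_total / 1e9:.0f} GB")
# ~1.530e+11 values -> 306 GB
```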

But the calculation in the tool shows 58,964 MB, which is ~59 GB.


What am I doing wrong here?

You have ZeRO stage 3, which shards the parameters across the devices, and you have 16 GPUs; hence, the per-GPU activation allocation will be less. And I think the activations are also stored in fp32.
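To make that concrete, here's a rough sketch under those two assumptions (activations split evenly across the 16 GPUs and stored in fp32); the tool's actual accounting may differ:

```python
# Hypothetical per-GPU estimate under the reply's assumptions:
# activations evenly sharded across GPUs and stored in fp32 (4 bytes).
values = 153_008_209_920           # activation count from the formula above
num_gpus = 16
per_gpu_gb = values * 4 / num_gpus / 1e9
print(f"~{per_gpu_gb:.0f} GB per GPU")  # ~38 GB; a rough estimate only
```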
