VRAM consumption when using GPU (CUDA)

#37
by Sunjay353 - opened

I noticed that VRAM usage increases by roughly the model size when loading the model, which is expected. However, it then increases again by roughly twice the model size during inference, so total VRAM consumption ends up at approximately three times the model size. Furthermore, this additional memory is not released after inference completes; it is only freed when the model is unloaded. Is this normal and expected behavior?
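For anyone trying to pin down where the extra memory goes, here is a minimal sketch (assuming a PyTorch + transformers setup; the model name "gpt2" and the prompt are just placeholders) that separates memory actually held by live tensors from memory PyTorch's CUDA caching allocator merely keeps in reserve. The allocator retains freed activation buffers as "reserved" memory rather than returning them to the driver, which is what tools like nvidia-smi report as used VRAM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def report(stage: str) -> None:
    # memory_allocated: bytes currently held by live tensors
    # memory_reserved: bytes the caching allocator has claimed from the driver
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{stage}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
report("after load")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=64)
report("after inference")

# Freed inference buffers usually stay cached (reserved) rather than being
# returned to the driver; empty_cache() hands unused blocks back.
torch.cuda.empty_cache()
report("after empty_cache")
```

If "allocated" drops back near the model size after inference while "reserved" stays high, the extra usage is the caching allocator holding onto freed blocks, not a leak; torch.cuda.empty_cache() should then shrink what nvidia-smi shows.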
