OutOfMemoryError: CUDA out of memory despite available GPU memory

#58
by humza-sami

I’m encountering an issue with GPU memory allocation while training a GPT-2 model on a GPU with 24 GB of VRAM. Despite having a substantial amount of available memory, I’m receiving the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.68 GiB total capacity; 18.17 GiB already allocated; 64.62 MiB free; 18.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
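For context, this is roughly how I inspect the allocator state and how I would try the max_split_size_mb suggestion from the error message; the 128 MiB value below is only an illustrative guess, not something I have tuned:

```python
import os

# Must be set before the CUDA caching allocator is initialized;
# 128 MiB is an arbitrary example value, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Compare what PyTorch has actually handed out to tensors ("allocated")
# with what the caching allocator is holding on to ("reserved").
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")
print(torch.cuda.memory_summary())
```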

Here are the specifications of my setup and the model training:

GPU: NVIDIA GPU with 24 GB VRAM
Model: GPT-2, approximately 3 GB in size, with roughly 800 million 32-bit parameters
Training Data: 36,000 training examples with a vector length of 600
Training Configuration: 5 epochs, batch size of 16, and fp16 enabled
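In case the training setup matters, here is a minimal sketch of how the run is configured. I am assuming the Hugging Face Trainer API here; the dummy dataset, the "gpt2" checkpoint name, and output_dir are placeholders standing in for my actual data and ~3 GB model:

```python
import torch
from torch.utils.data import Dataset
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

class DummyDataset(Dataset):
    """Stand-in for my real data: 36,000 token sequences of length 600."""
    def __len__(self):
        return 36_000
    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (600,))  # random token ids within GPT-2's vocab
        return {"input_ids": ids, "labels": ids}

model = GPT2LMHeadModel.from_pretrained("gpt2")  # placeholder; my checkpoint is ~3 GB

args = TrainingArguments(
    output_dir="out",                # placeholder path
    num_train_epochs=5,              # 5 epochs
    per_device_train_batch_size=16,  # batch size of 16
    fp16=True,                       # mixed precision enabled
)

Trainer(model=model, args=args, train_dataset=DummyDataset()).train()
```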
These are my calculations:

Model Size:
GPT-2 model: ~3 GB
Parameters: ~800 million parameters of 32 bits each
Gradients:
Gradients are typically of the same size as the model’s parameters.
Batch Size and Training Examples:
Batch Size: 16
Training Examples: 36,000
Vector Length: 600
Memory Allocation per Batch:
Model: 3 GB (unchanged per batch)
Gradients: 3 GB (unchanged per batch)
Input Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch
Output Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch
Based on the above calculations, the memory allocation per batch for my scenario would be approximately (a quick arithmetic check follows the list below):

Model: 3 GB
Gradients: 3 GB
Input and Output Data: 75 KB
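As a sanity check of that arithmetic, here is the same estimate in plain Python, with the parameter count assumed to be ~800 million:

```python
# Rough per-batch memory estimate, mirroring the numbers above.
batch_size, seq_len, bytes_per_value = 16, 600, 4

params = 800_000_000                      # assumption: ~800 million 32-bit parameters
model_bytes = params * 4                  # ~3 GB, consistent with the checkpoint size
grad_bytes = model_bytes                  # gradients roughly mirror the parameters
io_bytes = 2 * batch_size * seq_len * bytes_per_value  # input + output data per batch

print(f"model:     {model_bytes / 1024**3:.2f} GiB")
print(f"gradients: {grad_bytes / 1024**3:.2f} GiB")
print(f"input + output per batch: {io_bytes / 1024:.1f} KiB")  # ~75 KiB
```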
I would appreciate any insights or suggestions on how to resolve this issue. Thank you in advance for your assistance!
