P3 EC2 :: torch.nn DataParallel Error

#42
by AFoudree - opened

I am trying to run the model locally on an AWS P3 EC2 instance.
I am running into issues spreading the model across the NVIDIA GPUs:
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:11<00:00, 2.54it/s]
Let's use 8 GPUs!
Traceback (most recent call last):
  File "./application.py", line 29, in
    model.to(device)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU

Basically, I am looking for a way to use model parallelism, not data parallelism.

I got it figured out. It just took a bit to get the code to spread the model across the GPUs.

Can you share how you did it?
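
The working code was not posted in this thread, so this is only a generic sketch of the idea behind model parallelism, not the method used above. Each layer is placed on exactly one GPU, and placement moves on to the next GPU when the current one is full. The helper `plan_device_map`, the layer names, and all sizes are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def plan_device_map(layer_sizes_gib: Dict[str, float],
                    gpu_budgets_gib: List[float]) -> Dict[str, int]:
    """Assign consecutive layers to GPUs, moving on when a GPU is full.

    Hypothetical helper: it only plans the placement and touches no GPU.
    """
    device_map: Dict[str, int] = {}
    gpu = 0      # index of the GPU currently being filled
    used = 0.0   # GiB already planned on that GPU
    for name, size in layer_sizes_gib.items():
        # Skip ahead while the layer does not fit on the current GPU.
        while gpu < len(gpu_budgets_gib) and used + size > gpu_budgets_gib[gpu]:
            gpu += 1
            used = 0.0
        if gpu == len(gpu_budgets_gib):
            raise MemoryError(f"layer {name!r} does not fit on the remaining GPUs")
        device_map[name] = gpu
        used += size
    return device_map


# Hypothetical numbers: 8 layers of 3 GiB each, 4 GPUs with 8 GiB usable each.
layers = {f"layer.{i}": 3.0 for i in range(8)}
plan = plan_device_map(layers, [8.0] * 4)
print(plan)
```

With Hugging Face `transformers` and `accelerate` installed, passing `device_map="auto"` to `from_pretrained` computes a placement like this automatically, and a dict that maps module names to device indices can be passed as `device_map` instead. In either case, skip the later `model.to(device)` call, since that would try to move every layer onto a single GPU again.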
