meta-llama/Meta-Llama-3-70B-Instruct · P3 EC2 :: torch.nn DataParallel Error

May 3, 2024

I am attempting to run the model locally on an AWS P3 EC2 instance.
I am running into issues spreading the model across the NVIDIA GPUs:
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:11<00:00, 2.54it/s]
Let's use 8 GPUs!
Traceback (most recent call last):
  File "./application.py", line 29, in
    model.to(device)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "./.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU
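
The failure happens inside `model.to(device)`, which tries to place every parameter on one device. A rough back-of-the-envelope check shows why that cannot work for a model this size. This is only a sketch: the parameter count (about 70 billion), the 16-bit weights, and the per-GPU memory values (16 or 32 GiB) are assumptions, since the post does not say which P3 instance size is in use, and activations and the KV cache are ignored.

```python
def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


def fits(n_params: float, bytes_per_param: int, n_gpus: int, per_gpu_gib: float) -> bool:
    """True if the weights alone fit in the combined memory of n_gpus."""
    return weights_gib(n_params, bytes_per_param) <= n_gpus * per_gpu_gib


N_PARAMS = 70e9        # assumption: roughly 70 billion parameters
BYTES_PER_PARAM = 2    # assumption: float16 weights

print(f"weights: {weights_gib(N_PARAMS, BYTES_PER_PARAM):.1f} GiB")
for per_gpu in (16, 32):  # assumption: candidate per-GPU memory sizes
    print(
        f"{per_gpu} GiB per GPU:",
        "1 GPU fits =", fits(N_PARAMS, BYTES_PER_PARAM, 1, per_gpu),
        "| 8 GPUs fit =", fits(N_PARAMS, BYTES_PER_PARAM, 8, per_gpu),
    )
```

Under these assumptions the weights alone come to roughly 130 GiB, so no single GPU can hold them, and whether 8 GPUs can depends on the memory per GPU.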

AFoudree

May 3, 2024

Basically, I am looking for a way to use model parallelism, not data parallelism.
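
For anyone reading along, one common way to get model parallelism with this kind of checkpoint is to let `transformers` (with `accelerate` installed) place layers on different GPUs via `device_map="auto"`. The sketch below is not necessarily what was used in this thread. The helper names `build_load_kwargs` and `load_sharded` are hypothetical, the `n_gpus` and `per_gpu_gib` values are placeholders to adjust to the actual instance, and `float16` is an assumption.

```python
def build_load_kwargs(n_gpus: int, per_gpu_gib: int) -> dict:
    """Keyword arguments asking from_pretrained to shard layers across GPUs.

    Hypothetical helper: max_memory caps what each GPU index may receive,
    leaving headroom for activations.
    """
    return {
        "device_map": "auto",  # requires the accelerate package
        "max_memory": {i: f"{per_gpu_gib}GiB" for i in range(n_gpus)},
    }


def load_sharded(model_id: str, n_gpus: int, per_gpu_gib: int):
    """Load tokenizer and model with layers spread over the available GPUs."""
    # Imported here so build_load_kwargs can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # assumption: 16-bit weights
        **build_load_kwargs(n_gpus, per_gpu_gib),
    )
    return tokenizer, model


if __name__ == "__main__":
    # Placeholder values; adjust to the GPUs actually present.
    print(build_load_kwargs(n_gpus=8, per_gpu_gib=14))
```

With a model loaded this way, it should not be wrapped in `torch.nn.DataParallel` or moved with `model.to(device)` afterwards, since both work against the per-layer placement.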

AFoudree

May 5, 2024

I got it figured out. It just took a bit to get the code to spread the model across the GPUs.

sagarStormborn

May 7, 2024

Can you share how you did it?