Out of memory with 24 GB VRAM on Runpod?

#3 opened by nichedreams

Hey Bloke, I'm trying to run this on your runpod with a 3090, but I get the following error:
CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.69 GiB total capacity; 22.64 GiB already allocated; 41.81 MiB free; 22.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any idea why that is?
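
(For reference: the max_split_size_mb hint in that traceback is a generic PyTorch allocator setting, not anything specific to this repo. Below is a minimal sketch of applying it, assuming you can set an environment variable before the webui starts; it only reduces fragmentation and will not make an oversized model fit.)

```python
# Hedged sketch: apply the allocator setting the error message suggests.
# It must be set before PyTorch initialises CUDA, so export it in the shell
# or set it at the very top of the launch script, before importing torch.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the variable is set

print(torch.cuda.is_available())
```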

Hmm, that's odd. Normally a 30/33B GPTQ with no groupsize will just fit into 24GB VRAM at max context length, i.e. around 2000 tokens returned.

What were you doing at the time?

I wasn't really doing anything; after it finally downloaded I just selected it from the model list and set the wbits and model type as per the instructions. After clicking Save and then Reload I got the error.

Oh wait, I just noticed what repo we're in :P

Yes, a 65B will not load on a single 24GB card! A 30/33B will. A 65B needs 2 x 24GB cards. Or else you could offload half the layers to the CPU, but it will be slow as hell.
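
(A rough size estimate makes the point clear; these are approximate numbers added for illustration, not figures from the thread.)

```python
# Back-of-the-envelope VRAM needed just for 4-bit GPTQ weights, ignoring the
# KV cache and CUDA overhead, and using approximate parameter counts.
GIB = 1024 ** 3
for params_billion in (33, 65):
    weight_bytes = params_billion * 1e9 * 0.5   # 4 bits = 0.5 bytes per weight
    print(f"{params_billion}B -> ~{weight_bytes / GIB:.1f} GiB of weights")
# 33B -> ~15.4 GiB: fits a 24 GiB card with headroom for context
# 65B -> ~30.3 GiB: already over 24 GiB before any context at all
```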

nichedreams changed discussion status to closed

Can you load this 65B model on 2 x 3090 with the oobabooga webui?

Yes, you can. You should be able to achieve this with something like --pre_layer 35 55 to put 35 layers on GPU 1 and 55 on GPU 2. You need to put fewer layers on GPU 1 as it also has to hold the context.

I don't know the exact numbers so you will need to experiment. But it is possible.
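
(For concreteness, a hedged sketch of what that launch could look like, run from the text-generation-webui directory. The model folder name and the 35/55 split are illustrative placeholders, and --wbits 4 / --model_type llama are assumed from the usual GPTQ setup mentioned above.)

```python
# Hedged sketch: launch text-generation-webui with a 65B GPTQ model split
# across two 24 GB cards via --pre_layer. Adjust the placeholders and tune
# the layer split until both cards fit; fewer layers go on the card that
# also holds the context, as noted above.
import subprocess

subprocess.run(
    [
        "python", "server.py",
        "--model", "your-65B-GPTQ-model-dir",  # placeholder folder under models/
        "--wbits", "4",
        "--model_type", "llama",
        "--pre_layer", "35", "55",  # layers on the first and second GPU
    ],
    check=True,
)
```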

I had the same problem. In my case everything was OK except the chosen GPU: an A4000 on RunPod. When I switched to a different GPU the problem disappeared :) I was benchmarking some small LLMs.
