OOM under vLLM even with 80GB GPU

#2
by mike-ravkine - opened

Attempting to load this model with vLLM on an A100-80GB gives me:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 0 has a total capacty of 79.18 GiB of which 302.31 MiB is free. Process 
18600 has 78.88 GiB memory in use. Of the allocated memory 69.30 GiB is allocated by PyTorch, and 522.41 MiB is reserved by PyTorch but unallocated. If 
reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and 
PYTORCH_CUDA_ALLOC_CONF

Has anyone had any luck with this thing?
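For reference, the allocator hint in that traceback can be applied through an environment variable before vLLM/PyTorch initializes CUDA; a minimal sketch (the 512 MiB value is an arbitrary example, and this only mitigates fragmentation rather than the overall shortfall):

import os

# Must be set before torch initializes the CUDA allocator, i.e. before importing vllm/torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

from vllm import LLM  # import only after setting the variable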

Set gpu_memory_utilization to 0.5?
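Presumably that means the gpu_memory_utilization argument when constructing the engine; a rough sketch (model path taken from this thread):

from vllm import LLM

llm = LLM(
  model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ',
  gpu_memory_utilization=0.5,  # cap vLLM's share of GPU memory
)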

That does not work; it doesn't seem to change anything.

The model loads correctly for me without OOM errors, but a standard prompt (which works fine with vLLM and openchat) fails because of the wrong prompt template.
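For what it's worth, Mixtral-8x7B-Instruct expects the [INST]-style template rather than the openchat format; a minimal sketch of wrapping a user message that way (the helper name is just for illustration):

def to_mixtral_prompt(user_message: str) -> str:
    # Mixtral-Instruct format: the instruction goes between [INST] and [/INST];
    # the tokenizer normally prepends the <s> BOS token itself
    return f"[INST] {user_message} [/INST]"

prompt = to_mixtral_prompt("What is the capital of France?")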

On an A100 40GB device, I managed to get the model loaded with the following vLLM setup:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.1, top_p=0.95)
model = 'TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ'

llm = LLM(
  model=model,
  gpu_memory_utilization=0.7,  # leave headroom; higher values OOMed on 40GB
  max_model_len=2048,          # shrink the KV cache by capping context length
)
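Generation with that engine then looks roughly like this (the prompt text is just a placeholder):

outputs = llm.generate(["[INST] Write a haiku about GPUs [/INST]"], sampling_params)
print(outputs[0].outputs[0].text)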

Increasing gpu_memory_utilization caused CUDA OOM errors, and decreasing it resulted in a different error:

ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

However, now that I have the model running, I'm encountering the bug discussed in #3.
