OOM under vLLM even with 80GB GPU

#2
by mike-ravkine - opened

Attempting to load this model with vLLM on an A100-80GB gives me:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 0 has a total capacty of 79.18 GiB of which 302.31 MiB is free. Process 
18600 has 78.88 GiB memory in use. Of the allocated memory 69.30 GiB is allocated by PyTorch, and 522.41 MiB is reserved by PyTorch but unallocated. If 
reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and 
PYTORCH_CUDA_ALLOC_CONF

Has anyone had any luck with this thing?
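For reference, the allocator hint in that traceback can be applied through an environment variable before vLLM/PyTorch initializes CUDA; a minimal sketch (the 512 MiB value is an arbitrary example, and this only mitigates fragmentation rather than the overall shortfall):

import os

# Must be set before torch initializes the CUDA allocator, i.e. before importing vllm/torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

from vllm import LLM  # import only after setting the variable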

Set gpu_memory_utilization to 0.5?
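Presumably that means the gpu_memory_utilization argument when constructing the engine; a rough sketch (model path taken from this thread):

from vllm import LLM

llm = LLM(
  model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ',
  gpu_memory_utilization=0.5,  # cap vLLM's share of GPU memory
)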

That does not work; it doesn't seem to change anything.

The model loads correctly for me without OOM errors, but a standard prompt (which works fine with vLLM and openchat) fails because of the wrong prompt template.
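For what it's worth, Mixtral-8x7B-Instruct expects the [INST]-style template rather than the openchat format; a minimal sketch of wrapping a user message that way (the helper name is just for illustration):

def to_mixtral_prompt(user_message: str) -> str:
    # Mixtral-Instruct format: the instruction goes between [INST] and [/INST];
    # the tokenizer normally prepends the <s> BOS token itself
    return f"[INST] {user_message} [/INST]"

prompt = to_mixtral_prompt("What is the capital of France?")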

On an A100 40GB device, I managed to get the model loaded with the following vLLM setup:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.1, top_p=0.95)
model = 'TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ'

llm = LLM(
  model=model,
  gpu_memory_utilization=0.7,  # leave headroom; higher values OOMed on 40GB
  max_model_len=2048,          # shrink the KV cache by capping context length
)
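Generation with that engine then looks roughly like this (the prompt text is just a placeholder):

outputs = llm.generate(["[INST] Write a haiku about GPUs [/INST]"], sampling_params)
print(outputs[0].outputs[0].text)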

Increasing gpu_memory_utilization caused CUDA OOM errors, and decreasing it resulted in a different error:

ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

However, now that I have the model running, I'm encountering the bug discussed in #3.
