Help: CUDA Out of Memory. Hardware Requirements. vLLM and FastChat

#44
by zebfreeman

I am trying to load Mixtral-8x7B on my local machine to run inference. I am using vLLM to serve the model:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --load-format safetensors --enforce-eager --worker-use-ray --gpu-memory-utilization .95
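
For reference, the equivalent configuration through vLLM's offline Python API looks roughly like this (an untested sketch; I'm assuming the CLI flags map directly onto the LLM constructor arguments):

# Rough Python-API equivalent of the CLI invocation above (sketch, not verified).
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,         # split the model across both GPUs
    gpu_memory_utilization=0.95,    # same as --gpu-memory-utilization .95
    enforce_eager=True,             # same as --enforce-eager
    load_format="safetensors",      # same as --load-format safetensors
)

outputs = llm.generate(
    ["Explain retrieval-augmented generation in one sentence."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)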

I have also tried FastChat:
python3 -m fastchat.serve.model_worker --model-path zeb-7b-v1.4 --model-name zeb --num-gpus 2 --cpu-offloading
as well as trying the --load-8bit flag.

Neither approach worked. vLLM kills the terminal just as the model finishes downloading its weights, and FastChat produces this error while loading the last few checkpoint shards:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacty of 47.99 GiB of which 32.88 GiB is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 13.22 GiB is allocated by PyTorch, and 17.69 MiB is reserved by PyTorch but unallocated.
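
To sanity-check what the GPUs actually report, I run a quick snippet like this (just a sketch using torch.cuda.mem_get_info):

# Print free/total memory on each GPU before loading the model.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values are in bytes
    print(f"GPU {i} ({torch.cuda.get_device_name(i)}): "
          f"{free / 1024**3:.1f} GiB free / {total / 1024**3:.1f} GiB total")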

My desktop consists of:
GPU: 2x RTX 6000 Ada, 96 GB VRAM total (48 GB each)
Memory: 128 GB RAM
Storage: 1 TB NVMe SSD
CPU: Intel i7
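
For context, here is my rough back-of-the-envelope estimate of what the unquantized weights alone need (assuming roughly 46.7B total parameters for Mixtral-8x7B in 16-bit precision, per the model card; KV cache, activations, and CUDA context come on top of this):

# Back-of-the-envelope memory estimate for Mixtral-8x7B in fp16/bf16.
# 46.7e9 total parameters is my assumption based on the model card.
params = 46.7e9
bytes_per_param = 2                                 # fp16/bf16
weight_gib = params * bytes_per_param / 1024**3
print(f"Weights alone: ~{weight_gib:.0f} GiB")      # ~87 GiB
print(f"Per GPU at tensor parallel 2: ~{weight_gib / 2:.0f} GiB of 48 GiB")

If that math is right, the weights alone nearly fill each 48 GB card even when split across both GPUs, which I suspect is why loading tips over.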

The answers in other posts are confusing. Is this not enough VRAM or RAM? What would I need to upgrade to run Mixtral? I don't want to use a quantized model. What are the minimum VRAM and RAM requirements to download and serve the model for my RAG application? Are there better model-serving options?
