Serving Mixtral-8x7B as an API with vLLM on 2 x A6000
I have 2 x A6000 GPUs (96 GB VRAM in total). I am running the vllm/vllm-openai Docker image and initialising it with:
Docker options: --runtime nvidia --gpus all -v ./workspace:/root/.cache/huggingface -p 8000:8000 --ipc=host
OnStart script: python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --max-model-len 8000
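For reference, this is the full command those two pieces assemble into (a sketch; I am assuming the vllm/vllm-openai image's default entrypoint launches the OpenAI API server, so the engine arguments go straight after the image name):

# combines the Docker options and OnStart script above;
# assumes the image entrypoint is the API server
docker run --runtime nvidia --gpus all \
    -v ./workspace:/root/.cache/huggingface \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --max-model-len 8000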
I tried reducing the model sequence length, but I am still unable to fit it in 96 GB; it runs fine with 128 GB. I suspect the underlying problem is that Mixtral-8x7B has about 46.7B parameters, so the bfloat16 weights alone take roughly 93 GB across the two GPUs, leaving almost nothing for the KV cache.
Can anyone advise what needs to be done to make it work? Do I need to use quantization? I know vLLM has had issues with AWQ and Mixtral on multi-GPU.
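If quantization is the answer, this is roughly the invocation I would try (a sketch; TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ is a community AWQ checkpoint I have not verified, and as far as I know AWQ in vLLM needs --dtype half):

# assumption: community AWQ checkpoint, not verified on multi-GPU
python3 -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
    --quantization awq \
    --dtype half \
    --tensor-parallel-size 2 \
    --max-model-len 8000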
Below is the log from the vLLM start:
Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
INFO 01-05 10:44:10 llm_engine.py:275] # GPU blocks: 0, # CPU blocks: 4096
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/vllm/entrypoints/openai/api_server.py", line 737, in
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/workspace/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/workspace/vllm/engine/async_llm_engine.py", line 273, in init
self.engine = self._init_engine(*args, **kwargs)
File "/workspace/vllm/engine/async_llm_engine.py", line 318, in _init_engine
return engine_class(*args, **kwargs)
File "/workspace/vllm/engine/llm_engine.py", line 114, in init
self._init_cache()
File "/workspace/vllm/engine/llm_engine.py", line 279, in _init_cache
raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization
when initializing the engine.
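The error itself points at gpu_memory_utilization (which defaults to 0.90), so the next thing I plan to try is raising it together with a shorter context (a sketch; given the weight-size arithmetic above, even 0.95 may not leave enough headroom on 96 GB):

# gpu-memory-utilization defaults to 0.90; 0.95 may still be too tight
python3 -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.95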