Errors when serving with vLLM
#1 by nielsrolf - opened
When I try to run vllm serve unsloth/Qwen2.5-32B-Instruct-bnb-4bit, I get the following error:
root@8bcffee8e088:/vllm-workspace# vllm serve unsloth/Qwen2.5-32B-Instruct-bnb-4bit
INFO 11-29 10:35:16 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-29 10:35:16 api_server.py:586] args: Namespace(subparser='serve', model_tag='unsloth/Qwen2.5-32B-Instruct-bnb-4bit', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='unsloth/Qwen2.5-32B-Instruct-bnb-4bit', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7c8cf3564220>)
INFO 11-29 10:35:16 api_server.py:175] Multiprocessing frontend to use ipc:///tmp/3b4d6f80-df17-4966-9518-c46e9f5c1d8c for IPC Path.
INFO 11-29 10:35:16 api_server.py:194] Started engine process with PID 3771
config.json: 100%|████████████████████████████████████████| 1.19k/1.19k [00:00<00:00, 3.64MB/s]
INFO 11-29 10:35:24 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-29 10:35:24 config.py:428] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 11-29 10:35:24 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
tokenizer_config.json: 100%|████████████████████████████████████████| 7.51k/7.51k [00:00<00:00, 16.2MB/s]
vocab.json: 100%|████████████████████████████████████████| 2.78M/2.78M [00:00<00:00, 7.47MB/s]
merges.txt: 100%|████████████████████████████████████████| 1.67M/1.67M [00:00<00:00, 16.2MB/s]
tokenizer.json: 100%|████████████████████████████████████████| 7.03M/7.03M [00:00<00:00, 16.8MB/s]
added_tokens.json: 100%|████████████████████████████████████████| 632/632 [00:00<00:00, 2.32MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████| 613/613 [00:00<00:00, 2.30MB/s]
INFO 11-29 10:35:30 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-29 10:35:30 config.py:428] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 11-29 10:35:30 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-29 10:35:30 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='unsloth/Qwen2.5-32B-Instruct-bnb-4bit', speculative_config=None, tokenizer='unsloth/Qwen2.5-32B-Instruct-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Qwen2.5-32B-Instruct-bnb-4bit, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
generation_config.json: 100%|████████████████████████████████████████| 243/243 [00:00<00:00, 776kB/s]
INFO 11-29 10:35:31 selector.py:135] Using Flash Attention backend.
INFO 11-29 10:35:32 model_runner.py:1072] Starting to load model unsloth/Qwen2.5-32B-Instruct-bnb-4bit...
INFO 11-29 10:35:33 weight_utils.py:243] Using model weights format ['*.safetensors']
model-00001-of-00004.safetensors: 100%|████████████████████████████████████████| 4.93G/4.93G [00:09<00:00, 521MB/s]
model-00002-of-00004.safetensors: 100%|████████████████████████████████████████| 4.96G/4.96G [00:10<00:00, 464MB/s]
model-00003-of-00004.safetensors: 100%|████████████████████████████████████████| 5.00G/5.00G [00:09<00:00, 524MB/s]
model-00004-of-00004.safetensors: 100%|████████████████████████████████████████| 4.32G/4.32G [00:09<00:00, 458MB/s]
model.safetensors.index.json: 100%|████████████████████████████████████████| 280k/280k [00:00<00:00, 1.99MB/s]
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
ERROR 11-29 10:36:14 engine.py:366] 'layers.0.mlp.down_proj.weight.absmax'
ERROR 11-29 10:36:14 engine.py:366] Traceback (most recent call last):
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-29 10:36:14 engine.py:366] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-29 10:36:14 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 11-29 10:36:14 engine.py:366] return cls(ipc_path=ipc_path,
ERROR 11-29 10:36:14 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 11-29 10:36:14 engine.py:366] self.engine = LLMEngine(*args, **kwargs)
ERROR 11-29 10:36:14 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 347, in __init__
ERROR 11-29 10:36:14 engine.py:366] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 11-29 10:36:14 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 36, in __init__
ERROR 11-29 10:36:14 engine.py:366] self._init_executor()
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
ERROR 11-29 10:36:14 engine.py:366] self.driver_worker.load_model()
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 152, in load_model
ERROR 11-29 10:36:14 engine.py:366] self.model_runner.load_model()
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1074, in load_model
ERROR 11-29 10:36:14 engine.py:366] self.model = get_model(vllm_config=self.vllm_config)
ERROR 11-29 10:36:14 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
ERROR 11-29 10:36:14 engine.py:366] return loader.load_model(vllm_config=vllm_config)
ERROR 11-29 10:36:14 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 334, in load_model
ERROR 11-29 10:36:14 engine.py:366] model.load_weights(self._get_all_weights(model_config, model))
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 503, in load_weights
ERROR 11-29 10:36:14 engine.py:366] loader.load_weights(weights)
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 229, in load_weights
ERROR 11-29 10:36:14 engine.py:366] autoloaded_weights = list(self._load_module("", self.module, weights))
ERROR 11-29 10:36:14 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 190, in _load_module
ERROR 11-29 10:36:14 engine.py:366] yield from self._load_module(prefix,
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 175, in _load_module
ERROR 11-29 10:36:14 engine.py:366] module_load_weights(weights)
ERROR 11-29 10:36:14 engine.py:366] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 371, in load_weights
ERROR 11-29 10:36:14 engine.py:366] param = params_dict[name]
ERROR 11-29 10:36:14 engine.py:366] ~~~~~~~~~~~^^^^^^
ERROR 11-29 10:36:14 engine.py:366] KeyError: 'layers.0.mlp.down_proj.weight.absmax'
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 347, in __init__
self.model_executor = executor_class(vllm_config=vllm_config, )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 36, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
self.driver_worker.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 152, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1074, in load_model
self.model = get_model(vllm_config=self.vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
return loader.load_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 334, in load_model
model.load_weights(self._get_all_weights(model_config, model))
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 503, in load_weights
loader.load_weights(weights)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 229, in load_weights
autoloaded_weights = list(self._load_module("", self.module, weights))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 190, in _load_module
yield from self._load_module(prefix,
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 175, in _load_module
module_load_weights(weights)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 371, in load_weights
param = params_dict[name]
~~~~~~~~~~~^^^^^^
KeyError: 'layers.0.mlp.down_proj.weight.absmax'
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
[rank0]:[W1129 10:36:15.022498648 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Task exception was never retrieved
future: <Task finished name='Task-4' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Task exception was never retrieved
future: <Task finished name='Task-5' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/zmq/_future.py", line 400, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/scripts.py", line 195, in main
args.dispatch_function(args)
File "/usr/local/lib/python3.12/dist-packages/vllm/scripts.py", line 41, in serve
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
If anyone knows how to fix this, I'd appreciate it.
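
For what it's worth, the engine config line above shows quantization=bitsandbytes but load_format=LoadFormat.AUTO, so the pre-quantized bnb tensors (the weight.absmax entries) seem to be going through the default safetensors loader, which I suspect is where the KeyError comes from. One thing I haven't verified yet, but that might be worth trying, is selecting the bitsandbytes load format explicitly (assuming this vLLM build accepts bitsandbytes as a --load-format value; both flags appear in the args dump at the top):

vllm serve unsloth/Qwen2.5-32B-Instruct-bnb-4bit \
    --quantization bitsandbytes \
    --load-format bitsandbytes

Whether that actually works around the absmax KeyError for this particular unsloth checkpoint, I don't know.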