Issue with {"detail":"Not Found"} Responses for All Requests When Running Model in vLLM


Hi!

I'm running into a persistent issue where every request to a model served with vLLM returns a {"detail":"Not Found"} response. Below are the details of my setup and the error messages I'm receiving.

Setup:

I initiated the model with the following command:

CUDA_LAUNCH_BLOCKING=1 python3 -m vllm.entrypoints.api_server --model TheBloke/OpenHermes-2.5-Mistral-7B-AWQ --quantization awq --dtype half --max-model-len 512

During startup I received several warnings and info logs, including a note that AWQ quantization is not fully optimized yet and a message about capturing the model for CUDA graphs (see the full log at the end of this post).

Error Encounter:
After the server started on http://0.0.0.0:8000, attempting to access any API endpoint results in a 404 Not Found error. This includes simple model queries and chat completions.

For example:

curl http://localhost:8000/v1/models

returns {"detail":"Not Found"}.

Similarly, posting a chat completion request also yields the same Not Found response.
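For reference, the chat-completion request I'm posting looks roughly like this (the message content and max_tokens are just placeholder values):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TheBloke/OpenHermes-2.5-Mistral-7B-AWQ",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 64
      }'

It returns the same {"detail":"Not Found"} body as the GET request above.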

Comparison to the full model:
When I run the full (non-quantized) model, it prints a metrics line like this every few seconds:

INFO 03-17 14:08:17 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

The AWQ model never enters this metrics loop; it just sits idle and responds with 404 Not Found as soon as a request is made.

Attempts to Resolve:

  • Ensured that the model and tokenizer are correctly named and accessible.
  • Checked for any typos in the API endpoint paths.
  • Searched for similar issues in documentation and forums without success.
  • Tried prefixing the path with "api" or "app" after the port, as hinted for some "dev" path issues (see the example requests below).
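The prefixed variants I tried looked like this (both return the same {"detail":"Not Found"} response, as the log at the end shows):

curl http://localhost:8000/api/v1/models
curl http://localhost:8000/app/v1/models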

Maybe this is related:
https://stackoverflow.com/questions/64019054/fastapi-app-results-in-404-error-response-when-it-is-started-using-uvicorn-run

Any idea whether this is a problem with the model or with vLLM itself?

This is my first post here, so I appreciate your patience and any guidance you can offer. Thank you!


Full vLLM initialization log:

~$ CUDA_LAUNCH_BLOCKING=1 python3 -m vllm.entrypoints.api_server --model TheBloke/OpenHermes-2.5-Mistral-7B-AWQ --quantization awq --dtype half --max-model-len 512
WARNING 03-17 14:10:01 config.py:193] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 03-17 14:10:01 llm_engine.py:87] Initializing an LLM engine with config: model='TheBloke/OpenHermes-2.5-Mistral-7B-AWQ', tokenizer='TheBloke/OpenHermes-2.5-Mistral-7B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-17 14:10:04 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 03-17 14:10:06 llm_engine.py:357] # GPU blocks: 8714, # CPU blocks: 2048
WARNING 03-17 14:10:06 cache_engine.py:103] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 03-17 14:10:06 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-17 14:10:06 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-17 14:10:13 model_runner.py:756] Graph capturing finished in 6 secs.
INFO:     Started server process [68189]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:51422 - "POST /v1/chat/completions HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:59678 - "GET /v1/models HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:47408 - "GET /api/v1/models HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:36870 - "GET /app/v1/models HTTP/1.1" 404 Not Found
