Issue with {"detail":"Not Found"} Responses for All Requests When Running Model in vLLM


Hi!

I'm running into a persistent issue where every request to a model served with vLLM returns a {"detail":"Not Found"} response. Below are the details of my setup and the error messages I'm receiving.

Setup:

I initiated the model with the following command:

CUDA_LAUNCH_BLOCKING=1 python3 -m vllm.entrypoints.api_server --model TheBloke/OpenHermes-2.5-Mistral-7B-AWQ --quantization awq --dtype half --max-model-len 512

During startup I received several warnings and info logs, including a note that AWQ quantization is not fully optimized yet and a message about capturing the model for CUDA graphs (see the full log at the end of this post).

Error Encounter:
After the server started on http://0.0.0.0:8000, attempting to access any API endpoint results in a 404 Not Found error. This includes simple model queries and chat completions.

For example:

curl http://localhost:8000/v1/models

returns {"detail":"Not Found"}.

Similarly, posting a chat completion request also yields the same Not Found response.
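For reference, the chat-completion request I'm posting looks roughly like this (the message content and max_tokens are just placeholder values):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TheBloke/OpenHermes-2.5-Mistral-7B-AWQ",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 64
      }'

It returns the same {"detail":"Not Found"} body as the GET request above.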

Comparison to the full model:
When I run the full (non-quantized) model, it prints a metrics line like this every few seconds:

INFO 03-17 14:08:17 metrics.py:213] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

The AWQ model never enters this metrics loop; it just sits idle and responds with 404 Not Found as soon as a request is made.

Attempts to Resolve:

  • Ensured that the model and tokenizer are correctly named and accessible.
  • Checked for any typos in the API endpoint paths.
  • Searched for similar issues in documentation and forums without success.
  • Tried prefixing the path with "api" or "app" after the port, as hinted for some "dev" path issues (see the example requests below).
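The prefixed variants I tried looked like this (both return the same {"detail":"Not Found"} response, as the log at the end shows):

curl http://localhost:8000/api/v1/models
curl http://localhost:8000/app/v1/models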

Maybe this is related:
https://stackoverflow.com/questions/64019054/fastapi-app-results-in-404-error-response-when-it-is-started-using-uvicorn-run

Any idea whether this is a problem with the model or with vLLM itself?

This is my first post here, so I appreciate your patience and any guidance you can offer. Thank you!


Full vLLM initialization log:

~$ CUDA_LAUNCH_BLOCKING=1 python3 -m vllm.entrypoints.api_server --model TheBloke/OpenHermes-2.5-Mistral-7B-AWQ --quantization awq --dtype half --max-model-len 512
WARNING 03-17 14:10:01 config.py:193] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 03-17 14:10:01 llm_engine.py:87] Initializing an LLM engine with config: model='TheBloke/OpenHermes-2.5-Mistral-7B-AWQ', tokenizer='TheBloke/OpenHermes-2.5-Mistral-7B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 03-17 14:10:04 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 03-17 14:10:06 llm_engine.py:357] # GPU blocks: 8714, # CPU blocks: 2048
WARNING 03-17 14:10:06 cache_engine.py:103] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 03-17 14:10:06 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-17 14:10:06 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-17 14:10:13 model_runner.py:756] Graph capturing finished in 6 secs.
INFO:     Started server process [68189]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:51422 - "POST /v1/chat/completions HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:59678 - "GET /v1/models HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:47408 - "GET /api/v1/models HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:36870 - "GET /app/v1/models HTTP/1.1" 404 Not Found
