Is there any vllm support for this version?
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (15424). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
Hi @Aloukik21!
I think this model should be supported in vllm - feel free to open an issue directly on their repository: https://github.com/vllm-project/vllm
@Aloukik21 were you able to resolve it? I am also stuck on this.
@Navanit-shorthills In case it helps, for me it works with this in "offline_inference.py":
llm = LLM(model="mistralai/Mistral-7B-v0.1", max_model_len=20000, gpu_memory_utilization=0.9)
Depending on what GPU you're using, you can adjust max_model_len.
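In case a complete script is useful, here is a minimal self-contained sketch of that offline setup; the prompt and sampling settings are just illustrative placeholders, not something prescribed by the vLLM docs or this thread:

from vllm import LLM, SamplingParams

# Cap the context length so the KV cache fits within the GPU memory budget;
# 20000 and 0.9 are the example values from above, so tune them for your GPU.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    max_model_len=20000,
    gpu_memory_utilization=0.9,
)

# Illustrative prompt and sampling settings (assumptions, not from the thread).
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain what a KV cache is in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)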
@EricMMD Thank you! Is max_model_len=20000 arbitrary, or is it simply the max number of tokens I can expect to run inference over?
I checked the documentation on this parameter and it says: model context length. If unspecified, will be automatically derived from the model.
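My reading of that (not an official statement) is: leave it unset and the engine uses the model's full context length, which for Mistral-7B is the 32768 that triggered the KV-cache error at the top of this thread; set it lower and you cap how many tokens a single request (prompt plus generation) can use, so the engine only needs a KV cache sized for that length.

# Unset: max_model_len defaults to the model's context length (32768 here),
# which is what produced the "larger than ... KV cache" error above.
llm_default = LLM(model="mistralai/Mistral-7B-v0.1", gpu_memory_utilization=0.9)

# Set explicitly: prompt + generated tokens per request are capped at 20000.
llm_capped = LLM(model="mistralai/Mistral-7B-v0.1", max_model_len=20000, gpu_memory_utilization=0.9)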
I have the same issue when using the vllm docker container to run the model.
Is there a way to specify the argument gpu_memory_utilization=0.9 in the vLLM docker command? When I execute this command:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-Instruct-v0.2 --gpu_memory_utilization 0.9
Then I get the following error:
api_server.py: error: unrecognized arguments: --gpu_memory_utilization 0.9
This is the correct argument for the CLI: --gpu-memory-utilization
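For completeness, the same docker command with the dash-separated flag looks like this; the --max-model-len line is optional, and its value of 20000 is just the example from earlier in the thread:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 20000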
@Aloukik21 Thanks!