Is there any vllm support for this version?

#49 by Aloukik21

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (15424). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Hi @Aloukik21 !
I think this model should be supported in vllm - feel free to open an issue directly on their repository: https://github.com/vllm-project/vllm

@Aloukik21 were you able to resolve it?
I am also stuck on this.

@Navanit-shorthills
If it helps, for me it works with this in "offline_inference.py":
llm = LLM(model="mistralai/Mistral-7B-v0.1", max_model_len=20000, gpu_memory_utilization=0.9)
Depending on what GPU you're using, you can adjust max_model_len.
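For reference, a minimal self-contained sketch along those lines (the prompt and sampling settings are just placeholders; max_model_len=20000 and gpu_memory_utilization=0.9 are the values from above and should be tuned to your GPU):

from vllm import LLM, SamplingParams

# Cap the context length so the KV cache fits in GPU memory; lower
# max_model_len or raise gpu_memory_utilization if the "max seq len is
# larger than the KV cache" error appears.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    max_model_len=20000,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(["Tell me about the Mistral 7B model."], sampling_params)
print(outputs[0].outputs[0].text)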

Is max_model_len=20000 arbitrary, or is it simply the maximum number of tokens I can expect to run inference over?

I checked the documentation on this parameter and it says: model context length; if unspecified, it will be automatically derived from the model. Here that means it defaults to the model's full 32768-token context, which is larger than the 15424 tokens of KV cache the engine could allocate, hence the error; lowering max_model_len (or raising gpu_memory_utilization) avoids it.

I have the same issue when using the vllm docker container to run the model.
Is there a way to specify the argument gpu_memory_utilization=0.9 in vllm's docker command? When I execute the docker command:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2  --gpu_memory_utilization 0.9

Then I get the following error:

api_server.py: error: unrecognized arguments: --gpu_memory_utilization 0.9

This is the correct argument for the CLI: --gpu-memory-utilization (hyphens instead of underscores); see the corrected command below.
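Roughly like this, reusing the command from above with only the flag name changed (the optional --max-model-len flag mirrors the max_model_len argument from the Python API; keep your own token and model):

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2 --gpu-memory-utilization 0.9 --max-model-len 20000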
