Error: This model's maximum context length is 2000 tokens

#17
by joanp - opened

Hi all,

I am trying to run this model with vLLM, but the server reports a maximum context length of 2K tokens, which does not match what I have been reading about this model's context window. Any advice is welcome. This is how I launch the server:

MODEL=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
PORT=8001

python -m vllm.entrypoints.openai.api_server --model $MODEL --port $PORT --dtype half --enforce-eager \
--quantization gptq \
--max-model-len 4000 \
--gpu-memory-utilization 0.80
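
For reference, this is a minimal sketch (not from the original post, assuming the standard transformers AutoConfig API) of how one can check the context length the repo's config.json itself advertises; it only downloads the config, not the quantized weights:

# Sanity-check sketch (assumption: standard transformers AutoConfig API).
# Downloads only config.json and prints the context length the repo advertises.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ")
print(cfg.max_position_embeddings)  # the base Mixtral-8x7B config reports 32768 here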

Error:

This model's maximum context length is 2000 tokens. However, you requested 2662 tokens (1662 in the messages, 1000 in the completion). Please reduce the length of the messages or completion.
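
The arithmetic in the message is 1662 prompt tokens + 1000 requested completion tokens = 2662 > 2000. As a temporary workaround only (it does not explain why the limit is 2000 instead of the requested 4000), capping max_tokens so the total stays under the reported limit avoids the error. A minimal sketch with the OpenAI Python client, where the base_url, api_key, and message content are placeholder assumptions:

# Workaround sketch (assumptions: openai>=1.0 client, server listening on localhost:8001).
# Keeps prompt + completion within the 2000-token limit the server currently reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    messages=[{"role": "user", "content": "..."}],
    max_tokens=300,  # 1662 prompt + 300 completion = 1962 <= 2000
)
print(resp.choices[0].message.content)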
