Error: This model's maximum context length is 2000 tokens
#17 opened by joanp
Hi all,
I am trying to run this model with vLLM, but I get an error saying the maximum context length is 2K, which does not align with what I have been reading. Any advice is welcome.
MODEL=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
PORT=8001
python -m vllm.entrypoints.openai.api_server --model $MODEL --port $PORT --dtype half --enforce-eager \
--quantization gptq \
--max-model-len 4000 \
--gpu-memory-utilization 0.80
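The request that triggers it looks roughly like this (a sketch using the openai Python client; the prompt is just a placeholder for my actual ~1662-token input):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8001/v1",  # same port as the server above
    api_key="EMPTY",  # placeholder; my server isn't configured with an API key
)

response = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    messages=[{"role": "user", "content": "<my ~1662-token prompt>"}],
    max_tokens=1000,  # completion budget; 1662 prompt + 1000 completion = 2662 > 2000
)
print(response.choices[0].message.content)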
Error:
This model's maximum context length is 2000 tokens. However, you requested 2662 tokens (1662 in the messages, 1000 in the completion). Please reduce the length of the messages or completion.
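I can't tell where the 2000 comes from, since Mixtral is supposed to support a much longer context. To rule out the checkpoint itself, something like this (untested sketch) should print what the repo's config reports:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ")
print(config.max_position_embeddings)  # I'd expect 32768 for Mixtral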