Error: This model's maximum context length is 2000 tokens
#17 opened by joanp
Hi all,
I am trying to run this model with vLLM, but I get an error saying the maximum context length is 2K, which does not align with what I have been reading. Any advice is welcome.
MODEL=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
PORT=8001
python -m vllm.entrypoints.openai.api_server --model $MODEL --port $PORT --dtype half --enforce-eager \
--quantization gptq \
--max-model-len 4000 \
--gpu-memory-utilization 0.80
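The request that triggers it looks roughly like this (a sketch using the openai Python client; the prompt is just a placeholder for my actual ~1662-token input):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8001/v1",  # same port as the server above
    api_key="EMPTY",  # placeholder; my server isn't configured with an API key
)

response = client.chat.completions.create(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    messages=[{"role": "user", "content": "<my ~1662-token prompt>"}],
    max_tokens=1000,  # completion budget; 1662 prompt + 1000 completion = 2662 > 2000
)
print(response.choices[0].message.content)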
Error:
This model's maximum context length is 2000 tokens. However, you requested 2662 tokens (1662 in the messages, 1000 in the completion). Please reduce the length of the messages or completion.
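I can't tell where the 2000 comes from, since Mixtral is supposed to support a much longer context. To rule out the checkpoint itself, something like this (untested sketch) should print what the repo's config reports:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ")
print(config.max_position_embeddings)  # I'd expect 32768 for Mixtral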