ChatML without SystemPrompt doesn't work well

#2
by andreP - opened

The model falls back into completion mode on "1+1=" when no explicit system prompt is given.

$ curl http://ai1.dev.init:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
"messages": [{"role": "user", "content": "1+1="}],
"temperature": 0.2,
"top_p": 0.1,
"top_k": 20,
"frequency_penalty": 0.2
}'
{"id":"cmpl-8a1978632c814956a87b19832220daf9","object":"chat.completion","created":60034,"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","choices":[{"index":0,"message":{"role":"assistant","content":"2\n\n1+1等于2。\n\n请注意,在这个 答案中,我使用了自然语言处理技术来理解您的问题,并提供了一个简单的数学答案。如果您有其他问题或需要更详细的解释,请随时告诉我。"},"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":96,"completion_tokens":82}}

(The Chinese part roughly translates to: "1+1 equals 2. Please note that in this answer I used natural language processing techniques to understand your question and provided a simple mathematical answer. If you have other questions or need a more detailed explanation, just let me know.") With other sampling parameters the model starts rambling and never stops.
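
To see what the model actually receives in both cases, one can render the same messages locally with the tokenizer's ChatML chat template - as far as we know, that is also what vLLM's /v1/chat/completions endpoint uses by default. A rough sketch (ours, not part of the tests above):

# Sketch: reproduce the prompt string that the chat endpoint builds.
# Assumes the tokenizer ships a ChatML chat_template; without a system
# message, no system block should be emitted at all.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

no_system = [{"role": "user", "content": "1+1="}]
with_system = [{"role": "system", "content": "Just ultrashort responses!"},
               {"role": "user", "content": "1+1="}]

# tokenize=False returns the raw prompt string; add_generation_prompt=True
# appends the assistant header that the server adds before generating.
print(tok.apply_chat_template(no_system, tokenize=False, add_generation_prompt=True))
print(tok.apply_chat_template(with_system, tokenize=False, add_generation_prompt=True))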

It works OK with "1+1" without trailing =
This seems to trigger completion mode without reaching stop token.

curl http://ai1.dev.init:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
"messages": [{"role":"system", "content": "Just ultrashort responses!"}, {"role": "user", "content": "1+1="}],
"temperature": 0.2,
"top_p": 0.1,
"top_k": 20,
"frequency_penalty": 0.2
}'
{"id":"cmpl-89c1d1402873495abe254658ba342621","object":"chat.completion","created":60621,"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","choices":[{"index":0,"message":{"role":"assistant","content":"2"},"finish_reason":"stop"}],"usage":{"prompt_tokens":27,"total_tokens":29,"completion_tokens":2}}

A second run with a system prompt (whatever its content) also works:

$ curl http://ai1.dev.init:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
"messages": [{"role":"system", "content": "Just ultrashort responses!"}, {"role": "user", "content": "1+1="}],
"temperature": 0.2,
"top_p": 0.1,
"top_k": 20,
"frequency_penalty": 0.2
}'
{"id":"cmpl-3c4e1c863acc4b2ea7770467f8297c9d","object":"chat.completion","created":60914,"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO","choices":[{"index":0,"message":{"role":"assistant","content":"2"},"finish_reason":"stop"}],"usage":{"prompt_tokens":27,"total_tokens":29,"completion_tokens":2}}

We serve the model with vLLM's OpenAI-compatible server (vllm/vllm-openai) via Docker Compose:

version: '3.8'
services:
  vllm-nous-hermes-mixtral-instruct:
    image: vllm/vllm-openai
    container_name: vllm-nous-hermes-mixtral-instruct
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [ gpu ]
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1,2,3
    volumes:
      - /mnt/sda/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command:
      - --model=NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO
      - --gpu-memory-utilization=0.6
      - --tensor-parallel-size=4
    restart: unless-stopped

This is just a minimal example; we don't want to misuse the model as a calculator for 1+1 ;)
We had other prompts where the model also kept producing nonsense, much like the early completion-only GPT models.

Maybe you should add some more behavioural training data for the DPO stage?

NousResearch org
edited Jan 17

I think something is wrong in general with vLLM inference, but I am not sure what.

teknium changed discussion status to closed

Hmm, OK... that's one way to handle it.
Other Mixtral models, like the original Mixtral or Sauerkraut-Mixtral, don't have this problem at all with the same inference-server setup.
Just adding a system message also changes its behaviour. Please try it on your side with your setup, whatever that is.
vLLM is one of the most widely used server-side inference servers besides TensorRT; local playground servers like ollama/llama.cpp are not very relevant for production without continuous batching.

NousResearch org

You can compare your outputs with those of HuggingChat or LM Studio; they are fine without a system prompt. It is vLLM, but I don't know why.

First try on HuggingChat (with no previous system message or chat history):

1+1=

Output:

2

1+1等于2。

We see the same problem - it again answers partly in Chinese ("1+1 equals 2") - even though the output is shorter; I don't know what sampling params they use, though.
So it's not vLLM. Test UIs or local playground servers aren't really relevant anyway: without proper support for real production inference servers, the model is not very useful.

I think the model isn't trained sufficiently for the case of a default or missing system prompt; it falls into strange patterns.

It is what it is; I didn't expect you to change this model. It's just feedback, so that you can hopefully improve future models. I'm a huge fan of your work.
