Doesn't follow instructions very well

#3
by andreP - opened

Hi,

I'm trying to summarize Markdown tables like this (Markdown table content excluded):

INFO 12-22 07:50:41 async_llm_engine.py:379] Received request cmpl-42304499db4546e9957718a83f18c31f: prompt: '<s>GPT4 Correct System: Du antwortest kurz und präzise.<|end_of_turn|>GPT4 Correct User: Erstelle eine kurze ZUSAMMENFASSUNG (High Level Überblick in 2-3 kurzen Sätzen) zur folgenden Tabelle (Markdown): .....MARKDOWN-TABLE....<|end_of_turn|>GPT4 Correct Assistant:', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=3425, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)

I go through vLLM and it applies your tokenizer.chat_template correctly, as far as I can see.
This LLM just repeats the table back as a more or less jumbled mess.
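For what it's worth, this is roughly how I double-check what the chat template renders outside of vLLM. It's only a minimal sketch using transformers' apply_chat_template; the table content is just a placeholder, not my real data:

# Sketch: inspect what tokenizer.chat_template renders for this model,
# independent of vLLM (requires `pip install transformers`).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("VAGOsolutions/SauerkrautLM-SOLAR-Instruct")

messages = [
    {"role": "system", "content": "Du antwortest kurz und präzise."},
    {"role": "user", "content": "Erstelle eine kurze ZUSAMMENFASSUNG (High Level "
                                "Überblick in 2-3 kurzen Sätzen) zur folgenden "
                                "Tabelle (Markdown): ...MARKDOWN-TABLE..."},
]

# add_generation_prompt=True appends the assistant header, as vLLM does
# before generating.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)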

ChatGPT handles it just fine, and even the smaller, lower-ranked OpenChat 3.5 1210 works perfectly for this.
I've tried many different prompts and tables; it just isn't doing very well on text-analysis tasks.

Best regards,
André

VAGO solutions org

Hey andreP,

Thanks for testing and your feedback!
Your template doesn't look quite like the one we provide in the model card or tokenizer_config. But I haven't used vLLM before, so maybe it applies the template in the background?
Regarding your parameters: please test temp = 0.3 - 0.5, top_p = 0.9, top_k = 20 and repetition penalty = 1.15.
Please let us know if this worked for you.
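I haven't used vLLM myself, but going by the SamplingParams fields visible in your log, the recommendation would map to something like this (untested sketch with vLLM's offline API; the prompt string is only a placeholder):

# Untested sketch: the recommended sampling values expressed with vLLM's
# offline API, mirroring the SamplingParams fields shown in the log above.
from vllm import LLM, SamplingParams

llm = LLM(model="VAGOsolutions/SauerkrautLM-SOLAR-Instruct", gpu_memory_utilization=0.8)

params = SamplingParams(
    temperature=0.4,          # recommended range: 0.3 - 0.5
    top_p=0.9,
    top_k=20,
    repetition_penalty=1.15,
    max_tokens=512,           # arbitrary cap for this sketch
)

# The prompt must already be formatted with the model's chat template.
prompt = "...formatted prompt..."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)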

Hi,

sorry, I sent the wrong prompt log above; that one was from OpenChat (I compared your model against it).
vLLM's template application looks right:

INFO 12-23 11:23:06 async_llm_engine.py:379] Received request cmpl-c944d8ddf21a40e29a5256a785fa7f91: prompt: '### System:\nDu antwortest kurz und präzise.\n\n### User:\nErstelle eine kurze ZUSAMMENFASSUNG (High Level Überblick in 2-3 kurzen Sätzen) zur folgenden Tabelle (Markdown): ......MARKDOWN........\n\n### Assistant:\n', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=1.15, repetition_penalty=1.0, temperature=0.4, top_p=0.9, top_k=20, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5772, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: ....

vLLM just uses the tokenizer.chat_template and offers an OpenAI-compatible chat endpoint.
For testing vLLM, just use this docker-compose.yml (adapt the volume and GPU devices/memory):

version: '3.8'
services:
  vllm-sauerkraut:
    image: vllm/vllm-openai:latest
    container_name: vllm-sauerkraut
    environment:
      - HUGGING_FACE_HUB_TOKEN=
      - NVIDIA_VISIBLE_DEVICES=1
    volumes:
      - /mnt/sda/vllm:/root/.cache/huggingface
    ports:
      - "8002:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [ gpu ]
    restart: unless-stopped
    command:
      - --model=VAGOsolutions/SauerkrautLM-SOLAR-Instruct
      - --gpu-memory-utilization=0.8

And this curl (sorry, I can't provide my Markdown, it's private data):

$ curl --location --insecure --request POST 'http://ai1.dev.init:8002/v1/chat/completions' \
  -H "Content-Type: application/json" \
  -d '{
    "model": "VAGOsolutions/SauerkrautLM-SOLAR-Instruct",
    "messages": [{"role": "system", "content": "Du antwortest kurz und präzise."},{"role": "user", "content": "Erstelle eine kurze ZUSAMMENFASSUNG (High Level Überblick in 2-3 kurzen Sätzen) zur folgenden Tabelle (Markdown): ......MARKDOWN....."}],
    "temperature": 0.4,
    "top_p": 0.9,
    "top_k": 20,
    "frequency_penalty": 1.15
  }'
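The same request through the openai Python client (v1.x) would look roughly like this; top_k isn't a standard OpenAI field, so it has to go through extra_body (sketch, not what I actually ran):

# Sketch: the curl above expressed with the openai Python client (>= 1.0).
# The api_key is a dummy value since this vLLM server isn't configured with one;
# top_k is not part of the official OpenAI API, so it is sent via extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://ai1.dev.init:8002/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="VAGOsolutions/SauerkrautLM-SOLAR-Instruct",
    messages=[
        {"role": "system", "content": "Du antwortest kurz und präzise."},
        {"role": "user", "content": "Erstelle eine kurze ZUSAMMENFASSUNG (High Level "
                                    "Überblick in 2-3 kurzen Sätzen) zur folgenden "
                                    "Tabelle (Markdown): ...MARKDOWN..."},
    ],
    temperature=0.4,
    top_p=0.9,
    frequency_penalty=1.15,
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)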

I changed the params around a bit; yes, the output changes, but it doesn't really get much better and the behaviour is very unstable.
I haven't checked whether the SOLAR base model also has this problem and you simply inherited it.
OpenChat also answers perfectly with these (same) params. In my experience, these params can improve a model's responses, but if a model is very bad at the defaults, it won't become really good through magic sampling params.

Maybe you haven't overtrained on benchmark test data (as you wrote), but many models on the leaderboard are just optimizing for benchmark metrics anyway. The benchmarks just don't test all aspects (and never can).

VAGO solutions org

Hi andreP,
Many thanks for the constructive feedback! I personally noticed that the model does not perform so well with a system prompt in the template. Could you do me a favor and test the whole thing again without a system prompt? Instead, include the content of the system prompt in the user prompt.
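In terms of the messages payload, that would mean something like this (sketch, with the texts from this thread):

# Sketch: drop the separate system message and prepend its content to the
# first user message instead (texts taken from the thread above).
system_text = "Du antwortest kurz und präzise."
user_text = (
    "Erstelle eine kurze ZUSAMMENFASSUNG (High Level Überblick in 2-3 kurzen "
    "Sätzen) zur folgenden Tabelle (Markdown): ...MARKDOWN..."
)

messages = [
    # no {"role": "system", ...} entry at all
    {"role": "user", "content": f"{system_text}\n\n{user_text}"},
]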

Hi,

Thanks for your feedback. I've tested it based on your suggestions. Indeed, if I combine the system prompt with the first user message and use your recommended sampling parameters, it works. However, if I alter any part of this setup—for example, by reducing the frequency_penalty, temperature, or top_p settings, or by separating the system message—then it doesn't work. It's extremely sensitive to these factors.

When I use your VAGOsolutions/SauerkrautLM-Mixtral-8x7B-Instruct, I don't encounter this issue and can vary the sampling parameters broadly without receiving bogus responses (though it doesn't support separate system prompts, which is one less variable to consider). The disadvantage is that it's 4-5 times slower with vLLM AWQ (unquantized doesn't fit onto my L40).

Thanks for the feedback. I won't be using this model; it's just too sensitive to sampling and prompt settings for my text-analysis tasks.

Looking forward to your future work! It's impressive.

DavidGF changed discussion status to closed
