System prompts ignored in chat completions

#51

by joshuaturner - opened May 1, 2024

May 1, 2024

From https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/discussions/11 :

As of the most recent upload, the template in the published quants lists the chat template as:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}

...which has the net result of ignoring any system prompt passed in.
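To make the effect concrete, here is a minimal sketch that mirrors the branches of the template quoted above in plain Python. It is not Jinja and not the tokenizer's own code; the function name `render_phi3_may2024` and the `<s>` value for `bos_token` are assumptions made for illustration.

```python
def render_phi3_may2024(messages, bos_token="<s>"):
    """Plain-Python mirror of the quoted template (bos_token value is assumed)."""
    out = bos_token
    for message in messages:
        if message["role"] == "user":
            # User turn, followed immediately by the assistant header.
            out += "<|user|>\n" + message["content"] + "<|end|>\n<|assistant|>\n"
        elif message["role"] == "assistant":
            out += message["content"] + "<|end|>\n"
        # Any other role, including "system", matches neither branch
        # and is silently dropped.
    return out


messages = [
    {"role": "system", "content": "Answer in French."},
    {"role": "user", "content": "Hello!"},
]
prompt = render_phi3_may2024(messages)
print(repr(prompt))
# '<s><|user|>\nHello!<|end|>\n<|assistant|>\n'
```

The system message never reaches the rendered prompt, and no error or warning is raised along the way.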

The breaking change is commit 300945e90b6f55d3cb88261c8e5333fae696f672.

jrc

May 1, 2024

I also have this problem!

gugarosa

Microsoft org May 1, 2024

The model has not been optimized for the system instruction and produces better generations without it.

That’s why we opted to remove any reference to the system role altogether. Try appending it to your first user prompt; that should work better than a separate system instruction.

gugarosa changed discussion status to closed May 1, 2024

joshuaturner

May 1, 2024

Perhaps a discussion rather than simply closing the issues is in order.

Why do you feel that ignoring parameters from the user is better than conforming to the API contract? Would revising the template to treat the system prompt as an additional user prompt not achieve the goal you set out in the thread on the GGUF repo?

jrc

May 1, 2024

I second this^

The user has an expectation that system prompts will be used if they are included in a given dataset. I’d prefer an approach like the one outlined above for GGUF, or, if you’re going to break this contract completely, it should be widely publicized on the model card.

gugarosa

Microsoft org May 2, 2024

This comment has been hidden

gugarosa changed discussion status to open May 2, 2024

jrc

May 2, 2024

@gugarosa Thanks for the follow-up - very eager to hear the report from the MSFT team responsible for finetuning of Phi-3.

(FYI, I believe only repository admins are able to re-open closed Discussions)

joshuaturner

May 2, 2024

@jrc is correct; we don't have the ability to re-open closed discussions.

In my application, I've used "microsoft/Phi-3" as a magic string to change behaviour: I place the system prompt in a <|user|> block before the rest of the conversation. It seems to work acceptably, and could be implemented in the Jinja template by swapping out:

{% if (message['role'] == 'user') %}

with

{% if (message['role'] == 'user' or message['role'] == 'system') %}
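As a rough illustration of what that swap would render, the sketch below mirrors the template's branches in plain Python with the widened condition. It is not Jinja itself; the function name `render_with_swap` and the `<s>` value for `bos_token` are assumptions.

```python
def render_with_swap(messages, bos_token="<s>"):
    """Plain-Python mirror of the template with 'system' routed to the user branch."""
    out = bos_token
    for message in messages:
        if message["role"] in ("user", "system"):
            # Same body as the original user branch, now also used for system.
            out += "<|user|>\n" + message["content"] + "<|end|>\n<|assistant|>\n"
        elif message["role"] == "assistant":
            out += message["content"] + "<|end|>\n"
    return out


messages = [
    {"role": "system", "content": "Answer in French."},
    {"role": "user", "content": "Hello!"},
]
swapped = render_with_swap(messages)
print(repr(swapped))
# '<s><|user|>\nAnswer in French.<|end|>\n<|assistant|>\n<|user|>\nHello!<|end|>\n<|assistant|>\n'
```

One thing this makes visible: because the user branch also emits the `<|assistant|>` header, the system turn is followed by an assistant header with no content. Whether that affects generation quality is not something this thread settles.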

gugarosa

Microsoft org May 2, 2024 • edited May 2, 2024

"@jrc is correct; we don't have the ability to re-open closed discussions."

Oh god, 100% my bad then, I thought everyone was able to re-open a discussion. Well, now that I know this, I will stop closing them lol

gugarosa

Microsoft org May 2, 2024

We are doing some ablations comparing two options: including the system prompt as an additional <|user|> turn, and prepending it to the first <|user|> turn.

Will let you know the results soon!
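For readers who want to try either option on the client side before the messages reach the chat template, here is a minimal sketch of both transformations. The helper names and the blank-line separator are assumptions for illustration; they are not part of any Microsoft or Hugging Face API, and the thread does not specify how the ablation itself was set up.

```python
def system_as_user_turn(messages):
    """Option 1: re-label each system message as its own user message."""
    return [
        {**m, "role": "user"} if m["role"] == "system" else dict(m)
        for m in messages
    ]


def system_into_first_user(messages, sep="\n\n"):
    """Option 2: prepend the system text to the first user message.

    The separator is an assumption. If there is no user message at all,
    the system text is dropped.
    """
    system_text = sep.join(m["content"] for m in messages if m["role"] == "system")
    merged, done = [], False
    for m in messages:
        if m["role"] == "system":
            continue
        m = dict(m)  # copy so the caller's list is not mutated
        if m["role"] == "user" and not done and system_text:
            m["content"] = system_text + sep + m["content"]
            done = True
        merged.append(m)
    return merged


messages = [
    {"role": "system", "content": "Answer in French."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Bonjour !"},
    {"role": "user", "content": "How are you?"},
]
as_turn = system_as_user_turn(messages)
merged = system_into_first_user(messages)
print([m["role"] for m in as_turn])  # ['user', 'user', 'assistant', 'user']
print(merged[0]["content"])
```

Both helpers return new lists, so the same conversation can be rendered both ways and compared.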

jrc

May 15, 2024

Following up on this @gugarosa - any results to share?

joshuaturner

May 21, 2024

I'm hopping from foot to foot as well. Would love to remove this model-specific hack from my inference app.

halilergul1

May 29, 2024

Hi, any update to this issue? Thanks in advance

jrc

Jun 4, 2024

Hi @gugarosa (or someone from the HF / Microsoft team),

Pinging this thread again - I'm a maintainer on torchtune, where we've included some versions of the Phi-3 model for users to finetune. Currently we include the system prompt, since this is what the paper and the original model did, but this means that our users will not get the same results as users of Hugging Face's SFT Trainer. This has been a source of confusion and silent errors.

It would be helpful to have an official recommendation - preferably with the aforementioned ablation results - on how we should handle the system prompt.

Thanks!

nguyenbh

Microsoft org Jul 1, 2024

Thank you all for your feedback! We recently updated the model, which now allows the system prompt. We would love to continue receiving your comments and suggestions.

aladar

Jul 7, 2024 • edited Jul 7, 2024

Thanks @nguyenbh! Can you share the general conclusions of the ablation per @gugarosa's comment? In general, should we be using the system prompt?

"We are doing some ablations comparing two options: including the system prompt as an additional <|user|> turn, and prepending it to the first <|user|> turn. Will let you know the results soon!"

nguyenbh

Microsoft org Jul 8, 2024

@aladar With the latest update (June 2024), you can use the system prompt. The example in the model card can be a starting point.

aieat

Jul 12, 2024

Hello @nguyenbh, and thank you for adding support for the system prompt. Do you know if this change will be propagated to the larger-context variants of Phi-3, i.e. the 128k mini and medium? Currently the change does not appear to be there yet.

nguyenbh

Microsoft org Jul 12, 2024

@aieat Thank you for your interest in the Phi-3 model family.
The change has been propagated to Mini-128K. The other models have not been updated.

nguyenbh changed discussion status to closed Jul 17, 2024

WelcomeAIOverlords

Oct 11, 2024 • edited Oct 11, 2024

It's not as good as a proper ablation study, but I did an experiment on a single dataset exploring some of the questions in this thread.

I am doing LoRA fine-tuning with torchtune. My dataset has input/output pairs. I also have a prompt and few-shot examples.
For example, let's say my training samples look like:

input, output
Input 1, Output 1
Input 2, Output 2

And my few-shot examples are like:

input, output
Example Input 1, Example Output 1
Example Input 2, Example Output 2

And I have the prompt, "My awesome prompt."

In the image below, you'll see LoRA loss curves on the training set with the following color code:

Red: No prompt or few-shot examples, just input/output pairs
- I.e., the model is trained on a string like: <|user|>Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
Blue: The prompt and few-shot examples smushed into the training example's <|user|> input
- Strings like: <|user|>My awesome prompt. Example Input 1\n Example Output 1\n ... Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
Green: Proper adherence to the template, with the prompt in <|system|>, the few-shot examples in <|user|>...<|assistant|> pairs, and then a final <|user|> / <|assistant|> pair for the training example.
- Strings like: <|system|>My awesome prompt.<|end|>\n<|user|>Example Input 1<|end|>\n<|assistant|>Example Output 1<|end|>\n...Input 1<|end|>\n<|assistant|>Output 1<|end|>\n<|endoftext|>
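As a sketch of how the three variants differ, the helpers below build the raw strings described in the list. This is only an illustration of the string layouts, not torchtune's API or the code used in the experiment; the function names are hypothetical, and the way the few-shot examples are joined in the blue variant is a guess at the elided part of the example string.

```python
END = "<|end|>\n"
EOT = "<|endoftext|>"


def turn(role, text):
    """One template turn: role tag, text, end marker, newline."""
    return f"<|{role}|>{text}{END}"


def red(inp, out):
    # No prompt or few-shot examples, just the input/output pair.
    return turn("user", inp) + turn("assistant", out) + EOT


def blue(prompt, shots, inp, out):
    # Prompt and few-shot examples folded into a single user turn.
    # The "\n " joiner is an assumption based on the example string.
    shot_text = "".join(f"{i}\n {o}\n " for i, o in shots)
    return turn("user", f"{prompt} {shot_text}{inp}") + turn("assistant", out) + EOT


def green(prompt, shots, inp, out):
    # Prompt in a system turn, each few-shot example as its own
    # user/assistant pair, then the training example as the final pair.
    s = turn("system", prompt)
    for i, o in shots:
        s += turn("user", i) + turn("assistant", o)
    return s + turn("user", inp) + turn("assistant", out) + EOT


shots = [
    ("Example Input 1", "Example Output 1"),
    ("Example Input 2", "Example Output 2"),
]
red_s = red("Input 1", "Output 1")
blue_s = blue("My awesome prompt.", shots, "Input 1", "Output 1")
green_s = green("My awesome prompt.", shots, "Input 1", "Output 1")
for name, s in [("red", red_s), ("blue", blue_s), ("green", green_s)]:
    print(name, repr(s))
```

Blue and green carry the same information; the difference is only whether it sits inside one user turn or is spread across system, user, and assistant turns.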

My conclusion from the results is that if you're going to do fine-tuning, it doesn't really matter if you smush it all into the first user input, or use the recommended template. Note that the green curve continues the same number of steps as the other curves, but is invisible beyond a certain point because it becomes indistinguishable from the blue in this viz.

aladar

Oct 12, 2024

Thanks for running an experiment @WelcomeAIOverlords , huge help! Do you also have results w/ validation loss and accuracy?
