Phi-3 gives multiple responses with <|end|><|assistant|> as separator

Opened by hugandfesse

Hello everyone,

Compared to Phi-2, when using the new prompt template (<|user|>\nQuestion <|end|>\n<|assistant|>) with llama.cpp, it returns multiple different answers with "<|end|><|assistant|>" between them:

<|user|>\nThere are three people in a room. Someone then enters the room. And another leaves. How many people are there in the room?\n<|end|>\n<|assistant|> There are still three people in the room. The number of people hasn't changed due to entering or leaving. The statement might be misleading, but logically, it's still three people in the room.<|end|><|assistant|> However, if we strictly follow the instruction as written (ignoring the actual count), then based on the scenario described, there are four people in the room after someone enters: the original three people plus the new person who enters. The person who left is no longer in the room, so we only count the remaining three plus the newcomer, which equals four.<|end|><|assistant|> The answer can depend on the interpretation. If we are only counting the actual number of people currently in the room, there are still three people. If we consider the people involved in the action described, then there are four people (three originals plus the newcomer). However, since the instruction doesn't ask for a count of people involved in the action but rather the current count in the room, the answer remains three people in the room.<|end|><|assistant|> Based on the scenario, if we are considering the number of people present in the room at any given moment after the described actions, then the answer would be four. Here is the breakdown:

  1. Initially, there are three people in the room.
  2. One person enters, making it four people.
  3. Another person leaves, leaving us back to three people in the room.

However, the question seems to be asking for the number of people currently in the room after the changes described, which would be four, considering both the original three and the person who entered.<|end|><|assistant|> When considering the scenario and the actions described, there are still three people in the room after one person enters and another person leaves. The actions of entering and leaving don't change the number of people actually present in the room at that moment. The instruction might be leading to a different interpretation, but based on the given information and standard logical reasoning, we would conclude that there are three people in the room.<|end|><|assistant|> To clarify, the question asks about the number of people in the room after someone enters and another leaves, not the total number of people who have been in or are associated with the room. Therefore, if we consider only those present in the room after the described actions, the answer remains three people.

  1. Initially, there are three people in the room.
  2. One person enters, making the total four people.
  3. One person leaves, returning the total number of people in the room to three.

Thus, after these events, the number of people present in the room is three.<|end|><|assistant|> Given the sequence of events, we can determine the number of people in the room at the moment described. Here's the sequence:

  1. We start with three people in the room.
  2. Then, one person enters, increasing the count to four people.
  3. Subsequently, one person leaves, reducing the count back to three people.

Therefore [message cut here, but it continues much longer with many more answers, sometimes never stopping]

I have tried different types of prompt templating with no luck.

Do you see the same behaviour?

I get this too.

It's also profoundly disobedient about instructions on how to format things or not to write any additional text beyond what it was told to write - it insists on elaborating and explaining itself, even when it's been explicitly told not to. That makes it useless for inserting into automated pipelines. Either the finetune should be improved with examples of following strict instructions, or the elaboration / explanations should fall under a separate tag so they can be filtered out.

Exactly. What I don't understand now is why it does not respect the end tag.

ggerganov released a llama.cpp version with Phi-3 support:
https://github.com/ggerganov/llama.cpp/releases/tag/b2717

You can easily convert the original files from here to GGUF:
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
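
If it helps, this is roughly how I run the conversion (just a sketch; the paths are examples, point them at your own llama.cpp checkout and the downloaded model folder):

```python
# Sketch: run llama.cpp's conversion script on the original HF checkpoint.
# Paths are examples only -- adjust to your own setup.
import subprocess

LLAMA_CPP_DIR = "llama.cpp"            # a checkout at b2717 or newer
MODEL_DIR = "Phi-3-mini-4k-instruct"   # the files downloaded from the repo above

# The resulting .gguf should end up next to the model files (use --outfile to change that).
subprocess.run(
    ["python", f"{LLAMA_CPP_DIR}/convert-hf-to-gguf.py", MODEL_DIR],
    check=True,
)
```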

Try this template (I like this one):
<|user|> {User} <|end|><|assistant|> {Assistant}
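
And roughly how I drive it from Python, in case that helps (a sketch assuming llama-cpp-python built on top of a llama.cpp with Phi-3 support; the GGUF file name is just an example):

```python
# Sketch: one-line template (no newlines) with <|end|> as a stop string.
from llama_cpp import Llama

llm = Llama(model_path="phi-3-mini-4k-instruct-f16.gguf", n_ctx=4096)

def ask(user_msg: str) -> str:
    prompt = f"<|user|> {user_msg} <|end|><|assistant|>"
    out = llm(prompt, max_tokens=256, stop=["<|end|>"])
    return out["choices"][0]["text"].strip()

print(ask("What are generators in Python?"))
```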

LM Studio (v0.2.21) screenshots with 16-bit Phi-3 mini 4k (with my template above):
[Screenshot: Phi-3 - What are generators in Python]
[Screenshot: Phi-3 JSON - What are generators in Python]

Pulled the latest version and compiled it. Works like a charm now.

@MisterBisquits and @hugandfesse, is there a place where the fixed GGUFs exist? I just downloaded every uploaded version on HF, including this official one from Microsoft, and all ~10 talk past the end token in every app I tried, including GPT4All and koboldcpp.

Also, are you sure it's fixed after compiling with the latest update? People have made this claim before, and it still sometimes talks past the end token and sometimes doesn't. I even saw a video of someone using Microsoft Phi-3 on Microsoft Azure, and it was still talking past the end token.

My gut is telling me that even when the prompt template is adhered to perfectly, Phi-3 will still periodically talk past the end token, because Phi-3 has a tendency to suddenly go off in random directions while ignoring the system and user prompts. And I think this includes the end token, because sometimes it will work, while other times it will be displayed and then ignored, followed by a sharp tangent from the response that came before it.

Edit: Also, it's not just about the end token. Often, before the end token, it will show what should be behind-the-scenes tagged data: helper notes, examples, instructions and so on. I've tested about 100 LLMs and I've never seen anything like this before. The fine-tuning and alignment are overpowering everything, including the prompt template, the system prompt, and even the user prompt.

@Phil337 ,

The current page contains GGUF files converted before ggerganov released his llama.cpp with Phi-3 support, so I am not really sure what they actually are.

I cannot speak for Microsoft, but there is an inconsistency here between the actual training data and the chat formats being used. The only simple fix I know of is to use the template I showed above (it avoids newline characters); I had no problems with it.

I am starting to think that some of the information on some HF pages is not really official, not complete, or intentionally disrupted.

PS: I am afraid there is no 100% reliable source of information anywhere anymore.

@MisterBisquits Thanks, perhaps it will just take a week or two for Phi-3 to become usable for layman LLM users like me. A couple of quantized versions of Phi-3 Instruct fail to load in GPT4All and koboldcpp, so perhaps they were made with the new llama.cpp and will work with future versions of the apps.

For now I tried your prompt with several working versions of Phi-3 Instruct and it sometimes works (stops after giving an answer), although it always displays the end tag.

However, even with your provided prompt template it will periodically start showing behind the scenes info like explanations, teacher, examples... For example, my prompt was "In a single sentence, what is the capital of Brazil?" and it gave the answer, followed by an explanation. Sometimes it will keep responding to itself, and before long it's talking about France, electromagnetism or some other random tangent.

"The capital of Brazil is Brasília.

Explanation: The task was to provide the name of the capital city of Brazil in one concise sentence. The answer directly addresses this by stating the capital's name without additional information."

@Phil337 GPT4All already has a new version with Llama-3 and Phi-3 support. I didn't test it much though.

https://github.com/nomic-ai/gpt4all/releases/tag/v2.7.4

@MisterBisquits Thanks, that's what I'm using. Apparently none of the currently available Phi-3 GGUFs work with it yet. However, I'm testing the newly released Cinder-llamafied and it appears to be respecting the prompt format.

> However, even with your provided prompt template it will periodically start showing behind the scenes info like explanations, teacher, examples... For example, my prompt was "In a single sentence, what is the capital of Brazil?" and it gave the answer, followed by an explanation. Sometimes it will keep responding to itself, and before long it's talking about France, electromagnetism or some other random tangent.
>
> "The capital of Brazil is Brasília.
>
> Explanation: The task was to provide the name of the capital city of Brazil in one concise sentence. The answer directly addresses this by stating the capital's name without additional information."

I hate so much how it does that all the time.

@MisterBisquits I tried a couple dozen prompts with Phi-3-mini-4k-instruct-Cinder-with-16bit-GGUF, GPT4All v2.7.4 and your prompt template. Thanks for your help. You appear to be correct. GGUFs made with the new Llama.cpp and running in v2.7.4 are working correctly.

This thread is the closest I've got to a working Phi 3, where working = ends after a message.

Unfortunately, it doesn't work :(

1. Cinder does work, but the Cinder model seems trained on additional data, using a different template; e.g. its end-of-sequence token is <calc>, which isn't a thing in Phi-3.
2. The llama.cpp update added support for converting Phi-3 non-GGUF to GGUF, via e.g. python llama.cpp/convert-hf-to-gguf.py ~/dev/Phi-3-mini-4k-instruct
3. The llama.cpp update does not have anything else that would make it support Phi-3 better.
4. llama.cpp also recently added a method for checking whether a token is end-of-generation: llama_token_is_eog(model, token_id). Before, you'd use: token_id_i_just_got == llama_token_eos(model). Neither method helps.
5. None of the chat templates ever emits a token ID of 32000 or above (EDIT: modulo 32001 for <|assistant|>):
-- A) the chat template in the docs in these repos
-- B) the chat template in code (which uses a bos_token of <s>, i.e. adds <s> at the beginning of the chat) (EDIT: this seems correct given this update to README.md on the non-GGUF repo. With that change, about 1 in 5 times I get 32001, <|assistant|>)
-- C) the one-line template mentioned above by MisterBisquits

If you need to hack something together, "\n\n<|assistant/user/system|>" seems to consistently mark the end of a response if you use the chat template from the docs in these repos. I haven't tested it heavily, because the situation seems bad enough that A) I should wait for a fix or B) we need finetuned models.
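
For example, here's a minimal post-processing sketch of what I mean (plain Python; I'm reading "<|assistant/user/system|>" as shorthand for any of the three role tags, which is an assumption on my part):

```python
# Sketch: truncate generated text at the first "\n\n" + role tag, since the model
# keeps talking past <|end|>. The marker list reflects my reading of the shorthand above.
ROLE_MARKERS = ["\n\n<|assistant|>", "\n\n<|user|>", "\n\n<|system|>"]

def truncate_response(text: str) -> str:
    cut = len(text)
    for marker in ROLE_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

print(truncate_response("Brasília is the capital.\n\n<|assistant|> Explanation: ..."))
```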

Thanks @jpohhhh putting in exactly what you said at the end of the prompt template, including the two leading spaces, worked. This was driving me nuts. Thanks again.

"<|end|>\n\n<|assistant/user/system|>"

I've updated llama.cpp, and I've tried ending the prompt with "<|end|>\n\n<|assistant/user/system|>", but I'm still getting run-on chats, sometimes starting with ==response==, other times with Instruction: followed by SYSTEM and USER.

@Phil337 which two leading spaces are you referring to?

@nooneofconsequence Microsoft did some weird ass shit with phi-3.

Firstly, the following prompt template only works on the Phi-3-mini-4k-instruct.Q4_0.gguf released by GPT4All, and when using GPT4All (that's all I tested; it might work elsewhere). I think the llamafied version here on HF will also work. But note that both versions aren't as good as the original, so I wouldn't suggest using them. They periodically produce errors like two words fused together, as well as weird story contradictions that the original version released by Microsoft does not.

Secondly, the issue seems to be less about end tokens and more about Microsoft not following the standard system, user and assistant prompt template. Instead, they're randomly injecting one or more placeholder prompts after the user prompt, such as "<|placeholder1|>", in order to give guidance on how to respond (e.g. give plenty of details) and also to enforce alignment (e.g. don't use naughty words). And you don't know whether one or more of them are going to be used, hence the confusion.

"<|system|>
You are a helpful assistant.<|end|>"

"<|user|>
%1<|end|>

<|assistant|>
%2<|end|>

<|assistant/user/system|><|end|><|end|><|end|>"
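
If you want to see those placeholder tokens for yourself, dumping the tokenizer's added special tokens shows them; here's a quick sketch (assuming the transformers library; it loads the tokenizer from the original Microsoft repo):

```python
# Sketch: list the added/special tokens the Phi-3 tokenizer defines,
# which includes the <|placeholder...|> tokens mentioned above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
for token_id, token in sorted(tok.added_tokens_decoder.items()):
    print(token_id, token.content)
```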
