Instruction Template during inference?

#6 opened by jf-11

Hi,
I am a bit confused regarding the instruction template.
It is stated that the template has to be followed strictly. Does this also hold for inference? In the "Run the model" section, the template is not used with the .generate function.

Thanks for clarifying.

@jf-11 It is best to follow the recommended format during inference in order to obtain better outcomes.

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
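
If you want to assemble that string by hand, a minimal sketch could look like this (the build_prompt helper is purely illustrative, and the exact whitespace may differ slightly from what the tokenizer's chat template produces):

def build_prompt(turns, next_instruction):
    # turns: list of (instruction, answer) pairs that are already completed
    prompt = "<s>"
    for instruction, answer in turns:
        prompt += f" [INST] {instruction} [/INST] {answer}</s>"
    # leave the new instruction open so the model generates the next answer
    return prompt + f" [INST] {next_instruction} [/INST]"

build_prompt([("What is your favourite condiment?", "Lemon juice.")], "Do you have mayonnaise recipes?")
# '<s> [INST] What is your favourite condiment? [/INST] Lemon juice.</s> [INST] Do you have mayonnaise recipes? [/INST]'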

Or you could simply use apply_chat_template:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")  # loaded here so the model.generate call below works

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

message_formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(message_formatted)
# <s>[INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s>[INST] Do you have mayonnaise recipes? [/INST]

However, if you don't, it can still work. They just gave a simple example.

I see. Thank you!
So should I also use the [INST] and [/INST] literals when calling .generate()? And does tokenizer(text, return_tensors="pt") add special tokens by default, or would it be better to also put <s> at the beginning myself?

@jf-11 No, we have to include them ourselves. You can conveniently use apply_chat_template for this purpose. As the conversation history grows, it's advisable to keep managing it with apply_chat_template. Afterward, you can proceed with the following code:

inputs = tokenizer(message_formatted, return_tensors="pt", add_special_tokens=False)  # the template string already contains <s>

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) 
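
To keep the conversation going, one option (a sketch only; the follow-up question and variable names are illustrative and assume the model and messages objects from above) is to append the decoded reply back into messages and re-apply the chat template before the next generate call:

# decode only the newly generated tokens as the assistant's reply
reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "And how about a vinaigrette?"})

# apply_chat_template can also tokenize directly and return tensors for generate()
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))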

Follow-up question regarding inference with common end-user backends/frontends like KoboldCpp/SillyTavern:

After inference generates <s> [INST] Instruction [/INST] Model answer</s>, the stop token is usually caught to end inference, and special tokens are then removed from the output before it is shown in the frontend. After the user replies, the next context looks like this:

<s> [INST] Instruction [/INST] Model answer [INST] Follow-up instruction [/INST] - so previously generated stop tokens are gone. Is that OK or would that be a problem?

Also, as constructive feedback, I'd have loved to see a system prompt. Especially if we were to implement something like https://docs.mistral.ai/platform/guardrailing/ - it would always be a good thing for the model to be able to differentiate between user and system prompts. But if you do implement one, please don't do it as part of the first INST like Llama 2 Chat's format did, which causes lots of problems because of its terrible design (I'd gladly elaborate if you'd like).

Edited: Using the default prompt from llama.cpp solved the problem for me. The prompt from the model page didn't work in my case; maybe I did something wrong, I don't know.

I'm using this: [INST] Instruction [/INST] Model answer [INST] Follow-up instruction [/INST]
But the model doesn't answer my question. It creates its own question and then answers that question. I wonder what I did wrong? Here's an example:

User: what can you do?

Llama: Sure, I'll do my best to provide a model answer for your question. Please provide me with the question or topic you would like me to write about.

User: can you translate

Llama: Sure, I'll do my best to provide a model answer for your question. Please keep in mind that the quality of the response may depend on the complexity and specifics of the question you have asked. Also, note that while I strive to provide accurate information, it is always a good idea to double-check any facts or figures with additional sources. With this said, please let me know what your follow-up instruction is.

User: write a short poem

Llama: Sure, I'll do my best to provide a model answer for your question. Please allow me a moment to understand the context and what is being asked.

If you could provide me with more information about the topic, prompt, or specific language point that you would like me to address in the model answer, it would be helpful. Otherwise, I will provide a general example of how to structure an effective response.

Here's a possible model answer:

Prompt: How has technology impacted education?

Model Answer:

Technology has significantly transformed...

@wolfram I have seen one example of a system prompt from deepinfra.com:

[INST] <<SYS>>
your system prompt goes here
<</SYS>>

first question [/INST] ...

It is not from the official documentation, though.
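
If you want to try that layout, a minimal sketch (unofficial; the helper name and system prompt are just for illustration) would be to fold the system prompt into the first [INST] block, Llama 2 Chat style, and keep later turns in the plain Mixtral layout:

def prompt_with_system(system_prompt, first_question):
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{first_question} [/INST]"
    )

prompt_with_system("You are a concise assistant.", "What can you do?")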

@Cameron-Chen Yeah, that's Llama 2 Chat. I'm using just that with Mixtral, too, and it works well enough.

I also did an extensive test of Mixtral 8x7B Instruct with 17 different instruct templates:

https://www.reddit.com/r/LocalLLaMA/comments/18ljvxb/llm_prompt_format_comparisontest_mixtral_8x7b/

Probably of interest to Mistral, too, if they haven't seen it yet. Especially my explanation of why the Llama 2 Chat format is such a terrible choice and should be replaced with something more flexible and future-proof like the ChatML format.
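
For reference, and purely to illustrate the comparison (Mixtral Instruct was not trained on this format), a ChatML-style conversation looks roughly like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is your favourite condiment?<|im_end|>
<|im_start|>assistant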
