Chat prompt

#11
by apepkuss79 - opened

What is the chat prompt? Thanks!

simple:

<s>[INST] {user_prompt} [/INST] {assistant_response} </s><s>[INST] {new_user_prompt} [/INST] 

with system prompt:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt} [/INST] {assistant_response} </s><s>[INST] {new_user_prompt} [/INST] 
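
For anyone assembling these strings by hand, here is a minimal sketch that follows the two templates exactly as written above. `build_prompt` and its parameters are hypothetical names for illustration, not part of any library, and the result is only a plain string: it still has to be tokenized so that the markers map to the right special tokens.

```python
from typing import List, Optional, Tuple


def build_prompt(
    history: List[Tuple[str, str]],
    new_user_prompt: str,
    system_prompt: Optional[str] = None,
) -> str:
    """Assemble a prompt string following the templates written above.

    history holds completed (user_prompt, assistant_response) turns.
    Hypothetical helper for illustration only.
    """
    parts = []
    # The new user prompt is the final turn and has no response yet.
    turns = list(history) + [(new_user_prompt, None)]
    for i, (user, assistant) in enumerate(turns):
        # The system block is wrapped in <<SYS>> tags inside the first [INST].
        if i == 0 and system_prompt is not None:
            user = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user}"
        parts.append(f"<s>[INST] {user} [/INST] ")
        if assistant is not None:
            parts.append(f"{assistant} </s>")
    return "".join(parts)


print(build_prompt([("Hi", "Hello!")], "How are you?"))
```

Leaving `system_prompt` as `None` gives the simple variant; passing a string gives the variant with the `<<SYS>>` block.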

FIM (not working? see https://huggingface.co/mistralai/Codestral-22B-v0.1/discussions/10):

<s>[SUFFIX]    return sum[PREFIX]def add(
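
The FIM layout above puts the suffix first and the prefix last. A minimal sketch of that string, assuming the layout shown is the intended one (the linked discussion suggests it may not work as-is); `build_fim_prompt` is a hypothetical helper name:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Build a fill-in-the-middle prompt string in the layout shown above."""
    # Suffix comes first, then the prefix that the model should continue.
    return f"<s>[SUFFIX]{suffix}[PREFIX]{prefix}"


print(build_fim_prompt(prefix="def add(", suffix="    return sum"))
```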

Hope it helps :)

apepkuss79 changed discussion status to closed

Hi @legraphista, in the tests folder of the mistral-common repo:
https://github.com/mistralai/mistral-common/blob/ce444e276f348e03ae9bf6b6e9b73f3dde1793a2/tests/test_tokenize_v2.py#L87

When you look at the output text with a system prompt, there is no SYS token. Could you please point out where the SYS and /SYS tokens are appended to the system prompt?

hey @vanshils

You are right: the template above was created using the HF variant, and it appears to be the v1 template, not the v3 one. Encoding the same kind of request with mistral-common gives:

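from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Assumption: the v3 tokenizer from mistral-common; the original post does not show this setup.
tokenizer = MistralTokenizer.v3()

# One system message followed by a single user message.
tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[{
        "role": "system",
        "content": "{sys prompt}"
    }, {
        "role": "user",
        "content": "{user instruct #1}"
    }])
)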
# Tokenized(
#  tokens=[1, 3, 1139, 7377, 12278, 29520, 781, 781, 29519, 2606, 13085, 1190, 29508, 29520, 4], 
#  text='<s>[INST]▁{sys▁prompt}<0x0A><0x0A>{user▁instruct▁#1}[/INST]', prefix_ids=None)

Furthermore, the system prompt appears to be prepended to the last user instruction instead of always staying at the top:

tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[{
        "role": "system",
        "content": "{sys prompt}"
    }, {
        "role": "user",
        "content": "{user instruct #1}"
    }, {
        "role": "assistant",
        "content": "{response #1}"
    }, {
        "role": "user",
        "content": "{user instruct #2}"
    }])
)
# Tokenized(
#  tokens=[1, 3, 1139, 2606, 13085, 1190, 29508, 29520, 4, 1139, 5207, 1190, 29508, 29520, 2, 3, 1139, 7377, 12278, 29520, 781, 781, 29519, 2606, 13085, 1190, 29518, 29520, 4], 
#  text='<s>[INST]▁{user▁instruct▁#1}[/INST]▁{response▁#1}</s>[INST]▁{sys▁prompt}<0x0A><0x0A>{user▁instruct▁#2}[/INST]', prefix_ids=None)
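
To make the observed layout easier to compare against other templates, here is a minimal sketch that reproduces the decoded text of the two outputs above as a plain string, with `▁` written as a space and `<0x0A>` as a newline. `render_observed_layout` is a hypothetical helper inferred only from these two outputs; it is not how mistral-common implements it and it returns text, not token ids.

```python
from typing import Dict, List


def render_observed_layout(messages: List[Dict[str, str]]) -> str:
    """Mimic the decoded text shown in the outputs above (hypothetical helper).

    Assumes at most one system message and at least one user message.
    """
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    chat = [m for m in messages if m["role"] != "system"]
    # Index of the last user message, which is where the system prompt lands.
    last_user = max(i for i, m in enumerate(chat) if m["role"] == "user")

    out = "<s>"
    for i, m in enumerate(chat):
        if m["role"] == "user":
            content = m["content"]
            if system and i == last_user:
                # System prompt and user text are separated by two newlines.
                content = f"{system}\n\n{content}"
            out += f"[INST] {content}[/INST]"
        else:
            # Assistant turn: leading space, closed with </s>.
            out += f" {m['content']}</s>"
    return out


print(render_observed_layout([
    {"role": "system", "content": "{sys prompt}"},
    {"role": "user", "content": "{user instruct #1}"},
    {"role": "assistant", "content": "{response #1}"},
    {"role": "user", "content": "{user instruct #2}"},
]))
```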

Thanks!
If possible, do you know where the "▁" after "[INST]" gets appended? I tried very hard to find it, but I can't make the HF tokenizer behave the same way as the Mistral tokenizer.

# Tokenized(
#  tokens=[1, 3, 1139, 7377, 12278, 29520, 781, 781, 29519, 2606, 13085, 1190, 29508, 29520, 4], 
#  text='<s>[INST]▁{sys▁prompt}<0x0A><0x0A>{user▁instruct▁#1}[/INST]', prefix_ids=None)
