Chat prompt
What is the chat prompt? Thanks!
simple:
<s>[INST] {user_prompt} [/INST] {assistant_response} </s><s>[INST] {new_user_prompt} [/INST]
with system prompt:
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>
{user_prompt} [/INST] {assistant_response} </s><s>[INST] {new_user_prompt} [/INST]
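To make the two layouts above concrete, here is a minimal string-level sketch that assembles them from a conversation history. The function name `build_prompt` and its parameters are made up for illustration and are not part of any library. It only concatenates text; in practice `<s>`, `</s>`, `[INST]` and `[/INST]` are special tokens that the tokenizer should emit, so treat the result as a picture of the layout rather than as something to feed to the model verbatim.

```python
from typing import List, Optional, Tuple


def build_prompt(
    history: List[Tuple[str, str]],
    new_user_prompt: str,
    system_prompt: Optional[str] = None,
) -> str:
    """Assemble the prompt text following the layouts written above.

    history holds completed (user_prompt, assistant_response) pairs.
    Hypothetical helper, for illustration only.
    """
    # The pending user turn has no assistant response yet.
    turns = list(history) + [(new_user_prompt, None)]
    pieces = []
    for index, (user, assistant) in enumerate(turns):
        if index == 0 and system_prompt is not None:
            # The system block is wrapped in <<SYS>> markers inside the first [INST].
            user = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n{user}"
        pieces.append(f"<s>[INST] {user} [/INST]")
        if assistant is not None:
            pieces.append(f" {assistant} </s>")
    return "".join(pieces)


if __name__ == "__main__":
    print(build_prompt([("Hi", "Hello!")], "What is FIM?", system_prompt="Be brief."))
```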
FIM (not working? see https://huggingface.co/mistralai/Codestral-22B-v0.1/discussions/10):
<s>[SUFFIX] return sum[PREFIX]def add(
Hope it helps :)
Hi @legraphista, in the tests folder of the mistral-common repo:
https://github.com/mistralai/mistral-common/blob/ce444e276f348e03ae9bf6b6e9b73f3dde1793a2/tests/test_tokenize_v2.py#L87
Looking at the text output for the case with a system prompt, there is no <<SYS>> token. Could you please point out where the <<SYS>> and <</SYS>> tokens are added to the system prompt?
hey @vanshils
You are right. The template above was created using the HF variant, and it appears to be the v1 template, not the v3 one.
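from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Assumption: the v3 tokenizer bundled with mistral-common
tokenizer = MistralTokenizer.v3()

tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[{
        "role": "system",
        "content": "{sys prompt}"
    }, {
        "role": "user",
        "content": "{user instruct #1}"
    }])
)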
# Tokenized(
# tokens=[1, 3, 1139, 7377, 12278, 29520, 781, 781, 29519, 2606, 13085, 1190, 29508, 29520, 4],
# text='<s>[INST]▁{sys▁prompt}<0x0A><0x0A>{user▁instruct▁#1}[/INST]', prefix_ids=None)
Furthermore, the system prompt appears to be attached to the last user instruction instead of always sitting at the top of the conversation:
tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[{
        "role": "system",
        "content": "{sys prompt}"
    }, {
        "role": "user",
        "content": "{user instruct #1}"
    }, {
        "role": "assistant",
        "content": "{response #1}"
    }, {
        "role": "user",
        "content": "{user instruct #2}"
    }])
)
# Tokenized(
# tokens=[1, 3, 1139, 2606, 13085, 1190, 29508, 29520, 4, 1139, 5207, 1190, 29508, 29520, 2, 3, 1139, 7377, 12278, 29520, 781, 781, 29519, 2606, 13085, 1190, 29518, 29520, 4],
# text='<s>[INST]▁{user▁instruct▁#1}[/INST]▁{response▁#1}</s>[INST]▁{sys▁prompt}<0x0A><0x0A>{user▁instruct▁#2}[/INST]', prefix_ids=None)
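The behaviour seen in the two outputs above can be mimicked with a small pure-Python sketch. This is not how mistral-common implements it; `render_observed_layout` is a hypothetical helper that only reproduces the decoded `text` field, with `▁` written as a space and `<0x0A>` as a newline. It assumes the system prompt is joined to the last user message with two newlines, which is what the outputs show.

```python
from typing import Dict, List


def render_observed_layout(messages: List[Dict[str, str]]) -> str:
    """Reproduce the decoded text layout observed above (illustration only)."""
    # Collect system content separately; it is not rendered in its own slot.
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    chat = [m for m in messages if m["role"] != "system"]
    # Index of the last user message, or -1 if there is none.
    last_user = max(
        (i for i, m in enumerate(chat) if m["role"] == "user"), default=-1
    )

    text = "<s>"
    for i, message in enumerate(chat):
        if message["role"] == "user":
            content = message["content"]
            if i == last_user and system:
                # The system prompt is prepended to the final user instruction.
                content = f"{system}\n\n{content}"
            text += f"[INST] {content}[/INST]"
        else:
            # Assistant turns are closed with the end-of-sequence marker.
            text += f" {message['content']}</s>"
    return text


if __name__ == "__main__":
    print(render_observed_layout([
        {"role": "system", "content": "{sys prompt}"},
        {"role": "user", "content": "{user instruct #1}"},
        {"role": "assistant", "content": "{response #1}"},
        {"role": "user", "content": "{user instruct #2}"},
    ]))
```

Comparing this string with the decoded `text` of a real `Tokenized` result is a quick way to check whether a hand-written template matches the tokenizer.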
Thanks
If possible, do you know where the "▁" after "[INST]" is being added? I tried very hard to find it, but I can't make the HF tokenizer behave the same way as the Mistral tokenizer.
# Tokenized(
# tokens=[1, 3, 1139, 7377, 12278, 29520, 781, 781, 29519, 2606, 13085, 1190, 29508, 29520, 4],
# text='<s>[INST]▁{sys▁prompt}<0x0A><0x0A>{user▁instruct▁#1}[/INST]', prefix_ids=None)