Tokenizer adds space between sentence start and instruction start

#74 opened by ldavid

Is there a way to get rid of the space that is inserted between the <s> and [INST] tokens when a chat-templated prompt is tokenized and decoded again? A simple example:

from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
chat = [
    {
        "role": "user",
        "content": "You are my Python programming assistant. Write a program that generates the first 10 fibonacci numbers in Python."
    }
]

# Render the chat template to a string (this already includes <s> and [INST]).
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
print(prompt)

# Round trip: tokenize the rendered prompt (without re-adding special tokens)
# and decode it back. An extra space shows up after <s>.
print(tokenizer.decode(tokenizer(prompt, add_special_tokens=False)["input_ids"]))

This prints:

<s>[INST] You are my Python programming assistant. Write a program that generates the first 10 fibonacci numbers in Python. [/INST]
<s> [INST] You are my Python programming assistant. Write a program that generates the first 10 fibonacci numbers in Python. [/INST]
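
Inspecting the raw tokens suggests the space comes from the tokenizer's SentencePiece prefix-space handling rather than from the chat template itself. A small sketch (the exact token strings are what I would expect here and may differ across tokenizer versions):

ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
# The '▁' prefix on the token after <s> is what decodes to the extra space.
print(tokenizer.convert_ids_to_tokens(ids[:5]))
# e.g. ['<s>', '▁[', 'INST', ']', '▁You']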

I'm trying to get only the completion by subtracting my prompt from the final model output, and the extra space means the decoded prompt no longer matches the string I passed in.
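
One workaround that sidesteps the string mismatch entirely is to slice the generated sequence by token count instead of subtracting strings. A sketch, assuming the model is loaded with AutoModelForCausalLM:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_id)
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)

# Drop the prompt by position rather than by string matching, so the
# extra decoded space after <s> never matters.
completion_ids = output_ids[0, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(completion_ids, skip_special_tokens=True))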
