Align tokenizer with mistral-common

#141
by Rocketknight1 - opened
No description provided.

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

mistral_tok = MistralTokenizer.v1()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", revision="pr/120")

hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

# The token IDs should match exactly; for the text comparison, the SentencePiece
# whitespace marker and the newline byte-fallback token are normalized first.
print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
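If the text comparison comes back False, it can help to decode both token sequences back to strings and inspect them side by side. A minimal sketch, reusing the objects above (MistralTokenizer exposes a decode method, as used in the mistral_inference examples):

# Decode both token sequences; if the token IDs match, the decoded strings should too.
print(hf_tokenizer.decode(hf_tokens))
print(mistral_tok.decode(mistral_tokens))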

Thanks for the code snippet. Did you try it on a number of chat prompts to see if the two tokenizers' results are the same?
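For example, the same comparison could be looped over several conversations. A rough sketch reusing the tokenizers from the snippet above (the extra chats here are just made-up test cases):

# Hypothetical extra test cases; any chats in the same message format work.
test_chats = [
    [{"role": "user", "content": "Hello"}],
    [
        {"role": "user", "content": "Write a haiku"},
        {"role": "assistant", "content": "Sure, here it is."},
        {"role": "user", "content": "Another one, please"},
    ],
]

for test_chat in test_chats:
    hf_tokens = hf_tokenizer.apply_chat_template(test_chat, tokenize=True)
    mistral_tokens = mistral_tok.encode_chat_completion(
        ChatCompletionRequest(messages=test_chat)
    ).tokens
    assert hf_tokens == mistral_tokens, f"Token mismatch for chat: {test_chat}"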

patrickvonplaten changed pull request status to merged

Does this code mean that MistralTokenizer and AutoTokenizer tokenize the text in exactly the same way? I'm asking because the encoded chat texts are different, but the tokens are the same for both tokenizers. If so, it seems unnecessary to use mistral_inference.generate for model inference, since inference is also based on tokens. So I guess there is no need to use mistral_common and mistral_inference, right?
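If the token IDs really do match, then feeding the Hugging Face tokens straight into transformers generation should give the model the same input. A rough sketch of what I mean, reusing hf_tokenizer and chat from the snippet above (the checkpoint and generation settings are just placeholders, and this assumes enough memory to load the model):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# The chat template ends the prompt with [/INST] after the last user message,
# so the encoded IDs can be passed straight to generate().
input_ids = hf_tokenizer.apply_chat_template(chat, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=64)

# Decode only the newly generated tokens.
print(hf_tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))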
