No [PREFIX] and [SUFFIX] in tokenizer vocab

#10
by Vokturz - opened

Hi, I was trying to use the FIM feature with no success. After playing with the tokenizer MistralTokenizer.v3(), I found that both the [PREFIX] and [SUFFIX] tokens point to <unk> (id 0):

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest
tokenizer = MistralTokenizer.v3()
tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).text
>>> '<s><unk>return▁a▁+▁b<unk>▁def▁f('

tokenizer.instruct_tokenizer.tokenizer.get_control_token('[INST]')
>>> 3
tokenizer.instruct_tokenizer.tokenizer.get_control_token('[PREFIX]')
>>> 0
tokenizer.instruct_tokenizer.tokenizer.get_control_token('[SUFFIX]')
>>> 0
tokenizer.instruct_tokenizer.tokenizer._vocab[:5]
>>> ['<unk>', '<s>', '</s>', '[INST]', '[/INST]']

I found this test in the mistral/mistral-common repository:

from mistral_common.tokens.tokenizers.base import FIMRequest
from mistral_common_private.tokens.tokenizers.mistral import MistralTokenizer
tokenizer =  MistralTokenizer.v3()
tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))
assert tokenized.text == "<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f("

There must be a private tokenizer related to mistral_common_private 🤔. So the public tokenizer has no way to do FIM?

Mistral AI_ org

Great catch @Vokturz! We rushed that code from mistral/mistral-common a bit too much yesterday - it's indeed wrong!

The tokenizer will need to be updated as well - bear with me, should be done in 30min!

If you just process the generated text as shown here: https://huggingface.co/mistralai/Codestral-22B-v0.1#fill-in-the-middle-fim, it shouldn't have made a difference, but it's indeed better to have the correct tokens set for [SUFFIX] and [PREFIX].
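
For anyone wondering what "process the generated text" means in practice: the FIM recipe decodes the model output and keeps only the text before the suffix. Here is a minimal sketch, where run_model is a hypothetical stand-in for your actual inference call (e.g. mistral_inference); only the tokenizer calls are the real mistral_common API:

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest

tokenizer = MistralTokenizer.v3()

prefix = "def f("
suffix = "return a + b"
tokenized = tokenizer.encode_fim(FIMRequest(prompt=prefix, suffix=suffix))

# run_model is hypothetical: pass tokenized.tokens to whatever backend you use
# and get back the generated token ids.
out_tokens = run_model(tokenized.tokens)

# Decode and keep only the text before the suffix (in case the model echoes it):
# that part is the generated "middle".
completion = tokenizer.decode(out_tokens)
middle = completion.split(suffix)[0].strip()
print(middle)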

hey @patrickvonplaten

using the provided code:

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest
tokenizer = MistralTokenizer.v3()

print(tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).text)
print(tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).tokens)

prints

'<s><unk>return▁a▁+▁b<unk>▁def▁f('
[1, 0, 1575, 1032, 1416, 1055, 0, 1569, 1053, 29500]

By the looks of it, even the encoding is not setting the right tokens.
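
Once the updated tokenizer file and mistral_common release are out, a sanity check along these lines should pass (using only the calls already shown in this thread; the exact ids are not guaranteed, only that they should no longer collide with <unk>):

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest

tokenizer = MistralTokenizer.v3()
raw = tokenizer.instruct_tokenizer.tokenizer

prefix_id = raw.get_control_token("[PREFIX]")
suffix_id = raw.get_control_token("[SUFFIX]")

# Both should be real control tokens: distinct, and different from <unk> (id 0).
assert prefix_id != 0 and suffix_id != 0 and prefix_id != suffix_id

tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))

# Both control tokens should show up in the encoded request, and the text
# should read '<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f('.
assert prefix_id in tokenized.tokens and suffix_id in tokenized.tokens
print(tokenized.text)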

Even after the upload of the new tokenizer, is there any reason I am getting the following output when I download the latest HF commit?

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(".")
>>> tokenizer.convert_tokens_to_ids("[SUFFIX]")
0
>>> tokenizer.convert_tokens_to_ids("[PREFIX]")
0
>>> tokenizer.convert_tokens_to_ids("[INST]")
3

Because they seem to be using their own tokenizer format (tokenizer.model.v3) rather than the HF formats (tokenizer.json, etc.). Why? I don't know... it seems strange; maybe to push people to use their code and become more dependent on Mistral.
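
As a side note, if you want to check the fix from a local download without waiting for an HF-format tokenizer, mistral_common can load the shipped file directly (the path below is an assumption - point it at wherever tokenizer.model.v3 lives in your snapshot):

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the Mistral-format tokenizer file shipped in the repo instead of
# going through AutoTokenizer / tokenizer.json.
tokenizer = MistralTokenizer.from_file("./tokenizer.model.v3")

raw = tokenizer.instruct_tokenizer.tokenizer
print(raw.get_control_token("[PREFIX]"), raw.get_control_token("[SUFFIX]"))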
