Tokenizer seemingly missing FIM tokens

#3
by NyxKrage - opened

according to the mistral reference implementation the FIM tokens should be [PREFIX], [MIDDLE], and [SUFFIX] as can be seen here https://github.com/mistralai/mistral-common/blob/main/src/mistral_common/tokens/tokenizers/base.py#L20-L22

These are apparently not in the converted tokenizer

Ok, apparently this is, intended and matches up with the official implementation, I had naively assumed that these would be separate tokens like the instruct tokens

NyxKrage changed discussion status to closed

Ok, apparently this is, intended and matches up with the official implementation, I had naively assumed that these would be separate tokens like the instruct tokens

Are you saying that the tokens are default [PREFIX], [MIDDLE], and [SUFFIX], so they don't have to show up in tokenizer config?

The mistral community has converted it too and it doesn't appear on their tokenizer config:

https://huggingface.co/mistral-community/Codestral-22B-v0.1/blob/main/tokenizer_config.json

i've updated the convert script to fix the tokenizer and re-did the HF model https://huggingface.co/legraphista/Codestral-22B-v0.1-hf-FIM-fix

Thanks! I just updated the model with the fix :)

Sign up or log in to comment