v3 tokenizer

#4
by ayyylol - opened

Hi,

I just wanted to let you know that the mistral repo contains a file called:

tokenizer.model.v3

It is my understanding that this is the new tokenizer that contains the expanded vocabulary.

However, when converting to GGUF, I think it needs to be renamed to tokenizer.model first, or else it might be ignored by the convert script.
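In case it helps, here is a minimal sketch of that rename step (the filenames are from this thread; `prepare_tokenizer` is just a name I made up, and whether your convert script actually requires this is the open question above):

```python
from pathlib import Path
import shutil

def prepare_tokenizer(model_dir: str) -> Path:
    """Copy tokenizer.model.v3 to tokenizer.model so a convert
    script that looks for the default name can pick it up.
    Keeps the original .v3 file intact."""
    d = Path(model_dir)
    src = d / "tokenizer.model.v3"
    dst = d / "tokenizer.model"
    if src.exists() and not dst.exists():
        shutil.copy(src, dst)
    return dst
```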

You might already know all of this though, so feel free to ignore :)

Hi @ayyylol

Interesting! I saw that file, but I assumed it was only used by their own inference library rather than being the actual tokenizer with the extended vocabulary. Are you sure the plain tokenizer.model doesn't already contain those tokens? For example, do the GGUF models fail with the function-calling tokens?

Upon looking at this more closely, they are both identical!

37f00374dea48658ee8f5d0f21895b9bc55cb0103939607c8185bfd1c6ca1f89 tokenizer.model
37f00374dea48658ee8f5d0f21895b9bc55cb0103939607c8185bfd1c6ca1f89 tokenizer.model.v3

I am pretty confused now!
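For anyone who wants to reproduce the comparison above in Python rather than with sha256sum, a small sketch (`sha256_of` is just an illustrative helper name):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks
    so large tokenizer/model files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

If `sha256_of("tokenizer.model") == sha256_of("tokenizer.model.v3")`, the two files are byte-for-byte identical, as the digests above suggest.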

You are right, they appear to be identical. Thank you for looking into that!

Trying to install this model with PrivateGPT, I get this complaint about the tokenizer:

Downloading tokenizer mistralai/Mistral-7B-Instruct-v0.3
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers

Indeed, the tokenizer_config.json file has add_prefix_space set to true for v0.3; the attribute wasn't present in v0.2.
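You can verify this yourself with a quick check of the config file (`has_prefix_space` is just an illustrative helper name):

```python
import json

def has_prefix_space(config_path: str) -> bool:
    """Check whether a tokenizer_config.json sets add_prefix_space,
    the attribute triggering the slow-tokenizer warning above."""
    with open(config_path) as f:
        cfg = json.load(f)
    return bool(cfg.get("add_prefix_space", False))
```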

Is there anything you can do to make the v0.3 tokenizer a fast tokenizer? Thanks in advance for your help.
