v3 tokenizer

#4
by ayyylol - opened

Hi,

I just wanted to let you know that the mistral repo contains a file called:

tokenizer.model.v3

It is my understanding that this is the new tokenizer that contains the expanded vocabulary.

However, when converting to GGUF, I think it needs to be renamed to tokenizer.model first, or else it might be ignored by the convert script.
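In case it helps, here is a minimal sketch of that rename step (the filenames are from this thread; `prepare_tokenizer` is just a name I made up, and whether your convert script actually requires this is the open question above):

```python
from pathlib import Path
import shutil

def prepare_tokenizer(model_dir: str) -> Path:
    """Copy tokenizer.model.v3 to tokenizer.model so a convert
    script that looks for the default name can pick it up.
    Keeps the original .v3 file intact."""
    d = Path(model_dir)
    src = d / "tokenizer.model.v3"
    dst = d / "tokenizer.model"
    if src.exists() and not dst.exists():
        shutil.copy(src, dst)
    return dst
```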

You might already know all of this though, so feel free to ignore :)

Hi @ayyylol

Interesting! I saw that file, but I assumed it was only used by their own inference library rather than being the actual tokenizer with the extended vocabulary. Are you sure the plain tokenizer.model doesn't already contain those tokens? For example, do the GGUF models fail with the function-calling tokens?

Upon looking at this more closely, they are both identical!

37f00374dea48658ee8f5d0f21895b9bc55cb0103939607c8185bfd1c6ca1f89 tokenizer.model
37f00374dea48658ee8f5d0f21895b9bc55cb0103939607c8185bfd1c6ca1f89 tokenizer.model.v3

I am pretty confused now!
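For anyone who wants to reproduce the comparison above in Python rather than with sha256sum, a small sketch (`sha256_of` is just an illustrative helper name):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks
    so large tokenizer/model files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

If `sha256_of("tokenizer.model") == sha256_of("tokenizer.model.v3")`, the two files are byte-for-byte identical, as the digests above suggest.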

You are right, they appear to be identical. Thank you for looking into that!

Trying to install this model with PrivateGPT, I get this complaint about the tokenizer:

Downloading tokenizer mistralai/Mistral-7B-Instruct-v0.3
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers

Indeed, the tokenizer_config.json file has add_prefix_space set to true for v0.3; the attribute wasn't present in v0.2.
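You can verify this yourself with a quick check of the config file (`has_prefix_space` is just an illustrative helper name):

```python
import json

def has_prefix_space(config_path: str) -> bool:
    """Check whether a tokenizer_config.json sets add_prefix_space,
    the attribute triggering the slow-tokenizer warning above."""
    with open(config_path) as f:
        cfg = json.load(f)
    return bool(cfg.get("add_prefix_space", False))
```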

Is there anything you can do to make the v0.3 tokenizer a fast tokenizer? Thanks in advance for your help.
