v3 tokenizer

by ayyylol - opened May 23

May 23

Hi,

I just wanted to let you know that the mistral repo contains a file called:

tokenizer.model.v3

It is my understanding that this is the new tokenizer that contains the expanded vocabulary.

However, when making the GGUF, I think it first needs to be renamed to tokenizer.model, or else the convert script might ignore it.
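A minimal sketch of that step, assuming the convert script only looks for a file named tokenizer.model (my guess, not something I have confirmed). The helper name `ensure_tokenizer_model` and the directory path are hypothetical. It copies instead of renaming so the original v3 file stays in place, and it does not overwrite a tokenizer.model that already exists.

```python
import shutil
from pathlib import Path


def ensure_tokenizer_model(model_dir,
                           source_name="tokenizer.model.v3",
                           target_name="tokenizer.model"):
    """Copy the v3 tokenizer to the plain file name if that name is missing.

    Hypothetical helper: the assumption that the converter only reads
    `tokenizer.model` comes from this thread, not from the converter's docs.
    """
    model_dir = Path(model_dir)
    source = model_dir / source_name
    target = model_dir / target_name

    if not source.exists():
        raise FileNotFoundError(f"missing {source}")

    # Leave an existing tokenizer.model untouched rather than overwrite it.
    if not target.exists():
        shutil.copyfile(source, target)
    return target
```

Usage would be something like `ensure_tokenizer_model("path/to/model_dir")` before running the conversion.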

You might already know all of this though, so feel free to ignore :)

More info about v3 here: https://docs.mistral.ai/guides/tokenization/

bartowski

Owner May 23

I was concerned about how renaming it to tokenizer.model would interact with GGUF. I can try redoing the conversion with it included to see if that works, though.

ayyylol

May 23

No problem, it worked for me but I only prompted it a few times.

bartowski

Owner May 23

After my current quantization is done, I'll remake this one into a new repo with the v3 tokenizer, just in case there are differences that cause unexpected breaks.

bartowski

Owner May 23

@ayyylol

Actually, never mind: @wolfram noticed that their sha256sums are identical, so they're both the exact same file. Not sure why they uploaded the same tokenizer.model twice; hopefully that means it's the correct one already :)
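For anyone who wants to repeat the check locally, here is a rough sketch that compares the SHA-256 digests of two files using Python's standard `hashlib`. The function names and the file paths in the usage note are placeholders of mine, not anything from the repo.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns empty bytes at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling `same_contents("tokenizer.model", "tokenizer.model.v3")` in the downloaded model directory should return `True` if the two uploads really are byte-identical.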

ayyylol

May 23

That's great to hear! Thank you @bartowski
