v3 tokenizer

#1
by ayyylol - opened

Hi,

I just wanted to let you know that the mistral repo contains a file called:

tokenizer.model.v3

It is my understanding that this is the new tokenizer that contains the expanded vocabulary.

However, when making the GGUF, I think it needs to be renamed to tokenizer.model first, or else the convert script might ignore it.

You might already know all of this though, so feel free to ignore :)

More info about v3 here: https://docs.mistral.ai/guides/tokenization/
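For anyone who wants to try the rename suggested above without losing the original file, here is a minimal sketch that copies tokenizer.model.v3 to tokenizer.model inside a local model directory. The helper name `stage_tokenizer` and the directory layout are assumptions for illustration. Whether the convert script really ignores the .v3 file is only the suspicion raised in this post, not something the sketch verifies.

```python
import shutil
from pathlib import Path


def stage_tokenizer(model_dir, src_name="tokenizer.model.v3", dst_name="tokenizer.model"):
    """Copy the v3 tokenizer to the conventional file name (hypothetical helper).

    An existing destination file is left untouched so nothing is overwritten.
    """
    model_dir = Path(model_dir)
    src = model_dir / src_name
    dst = model_dir / dst_name
    if not src.is_file():
        raise FileNotFoundError(f"missing source tokenizer: {src}")
    if dst.exists():
        # Do not clobber a tokenizer.model that is already present.
        return dst
    shutil.copyfile(src, dst)
    return dst
```

Copying rather than renaming keeps the original .v3 file around, so the two can still be compared afterwards.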

I was concerned about how renaming it to tokenizer.model would interact with the GGUF conversion. I can try redoing the conversion with that file included to see if it works, though.

No problem. It worked for me, but I only prompted it a few times.

After my current quantization is done, I'll remake this one into a new repo with the v3 tokenizer, just in case there are differences that cause unexpected breaks.

@ayyylol

Actually, never mind. @wolfram noticed that their sha256sums are identical, so they're both the exact same file. Not sure why they uploaded the same tokenizer.model twice; hopefully that means it's the correct one already :)
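The checksum comparison mentioned above is easy to reproduce locally. Below is a rough sketch using Python's standard hashlib. The function names and the file paths in the usage note are assumptions, not anything taken from the repo.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until an empty bytes object signals EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True when both files hash to the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

With both files downloaded into the current directory, `same_file_contents("tokenizer.model", "tokenizer.model.v3")` should return True if they really are byte-for-byte identical, as reported here.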

That's great to hear! Thank you @bartowski
