Possible error in tokenizer.json

#6
by sszymczyk - opened

In tokenizer.json we have:

{
  "id": 8,
  "content": "[TOOL_RESULT]",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": true,
  "special": true
},
{
  "id": 9,
  "content": "[/TOOL_RESULTS]",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": true,
  "special": true
}

Later in the file there is:

  "[TOOL_RESULTS]": 8,
  "[/TOOL_RESULTS]": 9,

So I think token number 8 should be "[TOOL_RESULTS]", not "[TOOL_RESULT]".
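
As a quick sanity check, here is a small script that compares the added_tokens entries against the vocab mapping (a minimal sketch, assuming the standard HF tokenizer.json layout with an "added_tokens" list and a "model"/"vocab" dict):

import json

with open("tokenizer.json") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]
for t in tok["added_tokens"]:
    if t["content"] not in vocab:
        print(f'added token id {t["id"]} ({t["content"]!r}) is missing from the vocab')
    elif vocab[t["content"]] != t["id"]:
        print(f'id mismatch for {t["content"]!r}: added_tokens says {t["id"]}, vocab says {vocab[t["content"]]}')

On this file it should flag id 8 ("[TOOL_RESULT]") as missing from the vocab.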

Yes, this seems to be a typo, and it's why the docs use the mistral-common tokenizer instead of the HF tokenizer:

https://github.com/mistralai/mistral-common/blob/fcf0316163433af072f3cb157664c867661cbda7/src/mistral_common/tokens/tokenizers/base.py#L16
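
For reference, a minimal sketch of tokenizing with mistral-common instead, following its README (the v3 tokenizer, which as I understand it is the one this model uses):

from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the v3 tokenizer and encode a chat request
tokenizer = MistralTokenizer.v3()
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Hello!")])
)
print(tokenized.tokens)

Since the special tokens are defined in code there, the typo in tokenizer.json doesn't affect it.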

I was trying to quantize it to 8-bit and got this error:

Writing Mixtral-8x22B-instruct-q8_0.gguf, format 7
Traceback (most recent call last):
  File "/Users/spider/Desktop/llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/Users/spider/Desktop/llama.cpp/convert.py", line 1542, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/Users/spider/Desktop/llama.cpp/convert.py", line 1207, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/Users/spider/Desktop/llama.cpp/convert.py", line 1049, in check_vocab_size
    raise ValueError(msg)
ValueError: Vocab size mismatch (model has 32768, but Mixtral-8x22B-Instruct-v0.1/tokenizer.json has 32769).

I had this error too. The typo means "[TOOL_RESULT]" appears in the added_tokens list but not in the vocab, so the converter counts one extra token. Edit tokenizer.json and correct "[TOOL_RESULT]" to "[TOOL_RESULTS]" in the token 8 definition, then repeat the conversion and quantization steps. It worked for me after this fix.
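
If you'd rather patch the file programmatically than by hand, here is a minimal sketch of the same fix (back up tokenizer.json first; the path is an example):

import json

path = "Mixtral-8x22B-Instruct-v0.1/tokenizer.json"
with open(path) as f:
    tok = json.load(f)

# Fix the typo in the added_tokens entry for id 8
for t in tok["added_tokens"]:
    if t["id"] == 8 and t["content"] == "[TOOL_RESULT]":
        t["content"] = "[TOOL_RESULTS]"

with open(path, "w") as f:
    json.dump(tok, f, indent=2, ensure_ascii=False)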

Good catch - I've opened a PR here to fix it.

It needs to be fixed on the model card as well (in the special tokens list) to avoid confusion.

Mistral AI_ org

Fixed. Thanks! Please let us know if there are other issues.

sophiamyang changed discussion status to closed
