Possible error in tokenizer.json

#6
by sszymczyk - opened

In tokenizer.json we have:

{
  "id": 8,
  "content": "[TOOL_RESULT]",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": true,
  "special": true
},
{
  "id": 9,
  "content": "[/TOOL_RESULTS]",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": true,
  "special": true
}

Later in the file there is:

  "[TOOL_RESULTS]": 8,
  "[/TOOL_RESULTS]": 9,

So I think token number 8 should be "[TOOL_RESULTS]", not "[TOOL_RESULT]".
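
As a quick sanity check, here is a small script that compares the added_tokens entries against the vocab mapping (a minimal sketch, assuming the standard HF tokenizer.json layout with an "added_tokens" list and a "model"/"vocab" dict):

import json

with open("tokenizer.json") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]
for t in tok["added_tokens"]:
    if t["content"] not in vocab:
        print(f'added token id {t["id"]} ({t["content"]!r}) is missing from the vocab')
    elif vocab[t["content"]] != t["id"]:
        print(f'id mismatch for {t["content"]!r}: added_tokens says {t["id"]}, vocab says {vocab[t["content"]]}')

On this file it should flag id 8 ("[TOOL_RESULT]") as missing from the vocab.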

Yes, this seems to be a typo, and it's why the docs use the mistral-common tokenizer instead of the HF tokenizer:

https://github.com/mistralai/mistral-common/blob/fcf0316163433af072f3cb157664c867661cbda7/src/mistral_common/tokens/tokenizers/base.py#L16
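
For reference, a minimal sketch of tokenizing with mistral-common instead, following its README (the v3 tokenizer, which as I understand it is the one this model uses):

from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the v3 tokenizer and encode a chat request
tokenizer = MistralTokenizer.v3()
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Hello!")])
)
print(tokenized.tokens)

Since the special tokens are defined in code there, the typo in tokenizer.json doesn't affect it.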

I was trying to quantize it to 8-bit and got this error:

Writing Mixtral-8x22B-instruct-q8_0.gguf, format 7
Traceback (most recent call last):
  File "/Users/spider/Desktop/llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/Users/spider/Desktop/llama.cpp/convert.py", line 1542, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/Users/spider/Desktop/llama.cpp/convert.py", line 1207, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/Users/spider/Desktop/llama.cpp/convert.py", line 1049, in check_vocab_size
    raise ValueError(msg)
ValueError: Vocab size mismatch (model has 32768, but Mixtral-8x22B-Instruct-v0.1/tokenizer.json has 32769).

I had this error too. The typo means "[TOOL_RESULT]" appears in the added_tokens list but not in the vocab, so the converter counts one extra token. Edit tokenizer.json and correct "[TOOL_RESULT]" to "[TOOL_RESULTS]" in the token 8 definition, then repeat the conversion and quantization steps. It worked for me after this fix.
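
If you'd rather patch the file programmatically than by hand, here is a minimal sketch of the same fix (back up tokenizer.json first; the path is an example):

import json

path = "Mixtral-8x22B-Instruct-v0.1/tokenizer.json"
with open(path) as f:
    tok = json.load(f)

# Fix the typo in the added_tokens entry for id 8
for t in tok["added_tokens"]:
    if t["id"] == 8 and t["content"] == "[TOOL_RESULT]":
        t["content"] = "[TOOL_RESULTS]"

with open(path, "w") as f:
    json.dump(tok, f, indent=2, ensure_ascii=False)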

Good catch - I've opened a PR here to fix it.

It needs to be fixed on the model card as well (in the special tokens list) to avoid confusion.

Mistral AI_ org

Fixed. Thanks! Please let us know if there are other issues.

sophiamyang changed discussion status to closed
