Issue with the tokenizer

#1
by intervitens - opened

This model causes a crash when the input includes "<|im_end|>" or "<|im_start|>" tokens.

  1. The Nous Mixtral model added two new tokens for the ChatML format, expanding the model's vocab_size and the input embeddings tensor dimension by two, from 32000 to 32002.
  2. This model copied the tokenizer files from the Nous model, but the actual input embeddings are still the vanilla size, so when you use those tokens in the prompt you get an "index out of range" error (see the first snippet below).
     To fix it, replace the tokenizer.json and tokenizer_config.json files with the ones from the base Mixtral and delete the added_tokens.json file (see the second snippet below).
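
A minimal sketch of the mismatch described above; the repo path is a placeholder for a local checkout of this model and is my assumption, not a name from the thread:

```python
# Sketch only: load the copied tokenizer and the model config side by side
# and compare their vocabulary sizes. "path/to/this-model" is a placeholder.
from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("path/to/this-model")
config = AutoConfig.from_pretrained("path/to/this-model")

print(len(tokenizer))     # 32002: includes the two ChatML tokens copied from the Nous tokenizer
print(config.vocab_size)  # 32000: the embedding matrix was never resized to match

# Any prompt that uses the extra tokens produces ids 32000/32001, which index
# past the end of the 32000-row embedding table and trigger the
# "index out of range" error during the embedding lookup.
ids = tokenizer("<|im_start|>user\nhi<|im_end|>")["input_ids"]
print(max(ids))           # >= 32000 with the broken tokenizer files
```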
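
And a sketch of the suggested fix; the thread only says "base Mixtral", so the base repo id used here is an assumption:

```python
# Sketch only: overwrite the copied tokenizer files with the base Mixtral ones
# and drop added_tokens.json. The paths and the base repo id are assumptions.
import os
import shutil
from huggingface_hub import hf_hub_download

model_dir = "path/to/this-model"           # local checkout of this repo
base_repo = "mistralai/Mixtral-8x7B-v0.1"  # assumed base Mixtral tokenizer source

for fname in ("tokenizer.json", "tokenizer_config.json"):
    src = hf_hub_download(base_repo, fname)
    shutil.copy(src, os.path.join(model_dir, fname))

# Remove the file that declares the two extra ChatML tokens.
added = os.path.join(model_dir, "added_tokens.json")
if os.path.exists(added):
    os.remove(added)
```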
Owner

Thanks for the heads up! I have made the changes you suggested and hopefully this fixes the crash. (Due to hardware constraints I have to convert everything to .gguf to use it and can't actually test the HF/FP16 version properly.)
