Tokenisation for fill-in-the-middle prompts is broken

by jonastemplestein

Hi there, thanks a lot for building all these GGUFs for us!

Not sure if this is the right place to ask, but the GGUF version of this model doesn't seem to tokenise and detokenise the special fill-in-the-middle (FIM) strings correctly. Looking at the original model's tokenizer.json, I can see this:

    {
      "id": 32015,
      "content": "<|fim▁hole|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 32016,
      "content": "<|fim▁begin|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 32017,
      "content": "<|fim▁end|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    }
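
For comparison, the original tokenizer seems to round-trip these fine. A quick sanity check (the repo id below is my guess at the base model; swap in whichever repo this GGUF was actually converted from):

    # Sanity check against the original HF tokenizer. The model id is an
    # assumption -- substitute the repo this GGUF was converted from.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
    print(tok.convert_tokens_to_ids("<|fim▁hole|>"))  # expect 32015
    print(tok.decode([32015]))                        # expect "<|fim▁hole|>"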

But after loading any of the GGUFs from this repo into llama.cpp's server example, hitting the /detokenize endpoint fails with "unordered_map::at: key not found", and hitting /tokenize gives me the wrong tokens.
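
For reference, here's roughly how I'm hitting the server (a minimal sketch, assuming it's running on the default http://localhost:8080):

    # Minimal repro against a running llama.cpp server (assumed to be on
    # the default http://localhost:8080).
    import requests

    BASE = "http://localhost:8080"

    # With a correctly converted GGUF this should come back as the single
    # id 32015 from tokenizer.json.
    print(requests.post(f"{BASE}/tokenize",
                        json={"content": "<|fim▁hole|>"}).json())

    # This is the call that currently dies with
    # "unordered_map::at: key not found".
    print(requests.post(f"{BASE}/detokenize",
                        json={"tokens": [32015]}).json())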

Is there some way to inspect the tokenizer configuration inside the GGUF? Could quantisation somehow cause this?
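
In case it helps anyone check, here's a minimal sketch of reading the tokenizer metadata back out of a GGUF with the gguf Python package from the llama.cpp repo (pip install gguf). The filename is a placeholder, and the field access is based on my reading of GGUFReader's layout:

    # Sketch: inspect the vocab stored in a GGUF's metadata. The filename
    # is a placeholder; per gguf.GGUFReader's layout, field.data holds
    # indices into field.parts for each element of an array field.
    from gguf import GGUFReader

    reader = GGUFReader("deepseek-coder.Q4_K_M.gguf")  # placeholder path
    field = reader.fields["tokenizer.ggml.tokens"]

    # Print what the GGUF thinks token 32015 is -- expect "<|fim▁hole|>".
    print(bytes(field.parts[field.data[32015]]).decode("utf-8"))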
