Duplicate tokens

#15 opened by noobhappylife

While taking a closer look at tokenizer.json, I noticed that the following added_tokens entries have the same content.

    {
      "id": 128268,
      "content": "<|reserved_special_token_262|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 128269,
      "content": "<|reserved_special_token_262|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
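For reference, a minimal sketch of how to scan for this (assuming tokenizer.json is in the current directory; the filename is the only assumption):

    import json
    from collections import Counter

    # Load the tokenizer definition and count how often each added-token
    # content string appears; anything with a count > 1 is a duplicate.
    with open("tokenizer.json", encoding="utf-8") as f:
        spec = json.load(f)

    counts = Counter(tok["content"] for tok in spec["added_tokens"])
    for content, n in counts.items():
        if n > 1:
            ids = [tok["id"] for tok in spec["added_tokens"] if tok["content"] == content]
            print(f"{content} appears {n} times with ids {ids}")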
NousResearch org

:o hmm

I think this may affect correctness for anyone who loads this tokenizer.json with https://github.com/huggingface/tokenizers: the library would ignore the second appearance and shift tokens >= 128269 forward, shrinking the vocab size by one, so decoded output may not match what the model actually means. I am guessing that regenerating tokenizer.json should fix it, since the duplication is not present in tokenizer_config.json.
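A quick way to see the effect, as a sketch (assuming tokenizer.json is in the current directory and the duplicate-collapsing behavior described above):

    from tokenizers import Tokenizer

    # Load tokenizer.json directly with the tokenizers library.
    tok = Tokenizer.from_file("tokenizer.json")

    # If the duplicate entry is dropped, the reported vocab size comes up
    # one short of what the model expects.
    print("vocab size (with added tokens):", tok.get_vocab_size(with_added_tokens=True))

    # Inspect how ids around the duplicate resolve back to strings.
    for i in (128268, 128269, 128270):
        print(i, "->", tok.id_to_token(i))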

NousResearch org

Try now
