configs do not match tokenizer vocab size

#5 by carson-together

I think there is a mismatch between the tokenizer's vocabulary size and the vocab_size reported in config.json.

When loading the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Yi-34B")
we encounter this warning:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are converting a LlamaTokenizer to a LlamaTokenizerFast, but wrong indexes were founds when adding the `added_tokens` from the `slow` tokenizer to the `fast`.  The following tokens had unexpected id :
    expected id: 64000, found: 1,  token: `<|startoftext|>`,
    expected id: 64001, found: 2,  token: `<|endoftext|>`,
. You should try using `from_slow`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
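For what it's worth, the warning's own suggestion would look like this. This is just a sketch, assuming a transformers version that accepts the use_fast / from_slow arguments; it quiets the index complaint but doesn't change the vocab_size mismatch below:

from transformers import AutoTokenizer

repo = "NousResearch/Nous-Hermes-2-Yi-34B"

# Load the slow (SentencePiece) tokenizer directly...
slow_tok = AutoTokenizer.from_pretrained(repo, use_fast=False)

# ...or force a fresh slow -> fast conversion, as the warning suggests.
fast_tok = AutoTokenizer.from_pretrained(repo, from_slow=True)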

The tokenizer vocab size is 64002:
len(tokenizer)

64002

but config.json reports a vocab size of 64000:

 "use_cache": false,
  "vocab_size": 64000
}
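To reproduce the mismatch end to end, here is a minimal sketch (the repo id is the one above; everything else is standard transformers usage):

from transformers import AutoConfig, AutoTokenizer

repo = "NousResearch/Nous-Hermes-2-Yi-34B"
tokenizer = AutoTokenizer.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)

print(len(tokenizer))     # 64002 (base vocab plus the two added special tokens)
print(config.vocab_size)  # 64000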

Yeah.

Mergekit gives these warnings:

WARNING:root:Token '<|startoftext|>' present in /home/alpha/Models/Raw/NousResearch_Nous-Hermes-2-Yi-34B tokenizer but >= vocab_size
WARNING:root:Token '<|endoftext|>' present in /home/alpha/Models/Raw/NousResearch_Nous-Hermes-2-Yi-34B tokenizer but >= vocab_size

Correcting the vocab_size in config.json to 64002 doesn't seem to fix it either.
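One possible local workaround, rather than only editing config.json, is to resize the model's embeddings to match the tokenizer and re-save. This is a hedged sketch, not something confirmed by the model authors, and since the log above shows those two tokens already resolving to ids 1 and 2, it may only paper over the real problem:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "NousResearch/Nous-Hermes-2-Yi-34B"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

# Grow the input/output embeddings so ids 64000 and 64001 are valid rows;
# this also updates model.config.vocab_size to len(tokenizer) == 64002.
model.resize_token_embeddings(len(tokenizer))

# Save a consistent copy for downstream tools (e.g. mergekit) to consume.
model.save_pretrained("./Nous-Hermes-2-Yi-34B-resized")
tokenizer.save_pretrained("./Nous-Hermes-2-Yi-34B-resized")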
