Loading SPM tokenizer shows 32000 vocab size instead of 32064

#47
by jrc - opened

The Phi-3 paper states that the Phi-3 Mini SentencePiece tokenizer has a vocab size of 32064. However, when I load the tokenizer with the following code, the saved model reports a vocab size of only 32000.

>>> from sentencepiece import SentencePieceProcessor
>>> tokenizer = SentencePieceProcessor()
>>> tokenizer.load(PATH_TO_TOKENIZER_MODEL)
>>> tokenizer.vocab_size()
32000

What am I doing wrong here? And how does this end up working correctly in HF?

I also manually examined the tokenizer.json file, which only includes piece_ids up to 31999.
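For reference, here is a minimal sketch of how to list the IDs beyond the base SPM vocab, assuming the standard Hugging Face tokenizer.json layout with a top-level added_tokens list of {"id", "content"} entries (the file path is a placeholder):

>>> import json
>>> with open("tokenizer.json") as f:  # path to the repo's tokenizer.json (assumed)
...     data = json.load(f)
>>> # Base SPM pieces live under model.vocab; special tokens are listed in added_tokens
>>> [(t["id"], t["content"]) for t in data["added_tokens"] if t["id"] >= 32000]

Any entries printed here are the tokens that the raw SentencePiece model file does not know about.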

I see IDs up to 32010 in the added_tokens key.

True, but that is still not 32064, which would mean the embedding size doesn't match the tokenizer and loading should fail.

Microsoft org

The base tokenizer has 32000 tokens + 10 additional tokens = 32010.

Rounding 32010 up to the next multiple of 64 gives 32064, which provides substantial matrix-multiplication performance benefits on Ampere or Hopper hardware: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
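As a quick check of this, here is a minimal sketch contrasting the tokenizer length with the padded embedding size, assuming the transformers AutoTokenizer/AutoConfig APIs and an assumed checkpoint name for Phi-3 Mini:

>>> from transformers import AutoConfig, AutoTokenizer
>>> model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name
>>> tok = AutoTokenizer.from_pretrained(model_id)
>>> cfg = AutoConfig.from_pretrained(model_id)
>>> len(tok)         # base 32000 SPM pieces plus the added special tokens
>>> cfg.vocab_size   # 32064: embedding rows padded up to a multiple of 64

The rows between len(tok) and cfg.vocab_size in the embedding matrix are never produced by the tokenizer; they are just padding, so the mismatch with the SentencePiece vocab size is expected and harmless.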

gugarosa changed discussion status to closed
