Loading SPM tokenizer shows 32000 vocab size instead of 32064

#47
by jrc - opened

The Phi-3 paper states that the Phi-3 Mini SentencePiece tokenizer has a vocab size of 32064. However, when I load the tokenizer with the following code, the saved model reports a vocab size of only 32000.

>>> from sentencepiece import SentencePieceProcessor
>>> tokenizer = SentencePieceProcessor()
>>> tokenizer.load(PATH_TO_TOKENIZER_MODEL)
>>> tokenizer.vocab_size()
32000

What am I doing wrong here? And how does this end up working correctly in HF?

I also manually examined the tokenizer.json file, which only includes piece_ids up to 31999.
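For reference, here is a minimal sketch of how to list the IDs beyond the base SPM vocab, assuming the standard Hugging Face tokenizer.json layout with a top-level added_tokens list of {"id", "content"} entries (the file path is a placeholder):

>>> import json
>>> with open("tokenizer.json") as f:  # path to the repo's tokenizer.json (assumed)
...     data = json.load(f)
>>> # Base SPM pieces live under model.vocab; special tokens are listed in added_tokens
>>> [(t["id"], t["content"]) for t in data["added_tokens"] if t["id"] >= 32000]

Any entries printed here are the tokens that the raw SentencePiece model file does not know about.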

I see IDs up to 32010 in the added_tokens key.

True, but that is still not 32064, which would mean the embedding size doesn't match the tokenizer and loading should fail.

Microsoft org

The base tokenizer has 32000 tokens + 10 additional tokens = 32010.

Rounding 32010 up to the next multiple of 64 gives 32064, which provides substantial matrix-multiplication performance benefits on Ampere or Hopper hardware: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
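As a quick check of this, here is a minimal sketch contrasting the tokenizer length with the padded embedding size, assuming the transformers AutoTokenizer/AutoConfig APIs and an assumed checkpoint name for Phi-3 Mini:

>>> from transformers import AutoConfig, AutoTokenizer
>>> model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name
>>> tok = AutoTokenizer.from_pretrained(model_id)
>>> cfg = AutoConfig.from_pretrained(model_id)
>>> len(tok)         # base 32000 SPM pieces plus the added special tokens
>>> cfg.vocab_size   # 32064: embedding rows padded up to a multiple of 64

The rows between len(tok) and cfg.vocab_size in the embedding matrix are never produced by the tokenizer; they are just padding, so the mismatch with the SentencePiece vocab size is expected and harmless.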

gugarosa changed discussion status to closed
