several concerns about tokenizer and model embedding
#41
opened by yinanxu
1. The size of the model embedding and the size of the tokenizer do not match.
- model embedding size = 151936
- tokenizer size = 151851
What are those extra (151936-151851)=85 tokens? Are those extra tokens used?
2. There are no explicit special tokens in the tokenizer. My guess is
- bos = <|im_start|>
- eos = <|im_end|>
but what is the pad token?
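For reference, a minimal sketch of how the two sizes can be compared (the `Qwen/Qwen-7B` checkpoint name is an assumption; substitute the model under discussion, and the standard `transformers` AutoTokenizer/AutoConfig API is assumed):

```python
from transformers import AutoConfig, AutoTokenizer

# Checkpoint name is an assumption; substitute the model under discussion.
name = "Qwen/Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
config = AutoConfig.from_pretrained(name, trust_remote_code=True)

print("tokenizer size:      ", len(tokenizer))                     # 151851
print("model embedding rows:", config.vocab_size)                  # 151936
print("difference:          ", config.vocab_size - len(tokenizer)) # 85
```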
- The embedding size is padded to a multiple of 128 to improve computation efficiency on devices with tensor cores. The padded entries are not used (see the sketch below).
- Please refer to our docs here: https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md There is generally no need to set those tokens with our code.
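To make the padding arithmetic concrete, a minimal sketch of the rounding rule described above (plain Python):

```python
import math

vocab = 151851     # tokenizer size
multiple = 128     # padding granularity for tensor-core efficiency

padded = math.ceil(vocab / multiple) * multiple
print(padded)            # 151936, the model embedding size
print(padded - vocab)    # 85 unused padding entries
```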
jklj077 changed discussion status to closed
Isn't the number being discussed the size of the vocab, not the embedding dimensionality of tokens?
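For concreteness, the number refers to the row count of the input embedding matrix (one row per token id), not the per-token hidden dimension. A minimal sketch, with an assumed hidden size of 4096:

```python
import torch

vocab_size = 151936    # padded vocabulary size: number of embedding rows
hidden_size = 4096     # embedding dimensionality per token (assumed value)

embedding = torch.nn.Embedding(vocab_size, hidden_size)
print(embedding.weight.shape)   # torch.Size([151936, 4096])
```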