several concerns about tokenizer and model embedding
#41
opened by yinanxu
1. The size of the model embedding and the size of the tokenizer do not match.
- model embedding size = 151936
- tokenizer size = 151851
What are those extra (151936-151851)=85 tokens? Are those extra tokens used?
2. There are no explicit special tokens in the tokenizer. My guess is
- bos = <|im_start|>
- eos = <|im_end|>
but what is the pad token?
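For reference, a minimal sketch of how the two sizes can be compared (the `Qwen/Qwen-7B` checkpoint name is an assumption; substitute the model under discussion, and the standard `transformers` AutoTokenizer/AutoConfig API is assumed):

```python
from transformers import AutoConfig, AutoTokenizer

# Checkpoint name is an assumption; substitute the model under discussion.
name = "Qwen/Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
config = AutoConfig.from_pretrained(name, trust_remote_code=True)

print("tokenizer size:      ", len(tokenizer))                     # 151851
print("model embedding rows:", config.vocab_size)                  # 151936
print("difference:          ", config.vocab_size - len(tokenizer)) # 85
```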
- The embedding size is padded to a multiple of 128 to improve computation efficiency on devices with tensor cores. The padded entries are not used (see the sketch below).
- Please refer to our docs here: https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md There is generally no need to set those tokens with our code.
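To make the padding arithmetic concrete, a minimal sketch of the rounding rule described above (plain Python):

```python
import math

vocab = 151851     # tokenizer size
multiple = 128     # padding granularity for tensor-core efficiency

padded = math.ceil(vocab / multiple) * multiple
print(padded)            # 151936, the model embedding size
print(padded - vocab)    # 85 unused padding entries
```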
jklj077 changed discussion status to closed
Isn't the number being discussed the size of the vocab, not the embedding dimensionality of tokens?
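For concreteness, the number refers to the row count of the input embedding matrix (one row per token id), not the per-token hidden dimension. A minimal sketch, with an assumed hidden size of 4096:

```python
import torch

vocab_size = 151936    # padded vocabulary size: number of embedding rows
hidden_size = 4096     # embedding dimensionality per token (assumed value)

embedding = torch.nn.Embedding(vocab_size, hidden_size)
print(embedding.weight.shape)   # torch.Size([151936, 4096])
```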