
several concerns about tokenizer and model embedding

#41
by yinanxu - opened

1. The model embedding size and the tokenizer size do not match.

  • model embedding size = 151936
  • tokenizer size = 151851
    What are those extra (151936 - 151851) = 85 tokens? Are they used?
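
For reference, the mismatch is easy to reproduce. A minimal sketch, assuming the Qwen/Qwen-7B checkpoint and the transformers library:

```python
from transformers import AutoConfig, AutoTokenizer

# trust_remote_code is needed because Qwen ships custom tokenizer/model code
config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

print(config.vocab_size)                   # 151936 (rows in the embedding matrix)
print(len(tokenizer))                      # 151851 (tokens the tokenizer defines)
print(config.vocab_size - len(tokenizer))  # 85
```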

2. There are no explicit special tokens in the tokenizer. I would guess

  • bos = <|im_start|>
  • eos = <|im_end|>
    but what is the pad token?
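
Concretely, here is where the missing pad token bites; a sketch under the same Qwen/Qwen-7B assumption as above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# With no pad token set, transformers refuses to pad a batch and raises:
#   ValueError: Asking to pad but the tokenizer does not have a padding token.
batch = tokenizer(["short", "a somewhat longer example"], padding=True)
```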
1. The embedding size is padded up to a multiple of 128 to improve computation efficiency on devices with tensor cores; see the arithmetic sketch below. The padded rows are not used.
2. Please refer to our docs here: https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md There is generally no need to set those tokens when using our code.
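
The arithmetic behind those 85 extra rows, sketched from the padding rule in point 1:

```python
import math

vocab_size = 151851                         # tokens the tokenizer defines
padded = math.ceil(vocab_size / 128) * 128  # round up to a multiple of 128
print(padded)               # 151936
print(padded - vocab_size)  # 85
```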
jklj077 changed discussion status to closed

Isn't the number under discussion the vocabulary size, not the embedding dimensionality of the tokens?
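
A toy sketch to disambiguate the two quantities (4096 here is an assumption based on Qwen-7B's hidden size):

```python
import torch.nn as nn

# 151936 is num_embeddings (the vocab/row count), while the embedding
# dimensionality is the hidden size of the model (4096 for Qwen-7B)
emb = nn.Embedding(num_embeddings=151936, embedding_dim=4096)
print(emb.weight.shape)  # torch.Size([151936, 4096])
```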
