Wrong BOS and EOS tokens in tokenizer.model file

#12
by sszymczyk - opened

In the tokenizer.model file, BOS (token 31998) is "弘" and EOS (token 31999) is "给", instead of "<|im_start|>" and "<|im_end|>" respectively.

Snowflake org

Good observation! We decided to reuse the last two tokens in the vocabulary to represent the ChatML <|im_start|> and <|im_end|> tokens, so that the turn delimiters could be assigned dedicated tokens without increasing the size of the model's embedding matrix. Our tokenizer should correctly encode the ChatML delimiters to these last two IDs and decode the IDs back to these turn tags. Please let us know if you have any additional questions.
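To make the remapping concrete, here is a minimal toy sketch of the idea in plain Python. It is not the actual tokenizer code: the names `SENTENCEPIECE_PIECES`, `CHATML_OVERRIDES`, `split_on_tags`, and `decode_id` are hypothetical, and the vocabulary size of 32000 is inferred from 31999 being described as one of the last IDs. It only shows how two existing IDs can be given new surface strings without adding rows to the embedding matrix.

```python
import re

# Assumption: 32000 is inferred from 31998/31999 being the last two IDs.
VOCAB_SIZE = 32000

# Pieces stored in tokenizer.model for the last two IDs (as reported above).
SENTENCEPIECE_PIECES = {31998: "弘", 31999: "给"}

# Hypothetical override table: same IDs, new surface strings.
CHATML_OVERRIDES = {31998: "<|im_start|>", 31999: "<|im_end|>"}
TAG_TO_ID = {tag: token_id for token_id, tag in CHATML_OVERRIDES.items()}

# Capturing group so re.split keeps the tags in its output.
TAG_PATTERN = re.compile("(" + "|".join(re.escape(t) for t in TAG_TO_ID) + ")")


def split_on_tags(text):
    """Return ints for ChatML tags and strings for ordinary text segments."""
    parts = []
    for chunk in TAG_PATTERN.split(text):
        if not chunk:
            continue  # re.split yields empty strings at the boundaries
        parts.append(TAG_TO_ID.get(chunk, chunk))
    return parts


def decode_id(token_id):
    """Prefer the override; fall back to the piece stored in the model file."""
    return CHATML_OVERRIDES.get(token_id, SENTENCEPIECE_PIECES.get(token_id))


segments = split_on_tags("<|im_start|>user\nHi<|im_end|>")
print(segments)
print(decode_id(31998), decode_id(31999))
```

In this sketch the ordinary text segments would still go through the regular SentencePiece model. Only the two tags bypass it, which is why reading tokenizer.model directly shows the original pieces for those IDs.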

sszymczyk changed discussion status to closed
