Question about tokenizer

#3
by freQuensy23 - opened

I've tried your model and can't understand some of its tokenizer's behaviour:
tokenizer('1') returns 1, 29871, 29896.
1 is the BOS token and 29896 is the '1' token, but what does 29871 mean?

When I decode it back into a string, I get tokenizer.decode([29871]) = '' (an empty string).
Can you explain the purpose of adding an empty string to the tokenizer's vocab?
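
A minimal sketch for reproducing the check above, assuming the tokenizer is a Hugging Face `transformers` tokenizer (the post does not say how it was loaded). `inspect_ids` is a hypothetical helper name, and the model id in the usage comment is a placeholder. It lists each id next to its raw vocab entry and its decoded string, since the two are not guaranteed to match.

```python
def inspect_ids(tokenizer, text):
    """Return (id, raw vocab token, decoded string) for every id produced for `text`.

    Assumption: `tokenizer` behaves like a transformers tokenizer, i.e. it is
    callable and provides `convert_ids_to_tokens` and `decode`.
    """
    ids = tokenizer(text)["input_ids"]
    # Raw vocab entries, before any decode-time cleanup is applied.
    tokens = tokenizer.convert_ids_to_tokens(ids)
    # Decode each id on its own to see what string it maps back to.
    return [(i, tok, tokenizer.decode([i])) for i, tok in zip(ids, tokens)]


# Usage (placeholder model id; replace with the model being discussed):
#
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("<model-id>")
#   for row in inspect_ids(tokenizer, "1"):
#       print(row)
```

If the raw vocab entry for 29871 is not empty while its decoded string is, the id is probably not an empty-string token, and decode may simply be stripping it.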
