Ġ in tokenizer

#20
by Sm1Ling - opened

Why are there so many "Ġ" characters in the tokenizer?

https://ai.stackexchange.com/questions/45054/why-do-llm-tokenizers-use-a-special-symbol-for-space-such-as-%C4%A0-in-bpe-or-in-sp

My understanding is that this character stands for a space, so a token that starts with "Ġ" is one that begins a new word after a space. I think keeping the space inside the token this way improves the model's behavior around word boundaries.
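To make that concrete, here is a minimal sketch of where the character comes from, assuming this tokenizer is a GPT-2-style byte-level BPE (the thread does not say which one, so treat that as an assumption). In that scheme every byte is remapped to a printable character before merging, and the space byte (0x20) lands on U+0120, which is "Ġ". The helper name `to_visible` is mine, just for illustration.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Everything else (control bytes, space, etc.) is shifted to 256 and up.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_visible(text):
    # Hypothetical helper: show text the way it appears in the vocabulary.
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_visible(" world"))       # Ġworld
print(to_visible("Hello world"))  # HelloĠworld
```

So under that assumption the "Ġ" entries in the vocabulary are just tokens with a leading space, and decoding maps them back to ordinary spaces.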

I appreciate your answer a lot!

Sm1Ling changed discussion status to closed
