Space and Newline in same token
#13, opened by Kyriota
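from transformers import AutoTokenizer

model_id = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=False)
# Compare the token IDs produced for a single space and a single newline
print(tokenizer.encode(' ') == tokenizer.encode('\n'))  # True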
I can see space and newline listed in 'added_tokens.json',
where they should be 32106 and 32103 respectively.
But in my code, both encode to the same token IDs.
I'm wondering why.
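To narrow this down, it may help to look at the actual IDs and token strings rather than only the equality check. The sketch below is a diagnostic, not an explanation of the cause. The helper names `describe_encoding` and `run_check` are made up for illustration. It assumes `encode` may append special tokens (such as an end-of-sequence token) that could make two otherwise different or empty encodings look identical, so it prints results both with and without `add_special_tokens`.

```python
def describe_encoding(tokenizer, text, add_special_tokens=True):
    """Return (ids, tokens) for `text` so the pieces behind each ID are visible."""
    ids = tokenizer.encode(text, add_special_tokens=add_special_tokens)
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return ids, tokens


def run_check(model_id="lmsys/fastchat-t5-3b-v1.0"):
    """Hypothetical helper: load the tokenizer and print the comparison."""
    # Imported here so describe_encoding can be used without transformers installed
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=False)

    for text in (" ", "\n"):
        for special in (True, False):
            ids, tokens = describe_encoding(tokenizer, text, special)
            print(repr(text), "add_special_tokens =", special, ids, tokens)

    # IDs the loaded tokenizer actually assigns to whitespace-only added tokens
    added = tokenizer.get_added_vocab()
    print({repr(tok): idx for tok, idx in added.items() if tok.strip() == ""})


# run_check()  # uncomment to download the tokenizer and print the comparison
```

If the two lists printed with `add_special_tokens=False` are both empty, the whitespace is being dropped before the added tokens are matched. If they differ, the equality in the original snippet would need another explanation.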