About llama3 tokenizer
#146
opened by Yingshu
I am using the llama3 tokenizer and I found an issue.
When I tokenize the string 'help . and':
llama3_tokenizer('help . and')
I got {'input_ids': [128000, 8823, 662, 323], 'attention_mask': [1, 1, 1, 1]}
If I decode the input ids:
llama3_tokenizer.decode([8823,662,323])
I got 'help. and'.
Why do I lose the space after 'help'?
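
To make this easier to reproduce, here is roughly how I am checking it. This is just a sketch and assumes the tokenizer is loaded through transformers' AutoTokenizer; the model id below is a placeholder for whichever Llama 3 checkpoint you use.

from transformers import AutoTokenizer

llama3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

enc = llama3_tokenizer('help . and')
print(enc['input_ids'])  # [128000, 8823, 662, 323]

# Look at the raw token pieces to see where the space ended up.
print(llama3_tokenizer.convert_ids_to_tokens(enc['input_ids']))

# Decoding the ids after the BOS token gives 'help. and' -- the space before '.' is gone.
print(llama3_tokenizer.decode(enc['input_ids'][1:]))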
If I use the llama2 tokenizer, I get the original string back:
llama2_tokenizer('help . and')
{'input_ids': [1, 1371, 869, 322], 'attention_mask': [1, 1, 1, 1]}
llama2_tokenizer.decode([1371,869,322])
'help . and'
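
For completeness, the same check with the llama2 tokenizer (again just a sketch assuming AutoTokenizer; the model id is a placeholder):

from transformers import AutoTokenizer

llama2_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

enc = llama2_tokenizer('help . and')
print(enc['input_ids'])  # [1, 1371, 869, 322]

# Decoding the ids after the BOS token round-trips to 'help . and'.
print(llama2_tokenizer.decode(enc['input_ids'][1:]))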