Tokenizer splitting words that are in the vocab

#5
by one-thing - opened

The tokenizer is splitting words that are already in the vocab, for example:
('▁विधायकों', 33821)

tokenizer.tokenize("विधायकों")
output
['▁', 'वि', 'धा', 'य', 'कों']

I observed this with many other words: बिश्नोई, एबीवीपी, ...

Screenshot 2023-12-16 at 8.43.05 PM.png

Sarvam AI org

Thank you for pointing this out. It seems that letting HF auto-identify the tokenizer type has some issues. Please use LlamaTokenizer instead of AutoTokenizer.
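
A minimal sketch of the suggested workaround, assuming the `transformers` package is installed with SentencePiece support. The helper names `load_slow_tokenizer`, `split_despite_vocab`, and `demo`, as well as the `model_id` argument, are hypothetical and only for illustration; pass this model's repo id yourself. The second helper checks for the symptom reported above: a word whose `▁`-prefixed form is a single vocab entry but which is still tokenized into several pieces.

```python
def load_slow_tokenizer(model_id: str):
    """Load the tokenizer with LlamaTokenizer instead of AutoTokenizer."""
    # Imported lazily so the pure-Python check below works without transformers.
    from transformers import LlamaTokenizer

    return LlamaTokenizer.from_pretrained(model_id)


def split_despite_vocab(word: str, tokens: list[str], vocab: dict[str, int]) -> bool:
    """Return True if '▁' + word is one vocab entry but `tokens` has several pieces."""
    return ("▁" + word) in vocab and len(tokens) > 1


def demo(model_id: str, word: str = "विधायकों") -> None:
    """Print the tokenization of `word` and whether it shows the reported symptom."""
    tok = load_slow_tokenizer(model_id)
    tokens = tok.tokenize(word)
    print(tokens)
    print("split despite vocab:", split_despite_vocab(word, tokens, tok.get_vocab()))
```

Calling `demo(model_id)` with the repo id should show whether the word is still split after switching tokenizer classes; the same check can be run on other affected words such as बिश्नोई.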

rahular changed discussion status to closed
