Nonspacing marks by themselves causing problems for the tokenizer

#2
by AngledLuffa - opened

I ran into what I believe is a minor problem with the tokenizer for indic-bert. I was looking at the L3Cube NER dataset:

https://github.com/l3cube-pune/MarathiNLP

The train section of the NER dataset contains the following sentence (the number in the third column is the sentence ID):

या      O       17197.0
मंत्राची  O       17197.0
देवता    O       17197.0
गणपती   O       17197.0
 ँ       O       17197.0
हा      O       17197.0
तो      O       17197.0
मंत्र     O       17197.0

The fifth "word" appears to be a nonspacing candrabindu mark by itself. If I feed the words to the indic-bert tokenizer word by word, I would expect it to produce the unknown token (e.g. `<unk>`) or something similar for an untokenizable word such as this one. Instead, it produces nothing. Is that expected behavior that I should compensate for, or is it something that can be fixed in the tokenizer?
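For reference, the character in question really is a lone combining mark, which normalization steps in many tokenizers can strip entirely. Below is a minimal check of its Unicode category, plus a hedged sketch of the workaround I'm describing: falling back to the unknown token whenever a word tokenizes to nothing, so word-level alignment with the NER labels is preserved. (`tokens_or_unk` is an illustrative helper I made up, not part of any library; it only assumes the Hugging Face-style `tokenize()` method and `unk_token` attribute.)

```python
import unicodedata

ch = "\u0901"  # U+0901 DEVANAGARI SIGN CANDRABINDU, the lone fifth "word"
print(unicodedata.name(ch))      # -> DEVANAGARI SIGN CANDRABINDU
print(unicodedata.category(ch))  # -> Mn  (Mn = nonspacing mark)

def tokens_or_unk(tokenizer, word):
    """Fall back to the unknown token when a word tokenizes to nothing,
    so the subword output stays aligned with the word-level labels."""
    toks = tokenizer.tokenize(word)
    return toks if toks else [tokenizer.unk_token]
```

This is only a client-side patch, of course; if the tokenizer itself can be fixed to emit the unknown token, that would be cleaner.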

Thanks again!
