My result has unnecessarily split words from my input along with some extra characters.

#5
by sukumar-rtg - opened

The problem is that when I input a sentence into the model, the result I receive does not correspond to the words in the sentence. Words are often split in two, and the second half is prefixed with an extra "##".

For example:

Input:

Modi increased the education budget by 20

Output:

[image: image.png]

Even the first word, "Modi", was split into "Mod" and "##i", with the unnecessary "##" added. I entered exactly the sentence given above, and I cannot understand why I am getting a result like this. My guess is that the issue is most likely not on my end, but I am open to suggestions if I am doing something wrong.

Arabic Language Technologies, Qatar Computing Research Institute org

Hello,

Modern language models use subwords to represent words that are not directly in their vocabulary. In BERT's case, this is WordPiece tokenization (https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt). From a practical point of view, you can either take the prediction for the first subtoken of each word (so "Mod" for "Modi") or apply some form of majority voting over the predictions for all of its subtokens. The subtokens always have the form word -> subword1 ##subword2 ##subword3, so any token starting with "##" continues the preceding word.
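
As a rough illustration of both options, here is a minimal sketch in plain Python that glues "##" pieces back onto the preceding token and keeps one label per word. The function name `merge_wordpieces` and the labels `LABEL_A`, `LABEL_B`, ... are hypothetical placeholders, not the model's real output. Only the "Mod" / "##i" split comes from this thread; the sketch assumes the remaining words were not split.

```python
from collections import Counter


def merge_wordpieces(tokens, labels, strategy="first"):
    """Merge WordPiece subtokens into words, keeping one label per word.

    strategy="first": use the label predicted for the first subtoken.
    strategy="majority": use the most frequent label among the subtokens
    (ties go to the label that appears first).
    """
    words = []   # reconstructed words
    groups = []  # labels of all subtokens belonging to each word
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: drop the "##" and append to the previous word.
            words[-1] += token[2:]
            groups[-1].append(label)
        else:
            words.append(token)
            groups.append([label])

    merged = []
    for word, group in zip(words, groups):
        if strategy == "first":
            label = group[0]
        elif strategy == "majority":
            label = Counter(group).most_common(1)[0][0]
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        merged.append((word, label))
    return merged


# Hypothetical per-token predictions for the sentence in the question.
tokens = ["Mod", "##i", "increased", "the", "education", "budget", "by", "20"]
labels = ["LABEL_A", "LABEL_B", "LABEL_C", "LABEL_D",
          "LABEL_E", "LABEL_E", "LABEL_F", "LABEL_G"]

for word, label in merge_wordpieces(tokens, labels, strategy="first"):
    print(word, label)
```

With `strategy="first"`, "Modi" gets whatever label was predicted for "Mod". With `strategy="majority"`, the label is decided by a vote over all pieces of the word, which only makes a difference for words split into three or more pieces or when the pieces agree.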

Hope this helps!

sukumar-rtg changed discussion status to closed
