Multiple sentences appear as one element in the results list

#5
by drmeir - opened

Sometimes, multiple correctly segmented sentences appear as one element in the results list. Example:

Input:
after all isn't that what we're here for that's why we're here for well rob you are an inspiration for the nation thank you for coming on thank you for having me thank you so much for watching this week's episode

Output:
After all, isn't that what we're here for? \n That's why we're here for, well, Rob, you are an inspiration for the nation. Thank you for coming on, Thank you for having me. Thank you so much for watching this week's episode.

Note how there is no \n before the sentences beginning with "Thank you".

Seems like this particular input is confusing the sentence boundary detection (full stop prediction) head. Note how the first instance of thank was capitalized after the comma, too. The model seems confused as to where the sentences end (likely due to repeated texts which are unusual of the training data).

I would recommend the better model, https://huggingface.co/1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase.

This is the output I get with that one:

After all, isn't that what we're here for? \n That's why we're here for. \n Well, Rob, you are an inspiration for the nation. \n Thank you for coming on, thank you for having me. \n Thank you so much for watching this week's episode.

That model seems to suffer from the opposite problem, where by partial sentences appear as elements of the output. I have submitted this issue in that model:
https://huggingface.co/1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase/discussions/1

Sign up or log in to comment