1-800-BAD-CODE/punctuation_fullstop_truecase_english · Multiple sentences appear as one element in the results list

Aug 27, 2023

Sometimes, multiple correctly segmented sentences appear as one element in the results list. Example:

Input:
after all isn't that what we're here for that's why we're here for well rob you are an inspiration for the nation thank you for coming on thank you for having me thank you so much for watching this week's episode

Output:
After all, isn't that what we're here for? \n That's why we're here for, well, Rob, you are an inspiration for the nation. Thank you for coming on, Thank you for having me. Thank you so much for watching this week's episode.

Note how there is no \n before the sentences beginning with "Thank you".

1-800-BAD-CODE

Owner Aug 27, 2023

Seems like this particular input is confusing the sentence boundary detection (full stop prediction) head. Note how the first instance of thank was capitalized after the comma, too. The model seems confused as to where the sentences end (likely due to repeated texts which are unusual of the training data).

I would recommend the better model, https://huggingface.co/1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase.

This is the output I get with that one:

After all, isn't that what we're here for? \n That's why we're here for. \n Well, Rob, you are an inspiration for the nation. \n Thank you for coming on, thank you for having me. \n Thank you so much for watching this week's episode.

drmeir

Aug 28, 2023

That model seems to suffer from the opposite problem, where by partial sentences appear as elements of the output. I have submitted this issue in that model:
https://huggingface.co/1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase/discussions/1