Partial sentences in the output list

#1
by drmeir - opened

The output list may contain partial sentences, which end in a comma. Example:

Input:
and they are coming home it's like you know i know that that's why god put me on this earth that's what i'm doing did you have to train yourself or by 16 you were like ready to go and start arguing and fighting and defending i'll tell you what happened so we're on the street 15 i think 47th street they're pacing on posters and i'm it's the middle of the night my teacher have a saturday night and i began to argue with them it was an argument that was going so nowhere so fast

Output:
And they are coming home. \n It's like, you know, I know that, that's why God put me on this earth, that's what I'm doing. \n Did you have to train yourself? \n Or, \n by 16, you were like, ready to go and start arguing and fighting and defending. \n I'll tell you what happened. \n So we're on the Street 15, I think 47th Street, they're pacing on posters and I'm it's the middle of the night. \n My teacher have a Saturday night and I began to argue with them. \n It was an argument that was going so nowhere, so fast.

Note the sentence "Or,".

This looks like a sentence boundary detection (SBD) error. In a one-pass model like this (i.e., the model predicts SBD and punctuation in a single pass through the Transformer), SBD can fail in surprising ways.

On news data this model is nearly perfect w.r.t. SBD, but I'm aware that it shows a large gap in SBD performance between in-domain (news) and out-of-domain (conversational, etc.) data, which is surprising given the simplicity of the SBD task. To account for this, the model was exported with a fixed internal SBD threshold of 0.05 ("5%"). This seems to make it much better on informal data, such as the input in question, but some errors may still occur.
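For anyone curious how the threshold plays into the splitting, here is a minimal sketch (not the model's actual code; the function name, token list, and probabilities are made up for illustration) of how a per-token boundary probability compared against a fixed threshold determines where sentences are split, and how a weak boundary score just above the threshold can yield a short fragment like "Or,":

```python
# Minimal sketch of threshold-based sentence boundary detection.
# `sentence_end_probs` is a hypothetical per-token probability that a
# sentence boundary follows the token; the real model's internals may differ.

def split_sentences(tokens, sentence_end_probs, threshold=0.05):
    """Split a token sequence into sentences wherever the predicted
    boundary probability meets the threshold."""
    sentences, current = [], []
    for token, p in zip(tokens, sentence_end_probs):
        current.append(token)
        if p >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing tokens as a final sentence
        sentences.append(" ".join(current))
    return sentences

# Illustrative only: a weak boundary score just above the threshold
# produces a short fragment, while a higher threshold would not split there.
tokens = ["or", "by", "16", "you", "were", "ready"]
probs  = [0.06, 0.01, 0.02, 0.01, 0.01, 0.90]
print(split_sentences(tokens, probs, threshold=0.05))
# ['or', 'by 16 you were ready']
print(split_sentences(tokens, probs, threshold=0.50))
# ['or by 16 you were ready']
```

Lowering the threshold trades missed boundaries for occasional over-splits, which is consistent with the stray "Or," fragment above.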

I'm working on improving SBD on out-of-domain data and will release a new model soon, with an additional capability as well: predicting "merge" tokens, e.g., `the f b i agent` -> `The FBI agent`. A rough sketch of that idea follows.
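For context, here is a hypothetical sketch of what "merge" post-processing could look like (the labels, names, and example are illustrative, and this capability is not in the released model):

```python
# Hypothetical sketch of "merge" token post-processing.
# A merge label means "attach this token to the previous one with no space".

def apply_merges(tokens, merge_labels):
    """Rejoin tokens that the model marks as continuations of the previous token."""
    out = []
    for token, merge in zip(tokens, merge_labels):
        if merge and out:
            out[-1] += token
        else:
            out.append(token)
    return out

tokens = ["the", "f", "b", "i", "agent"]
merges = [False, False, True, True, False]
print(apply_merges(tokens, merges))
# ['the', 'fbi', 'agent']
```

Upper-casing "fbi" to "FBI" would then come from the model's separate casing predictions, not from the merge step itself.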
