1-800-BAD-CODE/punct_cap_seg_47_language

Text2Text Generation

sentence-boundary-detection

1-800-BAD-CODE committed on Feb 22, 2023

Commit 2f391ad • 1 Parent(s): 09527b1

Update README.md

Files changed (1)

README.md +5 -1

@@ -157,7 +157,6 @@ Languages were chosen based on whether the News Crawl corpus contained enough re
 # Limitations
 This model was trained on news data, and may not perform well on conversational or informal data.
-This is also a base-sized model with many languages and many tasks, so capacity may be limited.
 This model predicts punctuation only once per subword.
 This implies that some acronyms, e.g., 'U.S.', cannot be properly punctuated.
@@ -167,4 +166,9 @@ This concession was accepted on two grounds:
    Since the expected use-case of this model is the output of an ASR system, it is presumed that such
    pronunciations would be transcribed as separate tokens, e.g., 'u s' vs. 'us' (though this depends on the model's pre-processing).
+Further, this model is unlikely to be of production quality.
+Though trained to convergence, it was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data.
+This is also a base-sized model with many languages and many tasks, so capacity may be limited.
 # Evaluation
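
The once-per-subword limitation described in the README can be illustrated with a toy sketch. This is not the model's decoding code: `apply_punctuation`, its arguments, and the example tokenizations are hypothetical, and capitalization is ignored. It only assumes what the README states, namely that each subword receives at most one punctuation prediction.

```python
def apply_punctuation(subwords, marks):
    """Append one predicted mark (possibly empty) after each subword of a word.

    Hypothetical helper: one prediction slot per subword, as the README describes.
    """
    if len(subwords) != len(marks):
        raise ValueError("need exactly one prediction slot per subword")
    return "".join(token + mark for token, mark in zip(subwords, marks))


# 'us' kept as a single subword: one slot, so at most one period can be placed.
single = apply_punctuation(["us"], ["."])

# 'u s' transcribed as separate tokens: two slots, so both periods fit.
split = apply_punctuation(["u", "s"], [".", "."])

print(single)  # us.
print(split)   # u.s.
```

With a single subword there is nowhere to place the period between the two letters, which is why the README expects ASR output such as 'u s' to be handled better than 'us'.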