1-800-BAD-CODE committed on
Commit 651e333
1 parent: 6d889b7

Update README.md

Files changed (1): README.md +8 -5
@@ -57,6 +57,9 @@ language:
 # Model Overview
 This model accepts as input lower-cased, unpunctuated, unsegmented text in 47 languages and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).
 
+All languages are processed with the same algorithm, with no need for language tags or language-specific branches in the graph.
+This includes continuous-script and non-continuous-script languages, language-specific punctuation, etc.
+
 # Model Details
 
 This model generally follows the graph shown below, with brief descriptions of each step following.
@@ -87,14 +90,14 @@ In practice, this means the inverted question mark for Spanish and Asturian, `¿
 Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
 
 5. **Sentence boundary detection**
-Parallel to the "pre" punctuation, another classification predicts from the re-encoded text sentence boundaries.
+Parallel to the "pre" punctuation, another classification network predicts sentence boundaries from the re-encoded text.
 In all languages, sentence boundaries can occur only if a potential full stop is predicted, hence the conditioning.
 
 6. **Shift and concat sentence boundaries**
 In many languages, the first character of each sentence should be upper-cased.
-Thus, we want to feed the sentence boundary information to the true-case classification network.
+Thus, we should feed the sentence boundary information to the true-case classification network.
 Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
-Therefore, we shift right by one the binary sentence boundary decisions.
+Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
 Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
 
 7. **True-case prediction**
@@ -151,12 +154,12 @@ This model was trained with News Crawl data from WMT.
 
 Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
 
-# Bias, Risks, and Limitation
+# Limitations
 This model was trained on news data, and may not perform well on conversational or informal data.
 
 This is also a base-sized model with many languages and many tasks, so capacity may be limited.
 
-This model also predicts punctuation only once per subword.
+This model predicts punctuation only once per subword.
 This implies that some acronyms, e.g., 'U.S.', cannot properly be punctuated.
 This concession was accepted on two grounds:
 1. Such acronyms are rare, especially in the context of multi-lingual models
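The `¿`-only-with-`?` conditioning described in the README can be sketched as a decode-time mask over the pre-punctuation head's outputs. This is an illustrative reconstruction, not the model's actual decoding code; the function name and class-index layout are assumptions.

```python
import numpy as np

def decode_pre_punct(pre_probs: np.ndarray, question_predicted: bool,
                     inverted_q_idx: int) -> int:
    """Pick a pre-punctuation class, masking `¿` unless a `?` was predicted.

    pre_probs:          class probabilities from the pre-punctuation head.
    question_predicted: whether the punctuation head predicted a `?`.
    inverted_q_idx:     index of the `¿` class (an assumed label layout).
    """
    probs = pre_probs.copy()
    if not question_predicted:
        probs[inverted_q_idx] = 0.0  # `¿` cannot appear without a `?`
    return int(np.argmax(probs))
```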
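Step 5's conditioning (a sentence boundary can occur only where a potential full stop is predicted) can likewise be sketched as a mask over the SBD head's outputs. The full-stop set, threshold, and function name below are illustrative assumptions, not the model's implementation.

```python
import numpy as np

# Assumed example set of potential full stops for illustration.
FULL_STOPS = {".", "?", "。", "؟"}

def constrained_boundaries(punct_preds, boundary_probs, threshold=0.5):
    """Allow a sentence boundary only at tokens where the punctuation
    head predicted a potential full stop."""
    allowed = np.array([p in FULL_STOPS for p in punct_preds])
    return (np.asarray(boundary_probs) > threshold) & allowed
```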
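The shift-and-concat of step 6 can be sketched in NumPy. The shapes, the function name, and the assumption that the first token always starts a sentence are illustrative, not the model's actual code.

```python
import numpy as np

def add_first_word_feature(encodings: np.ndarray,
                           boundaries: np.ndarray) -> np.ndarray:
    """Concatenate a 'first word of sentence' flag onto each time step.

    encodings:  (T, D) re-encoded text.
    boundaries: (T,) binary sentence-boundary decisions from the SBD head.
    """
    # Shift right by one: if token N-1 ends a sentence, token N begins one.
    first_word = np.empty_like(boundaries)
    first_word[0] = 1  # assume the first token always starts a sentence
    first_word[1:] = boundaries[:-1]
    # Channel-wise concat: (T, D) + (T, 1) -> (T, D + 1).
    return np.concatenate([encodings, first_word[:, None]], axis=-1)
```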
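The once-per-subword limitation can be illustrated with a toy example. The tokenization shown is an assumption for illustration, not the model's actual vocabulary.

```python
def attach_punct(subwords, punct_preds):
    """Each subword receives at most one trailing punctuation token."""
    return "".join(sw + p for sw, p in zip(subwords, punct_preds))

# If 'us' survives tokenization as a single subword, only one punctuation
# slot exists after it, so 'U.S.' (which needs two periods) is unreachable;
# the best the model can emit is a single trailing period.
```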