1-800-BAD-CODE committed
Commit e2feff2 · Parent(s): d55df9f
Update README.md

README.md CHANGED

---

# Model Overview
This model accepts as input lower-cased, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).

In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.

# Usage
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
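
A minimal sketch of inference with `punctuators`; the pretrained name `pcs_en` is an assumption here, so check the punctuators repository for the exact names and current API:

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Load an ONNX punctuation/true-casing/segmentation model.
# "pcs_en" is an assumed pretrained name; see the punctuators repo for
# the models it actually ships.
m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained("pcs_en")

input_texts: List[str] = [
    "hello friend how's it going i was in the us last week",
]
# infer() returns, for each input text, a list of segmented, punctuated,
# and true-cased sentences.
results: List[List[str]] = m.infer(input_texts)
for sentences in results:
    for sentence in sentences:
        print(sentence)
```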

# Model Details
This model implements the graph shown below, with brief descriptions of each step following; illustrative code sketches of the main steps appear after the list.

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1678575121699-62d34c813eebd640a4f97587.png)

1. **Encoding**
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Punctuation**
The encoded sequence is then fed into a classification network to predict punctuation tokens.
Punctuation is predicted once per subword, to allow acronyms to be properly punctuated.
An indirect benefit of per-subword prediction is that it allows the model to run in a graph generalized for continuous-script languages, e.g., Chinese.

5. **Sentence boundary detection**
For sentence boundary detection, we condition the model on punctuation via embeddings.
Each punctuation prediction is used to select an embedding for that token, which is concatenated to the encoded representation.
The SBD head analyzes both the encoding of the un-punctuated sequence and the punctuation predictions, and predicts which tokens are sentence boundaries.

7. **Shift and concat sentence boundaries**
In English, the first character of each sentence should be upper-cased.
Thus, we should feed the sentence boundary information to the true-case classification network.
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
Concatenating this with the encoded text, each time step then carries whether it is the first word of a sentence as predicted by the SBD head.

8. **True-case prediction**
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the longest possible subword, and the extra predictions are ignored.)
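
Taking the 512-dimensional encoder output from step 1 as given, the conditioning in steps 2 and 5 can be pictured with a small PyTorch sketch. This is not the model's actual NeMo code; the head sizes and the 4-dimensional punctuation embedding are illustrative assumptions:

```python
import torch
import torch.nn as nn

NUM_PUNCT = 5      # NULL, ACRONYM, ".", ",", "?"
D_MODEL = 512      # encoder model dimension
D_PUNCT_EMB = 4    # assumed small embedding of punctuation predictions

punct_head = nn.Linear(D_MODEL, NUM_PUNCT)        # step 2: one prediction per subword
punct_emb = nn.Embedding(NUM_PUNCT, D_PUNCT_EMB)  # step 5: embed predicted tokens
sbd_head = nn.Sequential(                         # step 5: sentence boundary head
    nn.Linear(D_MODEL + D_PUNCT_EMB, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

encoded = torch.randn(1, 16, D_MODEL)             # [batch, time, d_model] from the encoder

# Punctuation is predicted once per subword.
punct_ids = punct_head(encoded).argmax(dim=-1)    # [1, 16]

# The SBD head sees both the un-punctuated encoding and the punctuation
# predictions, concatenated along the feature dimension.
joint = torch.cat([encoded, punct_emb(punct_ids)], dim=-1)  # [1, 16, 516]
boundary_logits = sbd_head(joint).squeeze(-1)     # [1, 16], one logit per token
```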
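
Steps 7 and 8 can be sketched the same way; the shift-right is just a roll of the binary boundary decisions, and the head sizes and maximum subword length are again assumptions:

```python
import torch
import torch.nn as nn

D_MODEL = 512
MAX_SUBWORD_LEN = 16   # assumed longest possible subword ("N" in the text)

truecase_head = nn.Sequential(        # feed-forward, so no sequence context
    nn.Linear(D_MODEL + 1, 128),
    nn.ReLU(),
    nn.Linear(128, MAX_SUBWORD_LEN),  # one upper-case logit per character slot
)

encoded = torch.randn(1, 16, D_MODEL)
boundaries = (torch.rand(1, 16) > 0.8).float()  # binary SBD decisions

# Step 7: shift right by one, so token N is flagged as a sentence start
# iff token N-1 was predicted to be a boundary.
first_word = torch.roll(boundaries, shifts=1, dims=1)
first_word[:, 0] = 1.0  # the first token always starts a sentence

# Step 8: concatenate the flag to the encoding and predict per-character casing.
joint = torch.cat([encoded, first_word.unsqueeze(-1)], dim=-1)  # [1, 16, 513]
upper_logits = truecase_head(joint)  # [1, 16, MAX_SUBWORD_LEN]
# Predictions beyond each subword's actual length are ignored.
```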

This model predicts the following set of punctuation tokens:

| Token | Description |
| ---: | :---------- |
| NULL | Predict no punctuation |
| ACRONYM | Every character in this subword ends with a period |
| . | Latin full stop |
| , | Latin comma |
| ? | Latin question mark |
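
As a toy illustration (not the model's actual decoding code) of how these token predictions apply to a subword:

```python
def apply_punct(subword: str, token: str) -> str:
    """Apply a predicted punctuation token to a subword."""
    if token == "NULL":
        return subword                             # predict no punctuation
    if token == "ACRONYM":
        return "".join(c + "." for c in subword)   # every character ends with a period
    return subword + token                         # ".", ",", or "?"

print(apply_punct("us", "ACRONYM"))  # "u.s." -- true-casing can then produce "U.S."
print(apply_punct("friend", ","))    # "friend,"
```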

This model was trained in the NeMo framework, using News Crawl data from WMT.
Approximately 10M lines were used from the years 2021 and 2012.
The latter was used to attempt to reduce bias: annual news is typically dominated by a few topics, and 2021 is dominated by COVID discussions.

# Limitations
## Domain
This model was trained on news data, and may not perform well on conversational or informal data.

## Noisy Training Data
The training data was noisy, and no manual cleaning was utilized.

Acronyms and abbreviations are especially noisy; the tables below show how many variations of the same token appear in the training data.

| Token | Count |
| ---: | :---- |
| Mr | 115232 |
| Mr. | 108212 |

| Token | Count |
| ---: | :---- |
| U.S. | 85324 |
| US | 37332 |
| U.S | 354 |
| U.s | 108 |
| u.S. | 65 |
| u.s | 2 |

Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.

# Evaluation
In these metrics, keep in mind that