---
license: apache-2.0
language:
- en
tags:
- punctuation
- true casing
- sentence boundary detection
- token classification
- nlp
---

# Model Overview
This model accepts as input lower-cased, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).

In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.

# Usage
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

```bash
pip install punctuators
```

Running the following script should load this model and run some texts:
<details open>

  <summary>Example Usage</summary>

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model.
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
    "i live in the us where george hw bush was once president"
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print(f"Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()

```

</details>

<details open>

  <summary>Expected Output</summary>

```text

```

Note that in this greeting the model treats "friend" as a proper noun ("Friend"); it consistently upper-cases tokens in similar contexts.

</details>
    
# Model Details

This model implements the graph shown below, with brief descriptions for each step following.

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1678575121699-62d34c813eebd640a4f97587.png)


1. **Encoding**:
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Punctuation**:
The encoded sequence is then fed into a classification network to predict punctuation tokens. 
Punctuation is predicted once per subword to allow acronyms to be properly punctuated.
An indirect benefit of per-subword prediction is that the model can run in a graph generalized for continuous-script languages, e.g., Chinese.

3. **Sentence boundary detection**:
For sentence boundary detection, we condition the model on punctuation via embeddings.
Each punctuation prediction is used to select an embedding for that token, which is concatenated to the encoded representation.
The SBD head analyzes both the encoding of the un-punctuated sequence and the punctuation predictions, and predicts which tokens are sentence boundaries.

4. **Shift and concat sentence boundaries**:
In English, the first character of each sentence should be upper-cased.
Thus, we should feed the sentence boundary information to the true-case classification network.
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
Concatenating this with the encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.

5. **True-case prediction**:
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the longest possible subword, and the extra predictions are ignored).
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
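To make steps 2 through 5 concrete, here is a minimal NumPy sketch of the conditioning logic. All shapes, weights, and predictions below are random placeholders, not the model's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, P, E = 8, 16, 5, 4  # subwords, encoder dim, punctuation classes, punct-embedding dim

encoded = rng.standard_normal((T, D))    # step 1: encoder output, one vector per subword
punct_ids = rng.integers(0, P, size=T)   # step 2: argmax of the punctuation head

# Step 3: embed each punctuation prediction and concatenate it to the encoding
punct_embeddings = rng.standard_normal((P, E))
sbd_input = np.concatenate([encoded, punct_embeddings[punct_ids]], axis=-1)  # (T, D + E)

# Stand-in for the SBD head's binary decisions (1 = this token ends a sentence)
sbd = rng.integers(0, 2, size=T)

# Step 4: shift right by one: if token N-1 is a boundary, token N starts a sentence.
# The first token of the sequence always starts a sentence.
first_word = np.concatenate([[1], sbd[:-1]])

# Step 5: concatenate the shifted boundaries for the true-case head
tc_input = np.concatenate([encoded, first_word[:, None].astype(float)], axis=-1)

# Per-character casing decisions are then applied to each subword, e.g.:
def apply_case(subword: str, decisions: list) -> str:
    return "".join(c.upper() if up else c for c, up in zip(subword, decisions))

print(apply_case("nato", [1, 1, 1, 1]))                   # NATO
print(apply_case("mcdonald", [1, 0, 1, 0, 0, 0, 0, 0]))   # McDonald
```

The shift in step 4 is the whole trick: it turns "this token ends a sentence" into "the next token begins a sentence," which is the signal the context-free true-case head actually needs.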


## Punctuation Tokens
This model predicts the following set of punctuation tokens:

| Token  | Description |
| ---: | :---------- |
| NULL    | Predict no punctuation |
| ACRONYM    | Every character in this subword ends with a period |
| .    | Latin full stop |
| ,    | Latin comma | 
| ?    | Latin question mark |
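As a rough illustration of how these tokens could be applied during decoding (the actual decoding logic lives in the `punctuators` package; this is only a sketch of the rules the table describes):

```python
def apply_punct(subword: str, token: str) -> str:
    """Apply one predicted punctuation token to a subword (illustrative only)."""
    if token == "NULL":
        return subword
    if token == "ACRONYM":
        # Every character in this subword ends with a period
        return "".join(ch + "." for ch in subword)
    return subword + token

print(apply_punct("us", "ACRONYM"))  # u.s. -> later true-cased to "U.S."
print(apply_punct("going", "?"))     # going?
print(apply_punct("storm", "NULL"))  # storm
```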





# Training Details
This model was trained in the NeMo framework.

## Training Data
This model was trained with News Crawl data from WMT.

Approximately 10M lines were used, drawn from the years 2021 and 2012.
The 2012 data was included to reduce topical bias: annual news is typically dominated by a few topics, and the 2021 data is dominated by COVID discussions.

# Limitations
## Domain
This model was trained on news data, and may not perform well on conversational or informal data.

## Noisy Training Data
The training data was noisy, and no manual cleaning was utilized.

Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.

| Token  | Count |
| ---: | :---------- |
| Mr    | 115232 |
| Mr.    | 108212 |

| Token  | Count |
| -: | :- |
| U.S.    | 85324 |
| US    | 37332 |
| U.S | 354 |
| U.s | 108 |
| u.S. | 65 |
| u.s | 2 |

Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.


# Evaluation
In these metrics, keep in mind that
1. The data is noisy
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% accurate for most languages.
3. Punctuation can be subjective. E.g.,
   
   `Hello Frank, how's it going?`
   
   or

   `Hello Frank. How's it going?`

   When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.

## Test Data and Example Generation
Each test example was generated using the following procedure:

1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation
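The procedure above can be sketched as follows (function and variable names are illustrative, not the actual evaluation code):

```python
import random
import string

def make_example(sentences: list, k: int = 10, seed: int = 0) -> str:
    """Generate one test example: concatenate k sentences, lower-case, strip punctuation."""
    rng = random.Random(seed)
    joined = " ".join(rng.sample(sentences, k))  # 1. concatenate k random sentences
    lowered = joined.lower()                     # 2. lower-case
    # 3. remove all punctuation
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(make_example(["Hello, Frank.", "How's it going?"], k=2))
```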

The data is a held-out portion of News Crawl, which has been deduplicated. 
2,000 lines of data were used, generating 2,000 unique examples of 10 sentences each.

Examples longer than the model's maximum length (256) were truncated. 
The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets in total.