1-800-BAD-CODE
committed on
Commit
·
e7a5edc
1
Parent(s):
e2feff2
Update README.md
README.md
CHANGED
@@ -15,6 +15,7 @@ This model accepts as input lower-cased, unpunctuated English text and performs
|
|
15 |
|
16 |
In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.
|
17 |
|
|
|
18 |
# Usage
|
19 |
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
|
20 |
|
@@ -22,6 +23,7 @@ The easy way to use this model is to install [punctuators](https://github.com/1-
|
|
22 |
pip install punctuators
|
23 |
```
|
24 |
|
|
|
25 |
Running the following script should load this model and run it on some example texts:
|
26 |
<details open>
|
27 |
|
@@ -99,6 +101,10 @@ Since true-casing should be done on a per-character basis, the classification ne
|
|
99 |
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
|
100 |
|
101 |
|
|
|
|
|
|
|
|
|
102 |
## Punctuation Tokens
|
103 |
This model predicts the following set of punctuation tokens:
|
104 |
|
@@ -133,7 +139,7 @@ The training data was noisy, and no manual cleaning was utilized.
|
|
133 |
Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.
|
134 |
|
135 |
| Token | Count |
|
136 |
-
|
|
137 |
| Mr | 115232 |
|
138 |
| Mr. | 108212 |
|
139 |
|
@@ -153,7 +159,7 @@ Thus, the model's acronym and abbreviation predictions may be a bit unpredictabl
|
|
153 |
In these metrics, keep in mind that
|
154 |
1. The data is noisy
|
155 |
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
|
156 |
-
When conditioning on reference punctuation, true-casing and SBD
|
157 |
4. Punctuation can be subjective. E.g.,
|
158 |
|
159 |
`Hello Frank, how's it going?`
|
|
|
15 |
|
16 |
In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.
|
17 |
|
18 |
+
|
19 |
# Usage
|
20 |
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
|
21 |
|
|
|
23 |
pip install punctuators
|
24 |
```
|
25 |
|
26 |
+
|
27 |
Running the following script should load this model and run it on some example texts:
|
28 |
<details open>
|
29 |
|
|
|
101 |
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
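To illustrate the per-character scheme, here is a minimal sketch of how a set of per-character upper-casing decisions could be applied to a lower-cased word. The function name `apply_truecase` and the boolean-mask representation are assumptions made for illustration; the model's actual output format is not shown in this diff.

```python
from typing import Sequence


def apply_truecase(word: str, upper_mask: Sequence[bool]) -> str:
    """Upper-case each character whose mask entry is True.

    `upper_mask` is a hypothetical stand-in for the model's multi-label
    true-casing predictions: one independent yes/no decision per character.
    """
    if len(word) != len(upper_mask):
        raise ValueError("mask must have exactly one entry per character")
    return "".join(ch.upper() if up else ch for ch, up in zip(word, upper_mask))


# Acronym: every character is predicted as upper-case.
print(apply_truecase("nato", [True, True, True, True]))

# Bi-capitalized word: only the characters at positions 0 and 3 are upper-cased.
print(apply_truecase("macdonald", [True, False, False, True] + [False] * 5))
```

Because each character gets its own decision, the same mechanism covers sentence-initial capitals, acronyms, and mixed-case names without a separate class for each pattern.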
|
102 |
|
103 |
|
104 |
+
The model's maximum length is 256 subtokens. However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
|
105 |
+
as described above will transparently predict on overlapping subsegments of longer input texts and fuse the results before returning output,
|
106 |
+
allowing inputs to be arbitrarily long.
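The overlapping-subsegment idea can be sketched as follows. This is not the `punctuators` implementation. The function names, the overlap of 16 subtokens, and the fusing rule are all assumptions chosen for illustration. The fusing rule keeps, for each position, the prediction from the window in which that position is farthest from a window edge.

```python
from typing import Callable, List, Sequence, Tuple


def make_windows(n_tokens: int, max_len: int = 256, overlap: int = 16) -> List[Tuple[int, int]]:
    """Return (start, stop) index pairs for overlapping windows over n_tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if n_tokens == 0:
        return []
    stride = max_len - overlap
    windows = []
    start = 0
    while True:
        stop = min(start + max_len, n_tokens)
        windows.append((start, stop))
        if stop >= n_tokens:
            break
        start += stride
    return windows


def predict_long(
    tokens: Sequence[int],
    predict_fn: Callable[[Sequence[int]], list],
    max_len: int = 256,
    overlap: int = 16,
) -> list:
    """Run predict_fn on each window and fuse per-token predictions.

    predict_fn is a hypothetical model call that returns one prediction
    per input token. Where windows overlap, the prediction made with the
    most surrounding context (largest distance to a window edge) wins.
    """
    fused = [None] * len(tokens)
    best_margin = [-1] * len(tokens)
    for start, stop in make_windows(len(tokens), max_len, overlap):
        preds = predict_fn(tokens[start:stop])
        last = stop - start - 1
        for offset, pred in enumerate(preds):
            margin = min(offset, last - offset)  # distance to nearest window edge
            pos = start + offset
            if margin > best_margin[pos]:
                best_margin[pos] = margin
                fused[pos] = pred
    return fused


# Usage with a dummy per-token "model" that doubles each token id.
dummy_tokens = list(range(600))
fused = predict_long(dummy_tokens, lambda window: [t * 2 for t in window])
print(len(fused))
```

A larger overlap gives tokens near window boundaries more context at the cost of more model calls. The right trade-off depends on the model and is not stated in the source.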
|
107 |
+
|
108 |
## Punctuation Tokens
|
109 |
This model predicts the following set of punctuation tokens:
|
110 |
|
|
|
139 |
Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.
|
140 |
|
141 |
| Token | Count |
|
142 |
+
| -: | :- |
|
143 |
| Mr | 115232 |
|
144 |
| Mr. | 108212 |
|
145 |
|
|
|
159 |
In these metrics, keep in mind that
|
160 |
1. The data is noisy
|
161 |
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
|
162 |
+
When conditioning on reference punctuation, true-casing and SBD metrics are much higher w.r.t. the reference targets.
|
163 |
4. Punctuation can be subjective. E.g.,
|
164 |
|
165 |
`Hello Frank, how's it going?`
|