1-800-BAD-CODE committed on
Commit
b688cde
1 Parent(s): 2186d4b

Update README.md

Files changed (1)
  1. README.md +66 -67
README.md CHANGED
@@ -67,71 +67,6 @@ This model segments a long, punctuated text into one or more constituent sentenc
67
  A key feature is that the model is multi-lingual and language-agnostic at inference time.
68
  Therefore, language tags do not need to be used and a single batch can contain multiple languages.
69
 
70
- ## Architecture
71
- This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.
72
-
73
 - Given that this is a relatively easy NLP task, the model contains \~9M parameters (\~8.2M of which are embeddings).
74
- This makes the model very fast and cheap at inference time, as SBD should be.
75
-
76
- The BERT encoder is based on the following configuration:
77
-
78
- * 8 heads
79
- * 4 layers
80
- * 128 hidden dim
81
- * 512 intermediate/ff dim
82
- * 64000 embeddings/vocab tokens
83
-
84
- ## Training
85
- This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.
86
-
87
 - The model was trained for several hundred thousand steps on \~1M lines of text per language (\~49M lines total), with a global batch size of 256 examples. Batches were multilingual and generated by randomly sampling each language.
88
-
89
- ### Training Data
90
- This model was trained on `OpenSubtitles` data.
91
-
92
 - Although this corpus is very noisy, it is one of the few large-scale text corpora that have been manually segmented.
93
-
94
- Automatically-segmented corpora are undesirable for at least two reasons:
95
-
96
- 1. The data-driven model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
97
- 2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).
98
-
99
- Heuristics were used to attempt to clean the data before training.
100
- Some examples of the cleaning are:
101
-
102
 - * Drop sentences that start with a lower-case letter. Assume these lines are erroneous.
103
 - * For inputs that do not end with a full stop, append the default full stop for that language. Assume that full stops are often omitted from single declarative sentences in subtitles.
104
- * Drop inputs that have more than 20 words (or 32 chars, for continuous-script languages). Assume these lines contain more than one sentence, and therefore we cannot create reliable targets.
105
- * Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
106
- * Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
107
-
108
- ### Training Example Generation
109
- To create examples for the model, we
110
-
111
- 1. Assume each input line is exactly one sentence
112
- 2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets
113
-
114
- For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
115
 - The number of sentences to use was chosen randomly and uniformly, so each example had, on average, 4 sentence boundaries.
116
-
117
- This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long.
118
- If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
119
-
120
- 50% of input texts were lower-cased for both the tokenizer and classification models.
121
 - This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing.
122
- Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.
123
-
124
 - ### Language-Specific Rules
125
- The training data was pre-processed for language-specific punctuation and spacing rules.
126
-
127
- The following guidelines were used during training. If inference inputs differ, the model may perform poorly.
128
-
129
- * All spaces were removed from continuous-script languages (Chinese, Japanese).
130
 - * Chinese/Japanese: These languages use full-width periods "。", question marks "?", and commas ",".
131
- * Hindi/Bengali: These languages use the danda "।" as a full-stop, not ".".
132
- * Arabic: Uses reverse question marks "؟", not "?".
133
-
134
-
135
  # Model Inputs and Outputs
136
  The model inputs should be **punctuated** texts.
137
 
@@ -143,7 +78,6 @@ Optimal handling of longer sequences would require some inference-time logic (wr
143
 
144
  For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
145
 
146
-
147
  # Example Usage
148
 
149
  This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.
@@ -170,7 +104,7 @@ tokenizer: SentencePieceProcessor = SentencePieceProcessor(spe_path)
170
  ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
171
  ```
172
 
173
- Next, let's define a simple function that runs inference on a single sentence and prints the predictions:
174
 
175
  ```python
176
  def run_infer(text: str, threshold: float = 0.5):
@@ -270,6 +204,71 @@ Input: szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi r
270
  Sentence 2: ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
271
  ```
272
 
273
  # Limitations and known issues
274
 
275
  ## Noisy training data
 
67
  A key feature is that the model is multi-lingual and language-agnostic at inference time.
68
  Therefore, language tags do not need to be used and a single batch can contain multiple languages.
69
 
70
  # Model Inputs and Outputs
71
  The model inputs should be **punctuated** texts.
72
 
 
78
 
79
  For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
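
For example (with made-up numbers), the per-subword probabilities can be thresholded and the subword sequence split at the predicted boundaries; the `run_infer` example below applies the same kind of threshold with the real model:

```python
# Hypothetical per-subword probabilities for an 8-subword input (values made up).
tokens = ["▁hello", "▁world", ".", "▁how", "▁are", "▁you", "?", "▁ok"]
probs  = [0.01, 0.02, 0.97, 0.03, 0.01, 0.02, 0.95, 0.10]
threshold = 0.5

boundaries = [i for i, p in enumerate(probs) if p >= threshold]  # subwords ending a sentence

sentences, start = [], 0
for b in boundaries:
    sentences.append(tokens[start:b + 1])
    start = b + 1
if start < len(tokens):
    sentences.append(tokens[start:])  # any trailing tokens after the last boundary

print(sentences)
# [['▁hello', '▁world', '.'], ['▁how', '▁are', '▁you', '?'], ['▁ok']]
```
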
80
 
 
81
  # Example Usage
82
 
83
  This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.
 
104
  ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
105
  ```
106
 
107
+ Next, let's define a simple function that runs inference on one text input and prints the predictions:
108
 
109
  ```python
110
  def run_infer(text: str, threshold: float = 0.5):
 
204
  Sentence 2: ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
205
  ```
206
 
207
+ # Model Architecture
208
+ This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.
209
+
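As a conceptual illustration only (not the actual NeMo implementation), the pipeline can be sketched roughly as follows; the class and argument names here are made up, and positional embeddings and other details are omitted:

```python
import torch
import torch.nn as nn

# Rough conceptual sketch of the architecture described above; the real model is
# defined in the NeMo fork. Positional embeddings etc. are omitted for brevity.
class SentenceBoundaryClassifier(nn.Module):
    def __init__(self, vocab_size=64000, hidden=128, ff=512, heads=8, layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=ff, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.classifier = nn.Linear(hidden, 1)  # one logit per subword

    def forward(self, subword_ids: torch.Tensor) -> torch.Tensor:
        # subword_ids: (batch, seq_len) SentencePiece token IDs
        hidden_states = self.encoder(self.embedding(subword_ids))
        # (batch, seq_len) probability that each subword is a sentence boundary
        return torch.sigmoid(self.classifier(hidden_states)).squeeze(-1)
```
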
210
 + Given that this is a relatively easy NLP task, the model contains \~9M parameters (\~8.2M of which are embeddings).
211
+ This makes the model very fast and cheap at inference time, as SBD should be.
212
+
213
 + The BERT encoder is based on the following configuration (a rough parameter-count check follows the list):
214
+
215
+ * 8 heads
216
+ * 4 layers
217
+ * 128 hidden dim
218
+ * 512 intermediate/ff dim
219
+ * 64000 embeddings/vocab tokens
220
+
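As a rough sanity check on the parameter counts quoted above (the exact total depends on positional embeddings, layer norms, and biases, which are glossed over here):

```python
# Back-of-the-envelope parameter count from the configuration above.
vocab, hidden, ff, layers = 64000, 128, 512, 4

embeddings = vocab * hidden                            # 8,192,000 (~8.2M)
attn_per_layer = 4 * (hidden * hidden + hidden)        # Q, K, V, output projections
ffn_per_layer = 2 * hidden * ff + ff + hidden          # two linear layers + biases
encoder = layers * (attn_per_layer + ffn_per_layer)    # ~0.8M

print(f"embeddings: {embeddings / 1e6:.1f}M, total: {(embeddings + encoder) / 1e6:.1f}M")
# embeddings: 8.2M, total: 9.0M
```
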
221
+ # Model Training
222
+ This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.
223
+
224
 + The model was trained for several hundred thousand steps on \~1M lines of text per language (\~49M lines total), with a global batch size of 256 examples. Batches were multilingual and generated by randomly sampling each language.
225
+
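A minimal sketch of this kind of multilingual batch sampling (illustrative only; `corpora` and the function name are hypothetical, and the real pipeline lives in the NeMo fork):

```python
import random

def sample_batch(corpora: dict, batch_size: int = 256) -> list:
    """Build one multilingual batch by sampling a language per example.

    `corpora` maps a language code to its list of cleaned text lines.
    """
    languages = list(corpora)
    batch = []
    for _ in range(batch_size):
        lang = random.choice(languages)              # sample a language at random
        batch.append(random.choice(corpora[lang]))   # then one line from that language
    return batch
```
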
226
+ ## Training Data
227
+ This model was trained on `OpenSubtitles` data.
228
+
229
 + Although this corpus is very noisy, it is one of the few large-scale text corpora that have been manually segmented.
230
+
231
+ Automatically-segmented corpora are undesirable for at least two reasons:
232
+
233
+ 1. The data-driven model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
234
+ 2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).
235
+
236
+ Heuristics were used to attempt to clean the data before training.
237
 + Some examples of the cleaning are listed below; a rough sketch of such filters follows the list:
238
+
239
 + * Drop sentences that start with a lower-case letter. Assume these lines are erroneous.
240
 + * For inputs that do not end with a full stop, append the default full stop for that language. Assume that full stops are often omitted from single declarative sentences in subtitles.
241
+ * Drop inputs that have more than 20 words (or 32 chars, for continuous-script languages). Assume these lines contain more than one sentence, and therefore we cannot create reliable targets.
242
+ * Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
243
+ * Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
244
+
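A rough sketch of such filters, assuming per-language, line-by-line processing (the helper below is hypothetical, not the code actually used):

```python
MAX_WORDS = 20                     # word limit for space-delimited languages
MAX_CHARS = 32                     # character limit for continuous-script languages
CONTINUOUS_SCRIPT = {"zh", "ja"}
DEFAULT_FULL_STOP = {"zh": "。", "ja": "。", "hi": "।", "bn": "।"}  # "." otherwise

def clean_line(line: str, lang: str):
    """Return a cleaned line, or None if it should be dropped (hypothetical sketch)."""
    line = line.strip()
    if not line or all(not c.isalnum() for c in line):
        return None                                   # junk: empty or punctuation-only
    if line[0].islower():
        return None                                   # likely an erroneous, partial line
    if lang in CONTINUOUS_SCRIPT:
        if len(line) > MAX_CHARS:
            return None                               # probably more than one sentence
    elif len(line.split()) > MAX_WORDS:
        return None
    full_stop = DEFAULT_FULL_STOP.get(lang, ".")
    if not line.endswith((full_stop, "?", "!", "?", "؟")):
        line += full_stop                             # assume a missing final full stop
    return line                                       # punctuation normalization omitted here
```
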
245
+ ### Training Example Generation
246
+ To create examples for the model, we
247
+
248
+ 1. Assume each input line is exactly one sentence
249
+ 2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets
250
+
251
+ For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
252
 + The number of sentences to use was chosen randomly and uniformly, so each example had, on average, 4 sentence boundaries.
253
+
254
+ This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long.
255
+ If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
256
+
257
+ 50% of input texts were lower-cased for both the tokenizer and classification models.
258
 + This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing.
259
+ Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.
260
+
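Concretely, example construction along these lines might look like the following sketch (illustrative only; the actual implementation is in the NeMo fork linked above):

```python
import random

MAX_SENTENCES = 9
MAX_SUBWORDS = 256

def make_example(sentences: list, tokenizer):
    """Concatenate 1-9 sentences into one training example (illustrative sketch).

    Returns (subword_ids, targets), where targets[i] == 1 iff subword i is the
    final subword of a non-final sentence, i.e. a sentence boundary.
    """
    num_sentences = random.randint(1, MAX_SENTENCES)
    chosen = random.sample(sentences, num_sentences)
    if random.random() < 0.5:
        chosen = [s.lower() for s in chosen]          # 50% lower-casing augmentation

    subword_ids, targets = [], []
    for i, sentence in enumerate(chosen):
        ids = tokenizer.encode(sentence)              # SentencePiece subword IDs
        subword_ids.extend(ids)
        targets.extend([0] * len(ids))
        if i < num_sentences - 1:
            targets[-1] = 1                           # boundary at the concatenation point
    return subword_ids[:MAX_SUBWORDS], targets[:MAX_SUBWORDS]  # truncate to 256 subwords
```
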
261
 + ### Language-Specific Rules
262
+ The training data was pre-processed for language-specific punctuation and spacing rules.
263
+
264
 + The following guidelines were used during training; if inference inputs differ, the model may perform poorly. A minimal normalization sketch follows the list.
265
+
266
+ * All spaces were removed from continuous-script languages (Chinese, Japanese).
267
 + * Chinese/Japanese: These languages use full-width periods "。", question marks "?", and commas ",".
268
+ * Hindi/Bengali: These languages use the danda "।" as a full-stop, not ".".
269
+ * Arabic: Uses reverse question marks "؟", not "?".
270
+
271
+
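To check whether your own inputs match these conventions, a minimal (and deliberately naive) normalization sketch could look like the following; it covers only the rules listed above, and the helper name is hypothetical:

```python
def normalize_input(text: str, lang: str) -> str:
    """Naively map inputs toward the training-time conventions above (hypothetical helper)."""
    if lang in {"zh", "ja"}:
        text = text.replace(" ", "")                                  # no spaces in continuous scripts
        text = text.replace(".", "。").replace("?", "?").replace(",", ",")
    elif lang in {"hi", "bn"}:
        text = text.replace(".", "।")                                 # danda as the full stop
    elif lang == "ar":
        text = text.replace("?", "؟")                                 # Arabic (reversed) question mark
    return text
```
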
272
  # Limitations and known issues
273
 
274
  ## Noisy training data