1-800-BAD-CODE committed
Commit 23dc508 (1 parent: c224a7e)

make model card simpler

Files changed (1):
  1. README.md +22 -0
README.md CHANGED
@@ -69,8 +69,17 @@ and detect sentence boundaries (full stops) in 47 languages.
 
 # Usage
 
+If you want to just play with the model, the widget on this page will suffice. To use the model offline,
+the following snippets show how to use the model both with a wrapper (that I wrote, available from `PyPI`)
+and manually (using the ONNX and SentencePiece models in this repo).
+
 ## Usage via `punctuators` package
 
+
+<details>
+
+<summary>Click to see usage with wrappers</summary>
+
 The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
 
 ```bash
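This hunk ends at the opening of the install snippet. For orientation, wrapper-based inference generally follows the pattern sketched below; the class name `PunctCapSegModelOnnx` and the pretrained name `pcs_47lang` are assumptions based on the `punctuators` project, not taken from this commit.

```python
# Hedged sketch of wrapper-based usage; the class and model names are
# assumptions from the punctuators project, not verified against this commit.
from typing import List

from punctuators.models import PunctCapSegModelOnnx

# Downloads and caches the ONNX + SentencePiece artifacts on first use.
m: PunctCapSegModelOnnx = PunctCapSegModelOnnx.from_pretrained("pcs_47lang")

input_texts: List[str] = [
    "hola mundo cómo estás estamos bien",
    "hello friend how's it going it's snowing outside",
]

# Each input text maps to a list of punctuated, true-cased sentences.
results: List[List[str]] = m.infer(input_texts)
for text, sentences in zip(input_texts, results):
    print(f"Input:     {text}")
    print(f"Sentences: {sentences}")
```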
 
@@ -180,6 +189,7 @@ Outputs:
 
 </details>
 
+</details>
 
 ## Manual Usage
 If you want to use the ONNX and SP models without wrappers, see the following example.
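The example referenced here lies outside the hunk. A minimal sketch of what wrapper-free inference involves is below, assuming generic file names (`sp.model`, `model.onnx`) and XLM-Roberta-style special tokens; none of these specifics are taken from this repo.

```python
# Hedged sketch of wrapper-free inference. File names, special-token IDs,
# and graph input/output layout are assumptions, not taken from this commit.
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp.model")  # assumed filename
session = ort.InferenceSession("model.onnx")            # assumed filename

text = "hola mundo cómo estás estamos bien"
ids = sp.EncodeAsIds(text)

# XLM-Roberta-style graphs typically expect BOS (0) and EOS (2) plus a batch dim.
input_ids = np.array([[0] + ids + [2]], dtype=np.int64)

# Inspect the graph for the real input/output names rather than guessing them.
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: input_ids})
for meta, arr in zip(session.get_outputs(), outputs):
    print(meta.name, arr.shape)
```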
 
@@ -305,11 +315,16 @@ Outputs:
 
 
 # Model Architecture
+
 This model implements the following graph, which allows punctuation, true-casing, and fullstop prediction
 in every language without language-specific behavior:
 
 ![graph.png](https://s3.amazonaws.com/moonup/production/uploads/62d34c813eebd640a4f97587/jpr-pMdv6iHxnjbN4QNt0.png)
 
+<details>
+
+<summary>Click to see graph explanations</summary>
+
 We start by tokenizing the text and encoding it with XLM-Roberta, which is the pre-trained portion of this graph.
 
 Then we predict punctuation before and after every subtoken.
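To make the per-subtoken scheme concrete, here is a self-contained toy (hand-written predictions, no model) showing how pre/post punctuation, per-character casing, and full-stop flags would be applied to subtokens. The label values are illustrative only, not the model's actual inventory.

```python
# Toy illustration only: hand-written per-subtoken "predictions" of the kind
# described above (pre/post punctuation, per-character casing, full stops),
# applied to rebuild punctuated, true-cased, segmented text. No model is run.
subtokens  = ["hola", "mundo", "cómo", "estás", "estamos", "bien"]
pre_punct  = ["",  "",  "¿", "",  "",  ""]   # punctuation before each subtoken
post_punct = ["",  ".", "",  "?", "",  "."]  # punctuation after each subtoken
upcase     = [[0], [],  [0], [],  [0], []]   # char indices to upper-case (multi-label)
fullstop   = [False, True, False, True, False, True]  # sentence boundary flags

sentences, current = [], []
for tok, pre, post, ups, stop in zip(subtokens, pre_punct, post_punct, upcase, fullstop):
    chars = list(tok)
    for i in ups:
        chars[i] = chars[i].upper()  # per-character casing covers words like "NATO"
    current.append(pre + "".join(chars) + post)
    if stop:  # a full stop closes the current sentence
        sentences.append(" ".join(current))
        current = []

print(sentences)  # ['Hola mundo.', '¿Cómo estás?', 'Estamos bien.']
```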
 
@@ -330,8 +345,14 @@ modeled as a multi-label problem. This allows for upper-casing arbitrary charact
 
 Applying all these predictions to the input text, we can punctuate, true-case, and split sentences in any language.
 
+</details>
+
 ## Tokenizer
 
+<details>
+
+<summary>Click to see how the XLM-Roberta tokenizer was un-hacked</summary>
+
 Instead of the hacky wrapper used by FairSeq and strangely ported (not fixed) by HuggingFace, the `xlm-roberta` SentencePiece model was adjusted to correctly encode
 the text. Per HF's comments,
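Both the HF comment and the adjustment snippet are truncated by this hunk (only the final `with open(...)` write survives in the next hunk header). The protobuf round-trip such an adjustment relies on is sketched below; the piece edited here is a placeholder, not the README's actual fix.

```python
# Hedged sketch of editing a SentencePiece model in place via its protobuf;
# the exact pieces/IDs adjusted in the full README may differ from this example.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("/path/to/original/sp.model", "rb") as f:  # placeholder path
    m.ParseFromString(f.read())

# Example edit only: append a user-defined piece. The README's real fix
# re-aligns piece IDs with the fairseq/XLM-R vocabulary so that no runtime
# wrapper is needed to remap them.
piece = sp_pb2.ModelProto.SentencePiece()
piece.piece = "<mask>"
piece.score = 0.0
piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED
m.pieces.append(piece)

with open("/path/to/new/sp.model", "wb") as f:  # path echoed from the hunk header
    f.write(m.SerializeToString())
```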
 
@@ -373,6 +394,7 @@ with open("/path/to/new/sp.model", "wb") as f:
 
 Now we can use just the SP model without a wrapper.
 
+</details>
 
 ## Post-Punctuation Tokens
 This model predicts the following set of punctuation tokens after each subtoken: