1-800-BAD-CODE/sentence_boundary_detection_multilang

ONNX

NeMo

sentence boundary detection

token classification

nlp

1-800-BAD-CODE committed on Jan 13, 2023

Commit

b688cde

1 Parent(s): 2186d4b

Update README.md

Files changed (1)

README.md +66 -67

@@ -67,71 +67,6 @@ This model segments a long, punctuated text into one or more constituent sentenc
 A key feature is that the model is multi-lingual and language-agnostic at inference time.
 Therefore, language tags do not need to be used and a single batch can contain multiple languages.
-## Architecture
-This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.
-Given that this is a relatively easy NLP task, the model contains \~9M parameters (\~8.2M of which are embeddings).
-This makes the model very fast and cheap at inference time, as SBD should be.
-The BERT encoder is based on the following configuration:
-* 8 heads
-* 4 layers
-* 128 hidden dim
-* 512 intermediate/ff dim
-* 64000 embeddings/vocab tokens
-## Training
-This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.
-The model was trained for several hundred thousand steps on \~1M lines of text per language (\~49M lines total) with a global batch size of 256 examples. Batches were multilingual and generated by randomly sampling each language.
-### Training Data
-This model was trained on `OpenSubtitles` data.
-Although this corpus is very noisy, it is one of the few large-scale text corpora which have been manually segmented.
-Automatically-segmented corpora are undesirable for at least two reasons:
-1. The data-driven model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
-2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).
-Heuristics were used to attempt to clean the data before training.
-Some examples of the cleaning are:
-* Drop sentences which start with a lower-case letter. Assume these lines are errorful.
-* For inputs that do not end with a full stop, append the default full stop for that language. Assume that for single-sentence declarative sentences, full stops are not important for subtitles.
-* Drop inputs that have more than 20 words (or 32 chars, for continuous-script languages). Assume these lines contain more than one sentence, and therefore we cannot create reliable targets.
-* Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
-* Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
-### Training Example Generation
-To create examples for the model, we
-1. Assume each input line is exactly one sentence
-2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets
-For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
-The number of sentences to use was chosen uniformly at random, so each example had, on average, 4 sentence boundaries.
-This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long.
-If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
-50% of input texts were lower-cased for both the tokenizer and classification models.
-This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing.
-Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.
-### Language Specific Rules
-The training data was pre-processed for language-specific punctuation and spacing rules.
-The following guidelines were used during training. If inference inputs differ, the model may perform poorly.
-* All spaces were removed from continuous-script languages (Chinese, Japanese).
-* Chinese, Japanese: These languages use full-width periods "。", question marks "？", and commas "，".
-* Hindi/Bengali: These languages use the danda "।" as a full-stop, not ".".
-* Arabic: Uses reverse question marks "؟", not "?".
 # Model Inputs and Outputs
 The model inputs should be **punctuated** texts.
@@ -143,7 +78,6 @@ Optimal handling of longer sequences would require some inference-time logic (wr
 For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
 # Example Usage
 This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.
@@ -170,7 +104,7 @@ tokenizer: SentencePieceProcessor = SentencePieceProcessor(spe_path)
 ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
 ```
-Next, let's define a simple function that runs inference on a single sentence and prints the predictions:
 ```python
 def run_infer(text: str, threshold: float = 0.5):
@@ -270,6 +204,71 @@ Input: szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi r
         Sentence 2: ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
 ```
 # Limitations and known issues
 ## Noisy training data

 A key feature is that the model is multi-lingual and language-agnostic at inference time.
 Therefore, language tags do not need to be used and a single batch can contain multiple languages.
 # Model Inputs and Outputs
 The model inputs should be **punctuated** texts.
 For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
 # Example Usage
 This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.
 ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
 ```
+Next, let's define a simple function that runs inference on one text input and prints the predictions:
 ```python
 def run_infer(text: str, threshold: float = 0.5):
         Sentence 2: ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
 ```
+# Model Architecture
+This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.
+Given that this is a relatively easy NLP task, the model contains \~9M parameters (\~8.2M of which are embeddings).
+This makes the model very fast and cheap at inference time, as SBD should be.
+The BERT encoder is based on the following configuration:
+* 8 heads
+* 4 layers
+* 128 hidden dim
+* 512 intermediate/ff dim
+* 64000 embeddings/vocab tokens
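The quoted parameter counts can be sanity-checked from the configuration above. This is only a back-of-the-envelope sketch: it assumes a standard BERT-style layer (four attention projections with biases, a two-layer feed-forward block, two layer norms) and ignores positional embeddings and the classifier head, none of which are spelled out in the README.

```python
# Back-of-the-envelope parameter count from the configuration listed above.
vocab_size = 64_000   # "64000 embeddings/vocab tokens"
hidden_dim = 128      # "128 hidden dim"
ff_dim = 512          # "512 intermediate/ff dim"
num_layers = 4        # "4 layers"

# Token embedding table: one hidden_dim-sized vector per vocabulary entry.
token_embedding_params = vocab_size * hidden_dim

# Assumption: each layer is a standard BERT layer (not confirmed by the README).
attention_params = 4 * (hidden_dim * hidden_dim + hidden_dim)  # Q, K, V, output projections
ff_params = (hidden_dim * ff_dim + ff_dim) + (ff_dim * hidden_dim + hidden_dim)
layer_norm_params = 2 * (2 * hidden_dim)  # two layer norms, each with scale and bias
encoder_params = num_layers * (attention_params + ff_params + layer_norm_params)

print(f"token embeddings: {token_embedding_params:,}")  # 8,192,000
print(f"encoder layers:   {encoder_params:,}")          # 793,088
print(f"total (approx.):  {token_embedding_params + encoder_params:,}")  # 8,985,088
```

Under these assumptions the embeddings account for roughly 8.2M of roughly 9M parameters, which matches the figures quoted above.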
+# Model Training
+This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.
+The model was trained for several hundred thousand steps on \~1M lines of text per language (\~49M lines total) with a global batch size of 256 examples. Batches were multilingual and generated by randomly sampling each language.
+## Training Data
+This model was trained on `OpenSubtitles` data.
+Although this corpus is very noisy, it is one of the few large-scale text corpora which have been manually segmented.
+Automatically-segmented corpora are undesirable for at least two reasons:
+1. The data-driven model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
+2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).
+Heuristics were used to attempt to clean the data before training.
+Some examples of the cleaning are:
+* Drop sentences which start with a lower-case letter. Assume these lines are errorful.
+* For inputs that do not end with a full stop, append the default full stop for that language. Assume that for single-sentence declarative sentences, full stops are not important for subtitles.
+* Drop inputs that have more than 20 words (or 32 chars, for continuous-script languages). Assume these lines contain more than one sentence, and therefore we cannot create reliable targets.
+* Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
+* Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
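The cleaning heuristics listed above might look roughly like the following. This is a hypothetical sketch, not the author's preprocessing code: the function name `clean_line`, the punctuation set, and the parameters are assumptions for illustration, and the Spanish inverted-punctuation exception is not handled.

```python
import re
from typing import Optional

# Assumed set of sentence-final punctuation; the real pipeline's set is not given.
PUNCT = ".?!。？！।؟"
_PUNCT_CLASS = f"[{re.escape(PUNCT)}]"


def clean_line(line: str, full_stop: str = ".", continuous_script: bool = False) -> Optional[str]:
    """Return a cleaned line, or None if the line should be dropped."""
    line = line.strip()
    # Drop junk lines: empty, or containing no letters or digits at all.
    if not any(ch.isalnum() for ch in line):
        return None
    # Drop lines that start with a lower-case letter.
    if line[0].islower():
        return None
    # Drop long lines: more than 32 chars (continuous scripts) or more than 20 words.
    too_long = len(line) > 32 if continuous_script else len(line.split()) > 20
    if too_long:
        return None
    # Collapse runs of punctuation down to the first mark.
    line = re.sub(f"({_PUNCT_CLASS}){_PUNCT_CLASS}+", r"\1", line)
    # Append the language's default full stop if the line does not end with one.
    if line[-1] not in PUNCT:
        line += full_stop
    return line
```

For example, `clean_line("Hello there")` gives `"Hello there."`, while `clean_line("hello there")` and `clean_line("!!!")` are both dropped.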
+### Training Example Generation
+To create examples for the model, we
+1. Assume each input line is exactly one sentence
+2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets
+For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
+The number of sentences to use was chosen uniformly at random, so each example had, on average, 4 sentence boundaries.
+This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long.
+If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
+50% of input texts were lower-cased for both the tokenizer and classification models.
+This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing.
+Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.
+### Language Specific Rules
+The training data was pre-processed for language-specific punctuation and spacing rules.
+The following guidelines were used during training. If inference inputs differ, the model may perform poorly.
+* All spaces were removed from continuous-script languages (Chinese, Japanese).
+* Chinese, Japanese: These languages use full-width periods "。", question marks "？", and commas "，".
+* Hindi/Bengali: These languages use the danda "।" as a full-stop, not ".".
+* Arabic: Uses reverse question marks "؟", not "?".
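If inference inputs do not already follow these conventions, a small pre-processing step could bring them in line. The helper below is a hypothetical sketch: the function name and the language codes (`zh`, `ja`, `hi`, `bn`, `ar`) are assumptions, the model itself needs no language tags, and blind character replacement will also rewrite periods inside numbers or Latin-script fragments.

```python
# Map ASCII punctuation to the full-width forms used for Chinese and Japanese.
_FULL_WIDTH = str.maketrans({".": "。", "?": "？", ",": "，"})


def normalize_for_inference(text: str, lang: str) -> str:
    """Apply the language-specific punctuation and spacing rules listed above."""
    if lang in {"zh", "ja"}:
        # Continuous-script languages: remove all whitespace, use full-width punctuation.
        text = "".join(text.split())
        text = text.translate(_FULL_WIDTH)
    elif lang in {"hi", "bn"}:
        # Hindi/Bengali: the danda is the full stop.
        text = text.replace(".", "।")
    elif lang == "ar":
        # Arabic: use the reversed question mark.
        text = text.replace("?", "؟")
    return text
```

Text in any other language is returned unchanged.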
 # Limitations and known issues
 ## Noisy training data