1-800-BAD-CODE committed on
Commit
2a1d736
1 Parent(s): dd44bba

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +98 -196
README.md CHANGED
@@ -60,6 +60,103 @@ This model accepts as input lower-cased, unpunctuated, unsegmented text in 47 la
60
  All languages are processed with the same algorithm with no need for language tags or language-specific branches in the graph.
61
  This includes continuous-script and non-continuous script languages, predicting language-specific punctuation, etc.
62
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
63
  # Model Details
64
 
65
  This model generally follows the graph shown below, with brief descriptions for each step following.
@@ -136,201 +233,6 @@ This model predicts the following set of "post" punctuation tokens:
136
  | ¿ | Inverted question mark | Spanish |
137
 
138
 
139
- # Usage
140
- This model is released in two parts:
141
-
142
- 1. The ONNX graph
143
- 2. The SentencePiece tokenizer
144
-
145
-
146
- The following code snippet will instantiate a `SimplePCSWrapper`, which will download the model files from this repository.
147
- It will then run a few example sentences in a few languages, and print the processed output.
148
-
149
-
150
- <details>
151
- <summary>Example Code</summary>
152
-
153
- ```python
154
- import logging
155
-
156
- from sentencepiece import SentencePieceProcessor
157
- import onnxruntime as ort
158
- import numpy as np
159
- from huggingface_hub import hf_hub_download
160
- from typing import List
161
-
162
-
163
- class SimplePCSWrapper:
164
- def __init__(self):
165
- spe_path = hf_hub_download(
166
- repo_id="1-800-BAD-CODE/punct_cap_seg_47_language", filename="spe_unigram_64k_lowercase_47lang.model"
167
- )
168
- onnx_path = hf_hub_download(
169
- repo_id="1-800-BAD-CODE/punct_cap_seg_47_language", filename="punct_cap_seg_47lang.onnx"
170
- )
171
- self._tokenizer: SentencePieceProcessor = SentencePieceProcessor(spe_path)
172
- self._ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
173
- # This model has max length 128. Real code should wrap inputs; example code will truncate.
174
- self._max_len = 128
175
-
176
- # Hard-coding labels, for now
177
- self._pre_labels = [
178
- "<NULL>",
179
- "¿",
180
- ]
181
-
182
- self._post_labels = [
183
- "<NULL>",
184
- ".",
185
- ",",
186
- "?",
187
- "？",
188
- "，",
189
- "。",
190
- "、",
191
- "・",
192
- "।",
193
- "؟",
194
- "،",
195
- ";",
196
- "።",
197
- "፣",
198
- "፧",
199
- ]
200
-
201
- def infer_one_text(self, text: str) -> List[str]:
202
- input_ids = self._tokenizer.EncodeAsIds(text)
203
- # Limit sequence to model's positional encoding limit. Leave 2 slots for BOS/EOS tags.
204
- if len(input_ids) > self._max_len - 2:
205
- logging.warning(f"Truncating input sequence from {len(input_ids)} to {self._max_len - 2}")
206
- input_ids = input_ids[: self._max_len - 2]
207
- # Append BOS and EOS.
208
- input_ids = [self._tokenizer.bos_id()] + input_ids + [self._tokenizer.eos_id()]
209
- # Add empty batch dimension. With real batches, sequence padding should be `self._tokenizer.pad_id()`.
210
- input_ids = [input_ids]
211
-
212
- # ORT input should be np.array
213
- input_ids = np.array(input_ids)
214
- # Get predictions.
215
- pre_preds, post_preds, cap_preds, seg_preds = self._ort_session.run(None, {"input_ids": input_ids})
216
- # Remove all batch dims. Remove BOS/EOS from time dim
217
- pre_preds = pre_preds[0, 1:-1]
218
- post_preds = post_preds[0, 1:-1]
219
- cap_preds = cap_preds[0, 1:-1]
220
- seg_preds = seg_preds[0, 1:-1]
221
-
222
- # Apply predictions to input tokens
223
- input_tokens = self._tokenizer.EncodeAsPieces(text)
224
- # Segmented sentences
225
- output_strings: List[str] = []
226
- # Current sentence, which is built until we hit a sentence boundary prediction
227
- current_chars: List[str] = []
228
- for token_idx, token in enumerate(input_tokens):
229
- # Simple SP decoding
230
- if token.startswith("▁") and current_chars:
231
- current_chars.append(" ")
232
- # Skip non-printable chars
233
- char_start = 1 if token.startswith("▁") else 0
234
- for token_char_idx, char in enumerate(token[char_start:], start=char_start):
235
- # If this is the first char in the subtoken, and we predict "pre-punct", insert it
236
- if token_char_idx == char_start and pre_preds[token_idx] != 0:
237
- current_chars.append(self._pre_labels[pre_preds[token_idx]])
238
- # If this char should be capitalized, apply upper case
239
- if cap_preds[token_idx][token_char_idx]:
240
- char = char.upper()
241
- # Append char after pre-punc and upper-casing, before post-punt
242
- current_chars.append(char)
243
- # If this is the final char in the subtoken, and we predict "post-punct", insert it
244
- if token_char_idx == len(token) - 1 and post_preds[token_idx] != 0:
245
- current_chars.append(self._post_labels[post_preds[token_idx]])
246
- # If this token is a sentence boundary, finalize the current sentence and reset
247
- if token_char_idx == len(token) - 1 and seg_preds[token_idx]:
248
- output_strings.append("".join(current_chars))
249
- current_chars = []
250
- return output_strings
251
-
252
-
253
- # Upon instantiation, will automatically download models from HF Hub
254
- pcs_wrapper: SimplePCSWrapper = SimplePCSWrapper()
255
-
256
-
257
- # Function for pretty-printing raw input and segmented output
258
- def print_processed_text(input_text: str, output_texts: List[str]):
259
- print(f"Input: {input_text}")
260
- print(f"Outputs:")
261
- for text in output_texts:
262
- print(f"\t{text}")
263
- print()
264
-
265
-
266
- # Process and print each text, one at a time
267
- texts = [
268
- "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad",
269
- "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
270
- "未來疫苗將有望覆蓋3歲以上全年齡段美國與北約軍隊已全部撤離還有鐵路公路在內的各項基建的來源都將枯竭",
271
- "በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል",
272
- "all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood",
273
- "सभी मनुष्य जन्म से मर्यादा और अधिकारों में स्वतंत्र और समान होते हैं वे तर्क और विवेक से संपन्न हैं तथा उन्हें भ्रातृत्व की भावना से परस्पर के प्रति कार्य करना चाहिए",
274
- "wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw są oni obdarzeni rozumem i sumieniem i powinni postępować wobec innych w duchu braterstwa",
275
- "tous les êtres humains naissent libres et égaux en dignité et en droits ils sont doués de raison et de conscience et doivent agir les uns envers les autres dans un esprit de fraternité",
276
- ]
277
- for text in texts:
278
- outputs = pcs_wrapper.infer_one_text(text)
279
- print_processed_text(text, outputs)
280
- ```
281
- </details>
282
-
283
-
284
- <details>
285
- <summary>Expected output</summary>
286
-
287
- ```text
288
- Input: hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad
289
- Outputs:
290
- Hola Mundo, ¿cómo estás?
291
- Estamos bajo el sol y hace mucho calor.
292
- Santa Coloma abre los huertos urbanos a las escuelas de la ciudad.
293
-
294
- Input: hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in
295
- Outputs:
296
- Hello Friend, how's it going?
297
- It's snowing outside right now.
298
- In Connecticut, a large storm is moving in.
299
-
300
- Input: 未來疫苗將有望覆蓋3歲以上全年齡段美國與北約軍隊已全部撤離還有鐵路公路在內的各項基建的來源都將枯竭
301
- Outputs:
302
- 未來,疫苗將有望覆蓋3歲以上全年齡段。
303
- 美國與北約軍隊已全部撤離。
304
- 還有鐵路公路在內的各項基建的來源都將枯竭。
305
-
306
- Input: በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል
307
- Outputs:
308
- በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር።
309
- ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል።
310
-
311
- Input: all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood
312
- Outputs:
313
- All human beings are born free and equal in dignity and rights.
314
- They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
315
-
316
- Input: सभी मनुष्य जन्म से मर्यादा और अधिकारों में स्वतंत्र और समान होते हैं वे तर्क और विवेक से संपन्न हैं तथा उन्हें भ्रातृत्व की भावना से परस्पर के प्रति कार्य करना चाहिए
317
- Outputs:
318
- सभी मनुष्य जन्म से मर्यादा और अधिकारों में स्वतंत्र और समान होते हैं।
319
- वे तर्क और विवेक से संपन्न हैं तथा उन्हें भ्रातृत्व की भावना से परस्पर के प्रति कार्य करना चाहिए।
320
-
321
- Input: wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw są oni obdarzeni rozumem i sumieniem i powinni postępować wobec innych w duchu braterstwa
322
- Outputs:
323
- Wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw.
324
- Są oni obdarzeni rozumem i sumieniem i powinni postępować wobec innych w duchu braterstwa.
325
-
326
- Input: tous les êtres humains naissent libres et égaux en dignité et en droits ils sont doués de raison et de conscience et doivent agir les uns envers les autres dans un esprit de fraternité
327
- Outputs:
328
- Tous les êtres humains naissent libres et égaux, en dignité et en droits.
329
- Ils sont doués de raison et de conscience et doivent agir les uns envers les autres.
330
- Dans un esprit de fraternité.
331
- ```
332
- </details>
333
-
334
 
335
  # Training Details
336
  This model was trained in the NeMo framework.
@@ -346,7 +248,7 @@ Languages were chosen based on whether the News Crawl corpus contained enough re
346
  This model was trained on news data, and may not perform well on conversational or informal data.
347
 
348
  This model predicts punctuation only once per subword.
349
- This implies that some acronyms, e.g., 'U.S.', cannot properly be punctuation.
350
  This concession was accepted on two grounds:
351
  1. Such acronyms are rare, especially in the context of multi-lingual models
352
  2. Punctuated acronyms are typically pronounced as individual characters, e.g., 'U.S.' vs. 'NATO'.
 
60
  All languages are processed with the same algorithm with no need for language tags or language-specific branches in the graph.
61
  This includes continuous-script and non-continuous script languages, predicting language-specific punctuation, etc.
62
 
63
+ # Usage
64
+ The easy way to use this model is to install `punctuators`:
65
+
66
+ ```bash
67
+ pip install punctuators
68
+ ```
69
+
70
+ Running the following script should load this model and run some texts:
71
+ <details open>
72
+
73
+ <summary>Example Usage</summary>
74
+
75
+ ```python
76
+ from typing import List
+
+ from punctuators.models import PunctCapSegModelONNX
77
+
78
+ # Instantiate this model
79
+ # This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
80
+ m = PunctCapSegModelONNX.from_pretrained("pcs_47lang")
81
+
82
+ # Define some input texts to punctuate
83
+ input_texts: List[str] = [
84
+ "hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad",
85
+ "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
86
+ "未來疫苗將有望覆蓋3歲以上全年齡段美國與北約軍隊已全部撤離還有鐵路公路在內的各項基建的來源都將枯竭",
87
+ "በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል",
88
+ "all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood",
89
+ "सभी मनुष्य जन्म से मर्यादा और अधिकारों में स्वतंत्र और समान होते हैं वे तर्क और विवेक से संपन्न हैं तथा उन्हें भ्रातृत्व की भावना से परस्पर के प्रति कार्य करना चाहिए",
90
+ "wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw są oni obdarzeni rozumem i sumieniem i powinni postępować wobec innych w duchu braterstwa",
91
+ "tous les êtres humains naissent libres et égaux en dignité et en droits ils sont doués de raison et de conscience et doivent agir les uns envers les autres dans un esprit de fraternité",
92
+ ]
93
+ results: List[List[str]] = m.infer(input_texts)
94
+ for input_text, output_texts in zip(input_texts, results):
95
+ print(f"Input: {input_text}")
96
+ print(f"Outputs:")
97
+ for text in output_texts:
98
+ print(f"\t{text}")
99
+ print()
100
+
101
+ ```
102
+
103
+ </details>
104
+
105
+ <details open>
106
+
107
+ <summary>Expected Output</summary>
108
+
109
+ ```text
110
+ Input: hola mundo cómo estás estamos bajo el sol y hace mucho calor santa coloma abre los huertos urbanos a las escuelas de la ciudad
111
+ Outputs:
112
+ Hola Mundo, ¿cómo estás?
113
+ Estamos bajo el sol y hace mucho calor.
114
+ Santa Coloma abre los huertos urbanos a las escuelas de la ciudad.
115
+
116
+ Input: hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in
117
+ Outputs:
118
+ Hello Friend, how's it going?
119
+ It's snowing outside right now.
120
+ In Connecticut, a large storm is moving in.
121
+
122
+ Input: 未來疫苗將有望覆蓋3歲以上全年齡段美國與北約軍隊已全部撤離還有鐵路公路在內的各項基建的來源都將枯竭
123
+ Outputs:
124
+ 未來,疫苗將有望覆蓋3歲以上全年齡段。
125
+ 美國與北約軍隊已全部撤離。
126
+ 還有鐵路公路在內的各項基建的來源都將枯竭。
127
+
128
+ Input: በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል
129
+ Outputs:
130
+ በባለፈው ሳምንት ኢትዮጵያ ከሶማሊያ 3 ሺህ ወታደሮቿንም እንዳስወጣች የሶማሊያው ዳልሳን ሬድዮ ዘግቦ ነበር።
131
+ ጸጥታ ሃይሉና ህዝቡ ተቀናጅቶ በመስራቱ በመዲናዋ ላይ የታቀደው የጥፋት ሴራ ከሽፏል።
132
+
133
+ Input: all human beings are born free and equal in dignity and rights they are endowed with reason and conscience and should act towards one another in a spirit of brotherhood
134
+ Outputs:
135
+ All human beings are born free and equal in dignity and rights.
136
+ They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
137
+
138
+ Input: सभी मनुष्य जन्म से मर्यादा और अधिकारों में स्वतंत्र और समान होते हैं वे तर्क और विवेक से संपन्न हैं तथा उन्हें भ्रातृत्व की भावना से परस्पर के प्रति कार्य करना चाहिए
139
+ Outputs:
140
+ सभी मनुष्य जन्म से मर्यादा और अधिकारों में स्वतंत्र और समान होते हैं।
141
+ वे तर्क और विवेक से संपन्न हैं तथा उन्हें भ्रातृत्व की भावना से परस्पर के प्रति कार्य करना चाहिए।
142
+
143
+ Input: wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw są oni obdarzeni rozumem i sumieniem i powinni postępować wobec innych w duchu braterstwa
144
+ Outputs:
145
+ Wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw.
146
+ Są oni obdarzeni rozumem i sumieniem i powinni postępować wobec innych w duchu braterstwa.
147
+
148
+ Input: tous les êtres humains naissent libres et égaux en dignité et en droits ils sont doués de raison et de conscience et doivent agir les uns envers les autres dans un esprit de fraternité
149
+ Outputs:
150
+ Tous les êtres humains naissent libres et égaux, en dignité et en droits.
151
+ Ils sont doués de raison et de conscience et doivent agir les uns envers les autres.
152
+ Dans un esprit de fraternité.
153
+
154
+ ```
155
+
156
+ Note that "Mundo" and "Friend" are proper nouns in this usage, which is why the model consistently upper-cases similar tokens in multiple languages.
157
+
158
+ </details>
159
+
160
  # Model Details
161
 
162
  This model generally follows the graph shown below, with brief descriptions for each step following.
 
233
  | ¿ | Inverted question mark | Spanish |
234
 
235
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
236
 
237
  # Training Details
238
  This model was trained in the NeMo framework.
 
248
  This model was trained on news data, and may not perform well on conversational or informal data.
249
 
250
  This model predicts punctuation only once per subword.
251
+ This implies that some acronyms, e.g., 'U.S.', cannot properly be punctuated.
252
  This concession was accepted on two grounds:
253
  1. Such acronyms are rare, especially in the context of multi-lingual models
254
  2. Punctuated acronyms are typically pronounced as individual characters, e.g., 'U.S.' vs. 'NATO'.