1-800-BAD-CODE/punctuation_fullstop_truecase_romance

@@ -66,10 +66,15 @@ from punctuators.models import PunctCapSegModelONNX
 # This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
 m = PunctCapSegModelONNX.from_pretrained("pcs_romance")
-# Define some input texts to punctuate
 input_texts: List[str] = [
     "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
     "hola amigo cómo estás es un día lluvioso hoy",
 ]
 results: List[List[str]] = m.infer(input_texts)
 for input_text, output_texts in zip(input_texts, results):
@@ -119,20 +124,20 @@ This is accomplished behind the scenes by splitting the input into overlapping s
 If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
 # Contact
-Contact me at shane.carroll@utsa.edu with requests or issues, or on the community tab.
 # Metrics
 Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others).
-Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all text.
-Since punctuation is sometimes subjective (e.g., see the example outputs above) punctuation metrics can be misleading.
-Also, keep in mind that the data is noisy. Catalan is especially noisy, since it's OpenSubtitles (note how Catalan has a support of 50 for "¿" which should not appear).
 Note that we call the label "¿" "pre-punctuation" since it is unique in that it appears before words, and thus
 we predict it separate from the other punctuation tokens.
-Generally, acronyms are rare and difficult, periods are easy, commas are a bit hard, and question marks are hard.
 Expand any of the following tabs to see metrics for that language.

 # This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
 m = PunctCapSegModelONNX.from_pretrained("pcs_romance")
+# Define some input texts to punctuate, at least one per language
 input_texts: List[str] = [
     "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
     "hola amigo cómo estás es un día lluvioso hoy",
+    "hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat",
+    "ciao amico come va oggi è stata una giornata piovosa",
+    "olá amigo como tá indo estava chuvoso hoje",
+    "salut l'ami comment ça va il pleuvait aujourd'hui",
+    "salut prietene cum stă treaba azi a fost ploios",
 ]
 results: List[List[str]] = m.infer(input_texts)
 for input_text, output_texts in zip(input_texts, results):
 If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
 # Contact
+Contact me at shane.carroll@utsa.edu with requests or issues, or just let me know on the community tab.
 # Metrics
 Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others).
+Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all letters.
+Since punctuation is subjective (e.g., see "hello friend how's it going" in the above examples), punctuation metrics can be misleading.
+Also, keep in mind that the data is noisy. Catalan is especially noisy, since it's OpenSubtitles (note how Catalan has 50 instances of "¿", which should not appear).
 Note that we call the label "¿" "pre-punctuation" since it is unique in that it appears before words, and thus
 we predict it separate from the other punctuation tokens.
+Generally, periods are easy, commas are harder, question marks are hard, and acronyms are rare and noisy.
 Expand any of the following tabs to see metrics for that language.
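
The test-set construction described under Metrics (concatenate 10 sentences per example, remove punctuation, lower-case all letters) might look roughly like the sketch below. This is not the author's actual preprocessing script: the function name `make_examples`, the `PUNCTUATION` set, and the choice to keep apostrophes and hyphens (inferred from inputs such as "s'ha" and "aujourd'hui" above) are all assumptions made for illustration.

```python
from typing import List

# Assumed set of punctuation marks to strip; the real set is not given in the README.
# Apostrophes and hyphens are deliberately kept, since the sample inputs retain them.
PUNCTUATION = set(".,?!;:¿¡")


def make_examples(sentences: List[str], sentences_per_example: int = 10) -> List[str]:
    """Build unpunctuated, lower-cased examples from held-out sentences."""
    examples: List[str] = []
    for start in range(0, len(sentences), sentences_per_example):
        # Concatenate one group of sentences into a single example.
        chunk = " ".join(sentences[start:start + sentences_per_example])
        # Drop the punctuation characters, then lower-case everything.
        stripped = "".join(ch for ch in chunk if ch not in PUNCTUATION).lower()
        # Collapse any runs of whitespace left behind by the removal.
        examples.append(" ".join(stripped.split()))
    return examples


if __name__ == "__main__":
    held_out = ["Hola, amigo.", "¿Cómo estás?", "Es un día lluvioso hoy."]
    for example in make_examples(held_out, sentences_per_example=3):
        print(example)
```

The original, punctuated sentences would serve as the reference when scoring the model's output on each example.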