1-800-BAD-CODE committed on
Commit
76f7f82
1 Parent(s): 874abfd

Update README.md

  1. README.md +11 -6
README.md CHANGED
@@ -66,10 +66,15 @@ from punctuators.models import PunctCapSegModelONNX
66
  # This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
67
  m = PunctCapSegModelONNX.from_pretrained("pcs_romance")
68
 
69
- # Define some input texts to punctuate
70
  input_texts: List[str] = [
71
  "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
72
  "hola amigo cómo estás es un día lluvioso hoy",
73
  ]
74
  results: List[List[str]] = m.infer(input_texts)
75
  for input_text, output_texts in zip(input_texts, results):
@@ -119,20 +124,20 @@ This is accomplished behind the scenes by splitting the input into overlapping s
119
  If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
120
 
121
  # Contact
122
- Contact me at shane.carroll@utsa.edu with requests or issues, or on the community tab.
123
 
124
  # Metrics
125
  Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others).
126
- Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all text.
127
 
128
- Since punctuation is sometimes subjective (e.g., see the example outputs above) punctuation metrics can be misleading.
129
 
130
- Also, keep in mind that the data is noisy. Catalan is especially noisy, since it's OpenSubtitles (note how Catalan has a support of 50 for "¿" which should not appear).
131
 
132
  Note that we call the label "¿" "pre-punctuation" since it is unique in that it appears before words, and thus
133
  we predict it separate from the other punctuation tokens.
134
 
135
- Generally, acronyms are rare and difficult, periods are easy, commas are a bit hard, and question marks are hard.
136
 
137
  Expand any of the following tabs to see metrics for that language.
138
 
 
66
  # This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
67
  m = PunctCapSegModelONNX.from_pretrained("pcs_romance")
68
 
69
+ # Define some input texts to punctuate, at least one per language
70
  input_texts: List[str] = [
71
  "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
72
  "hola amigo cómo estás es un día lluvioso hoy",
73
+ "hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat",
74
+ "ciao amico come va oggi è stata una giornata piovosa",
75
+ "olá amigo como tá indo estava chuvoso hoje",
76
+ "salut l'ami comment ça va il pleuvait aujourd'hui",
77
+ "salut prietene cum stă treaba azi a fost ploios",
78
  ]
79
  results: List[List[str]] = m.infer(input_texts)
80
  for input_text, output_texts in zip(input_texts, results):
 
124
  If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
125
 
126
  # Contact
127
+ Contact me at shane.carroll@utsa.edu with requests or issues, or just let me know on the community tab.
128
 
129
  # Metrics
130
  Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others).
131
+ Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all letters.
132
 
133
+ Since punctuation is subjective (e.g., see "hello friend how's it going" in the above examples), punctuation metrics can be misleading.
134
 
135
+ Also, keep in mind that the data is noisy. Catalan is especially noisy, since it's OpenSubtitles (note how Catalan has 50 instances of "¿", which should not appear).
136
 
137
  Note that we call the label "¿" "pre-punctuation" since it is unique in that it appears before words, and thus
138
  we predict it separate from the other punctuation tokens.
139
 
140
+ Generally, periods are easy, commas are a bit harder, question marks are hard, and acronyms are rare and noisy.
141
 
142
  Expand any of the following tabs to see metrics for that language.
143
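
The Metrics text in this diff describes test examples as 10 concatenated sentences with punctuation removed and letters lower-cased. The sketch below illustrates one way that preprocessing could look. It is not the author's script: `make_examples`, `PUNCTUATION`, and the choice to drop a trailing incomplete group are all assumptions made for illustration, and the exact set of characters the author stripped is not stated in the README.

```python
from typing import List

# Assumption: only these marks are stripped. The README does not list the exact
# set; apostrophes are kept here because the README's sample inputs keep them.
PUNCTUATION = ".,?¿"


def make_examples(sentences: List[str], sentences_per_example: int = 10) -> List[str]:
    """Build unpunctuated, lower-cased examples from held-out sentences (hypothetical helper)."""
    strip_table = str.maketrans("", "", PUNCTUATION)
    examples: List[str] = []
    for start in range(0, len(sentences), sentences_per_example):
        group = sentences[start:start + sentences_per_example]
        if len(group) < sentences_per_example:
            break  # Assumption: an incomplete final group is discarded.
        joined = " ".join(group)
        cleaned = joined.translate(strip_table).lower()
        # Collapse any whitespace runs left behind by removed characters.
        examples.append(" ".join(cleaned.split()))
    return examples
```

Calling `make_examples(lines)` on a list of held-out sentences would yield model inputs in the same form as the `input_texts` shown in the usage hunk; the original punctuated text would then serve as the reference when scoring.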