nazneen committed on
Commit 3684bd2 (1 parent: 8745d34)

model documentation

Files changed (1)
  1. README.md +185 -4
README.md CHANGED
@@ -5,9 +5,11 @@ language:
5
  - fr
6
  - it
7
  - nl
 
8
  tags:
9
  - punctuation prediction
10
  - punctuation
 
11
  datasets: wmt/europarl
12
  license: mit
13
  widget:
@@ -18,14 +20,103 @@ widget:
18
  - text: "Ist das eine Frage Frau Müller"
19
  example_title: "German"
20
  - text: "My name is Clara and I live in Berkeley California"
21
- example_title: "English"
 
22
  metrics:
23
  - f1
24
  ---
25
 
26
- # Work in progress
27
 
28
- ## Classification report over all languages
29
  ```
30
  precision recall f1-score support
31
 
@@ -39,4 +130,94 @@ metrics:
39
  accuracy 0.98 54504270
40
  macro avg 0.83 0.75 0.78 54504270
41
  weighted avg 0.98 0.98 0.98 54504270
42
- ```
5
  - fr
6
  - it
7
  - nl
8
+
9
  tags:
10
  - punctuation prediction
11
  - punctuation
12
+
13
  datasets: wmt/europarl
14
  license: mit
15
  widget:
 
20
  - text: "Ist das eine Frage Frau Müller"
21
  example_title: "German"
22
  - text: "My name is Clara and I live in Berkeley California"
23
+ example_title: "English"
24
+
25
  metrics:
26
  - f1
27
  ---
28
 
 
29
 
30
+ # Model Card for fullstop-punctuation-multilingual-base
31
+
32
+ # Model Details
33
+
34
+ ## Model Description
35
+
36
+ The goal of this task is to train NLP models that predict end-of-sentence (EOS) markers and punctuation marks in automatically generated or transcribed text.
37
+
38
+ - **Developed by:** Oliver Guhr
39
+ - **Shared by [Optional]:** Oliver Guhr
40
+ - **Model type:** Token Classification
41
+ - **Language(s) (NLP):** English, German, French, Italian, Dutch
42
+ - **License:** MIT
43
+ - **Parent Model:** xlm-roberta-base
44
+ - **Resources for more information:**
45
+ - [GitHub Repo](https://github.com/oliverguhr/fullstop-deep-punctuation-prediction)
46
+ - [Associated Paper](https://www.researchgate.net/profile/Oliver-Guhr/publication/355038679_FullStop_Multilingual_Deep_Models_for_Punctuation_Prediction/links/615a0ce3a6fae644fbd08724/FullStop-Multilingual-Deep-Models-for-Punctuation-Prediction.pdf)
47
+
48
+
49
+
50
+ # Uses
51
+
52
+
53
+ ## Direct Use
54
+ This model can be used for token classification, specifically punctuation prediction on unpunctuated text.
55
+
56
+ ## Downstream Use [Optional]
57
+
58
+ More information needed.
59
+
60
+ ## Out-of-Scope Use
61
+
62
+ The model should not be used to intentionally create hostile or alienating environments for people.
63
+
64
+ # Bias, Risks, and Limitations
65
+
66
+
67
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
68
+
69
+
70
+
71
+ ## Recommendations
72
+
73
+
74
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
75
+
76
+ # Training Details
77
+
78
+ ## Training Data
79
+
80
+ The model authors note in the [associated paper](https://www.researchgate.net/profile/Oliver-Guhr/publication/355038679_FullStop_Multilingual_Deep_Models_for_Punctuation_Prediction/links/615a0ce3a6fae644fbd08724/FullStop-Multilingual-Deep-Models-for-Punctuation-Prediction.pdf):
81
+ > The task consists in predicting EOS and punctuation marks on unpunctuated lowercased text. The organizers of the SeppNLG shared task provided 470 MB of English, German, French, and Italian text. This data set consists of a training and a development set.
82
+
83
+
84
+ ## Training Procedure
85
+
86
+
87
+ ### Preprocessing
88
+
89
+ More information needed
90
+
91
+
92
+
93
+
94
+
95
+ ### Speeds, Sizes, Times
96
+ More information needed
97
+
98
+
99
+ # Evaluation
100
+
101
+
102
+ ## Testing Data, Factors & Metrics
103
+
104
+ ### Testing Data
105
+
106
+ More information needed
107
+
108
+
109
+ ### Factors
110
+ More information needed
111
+
112
+ ### Metrics
113
+
114
+ More information needed
115
+
116
+
117
+ ## Results
118
+
119
+ ### Classification report over all languages
120
  ```
121
  precision recall f1-score support
122
 
 
130
  accuracy 0.98 54504270
131
  macro avg 0.83 0.75 0.78 54504270
132
  weighted avg 0.98 0.98 0.98 54504270
133
+ ```
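The report above lists both a macro average and a weighted average, and the two differ noticeably (0.78 vs. 0.98 F1). A minimal sketch of how such averages are typically computed may help explain the gap. The per-class values below are hypothetical placeholders, not the model's actual scores, and the function name `macro_and_weighted_f1` is introduced here for illustration only.

```python
from typing import Dict, Tuple


def macro_and_weighted_f1(per_class: Dict[str, Tuple[float, int]]) -> Tuple[float, float]:
    """Return (macro F1, support-weighted F1).

    per_class maps a label to a pair (f1_score, support).
    """
    if not per_class:
        raise ValueError("per_class must contain at least one label")
    total_support = sum(support for _, support in per_class.values())
    # Macro average: every class counts equally, regardless of its size.
    macro = sum(f1 for f1, _ in per_class.values()) / len(per_class)
    # Weighted average: each class counts in proportion to its support.
    weighted = sum(f1 * support for f1, support in per_class.values()) / total_support
    return macro, weighted


# Hypothetical per-class scores: one dominant class and two rare ones.
example = {"0": (0.99, 9000), ",": (0.80, 700), "?": (0.60, 300)}
macro, weighted = macro_and_weighted_f1(example)
print(f"macro F1: {macro:.3f}, weighted F1: {weighted:.3f}")
```

With these made-up numbers the macro F1 is about 0.80 while the weighted F1 is about 0.97: the dominant class pulls the weighted average up, and the rare classes pull the macro average down. A similar imbalance is a plausible reading of the gap in the report, although the per-class rows are needed to confirm it.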
134
+
135
+
136
+
137
+ # Model Examination
138
+
139
+ More information needed
140
+
141
+ # Environmental Impact
142
+
143
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
144
+
145
+ - **Hardware Type:** More information needed
146
+ - **Hours used:** More information needed
147
+ - **Cloud Provider:** More information needed
148
+ - **Compute Region:** More information needed
149
+ - **Carbon Emitted:** More information needed
150
+
151
+ # Technical Specifications [optional]
152
+
153
+ ## Model Architecture and Objective
154
+
155
+ More information needed
156
+
157
+ ## Compute Infrastructure
158
+
159
+ More information needed
160
+
161
+ ### Hardware
162
+
163
+
164
+ More information needed
165
+
166
+ ### Software
167
+
168
+ More information needed.
169
+
170
+ # Citation
171
+
172
+
173
+ **BibTeX:**
174
+
175
+
176
+ ```bibtex
177
+ @inproceedings{guhr-EtAl:2021:fullstop,
178
+ title = {FullStop: Multilingual Deep Models for Punctuation Prediction},
179
+ author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
180
+ booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
181
+ month = {June},
182
+ year = {2021},
183
+ address = {Winterthur, Switzerland},
184
+ publisher = {CEUR Workshop Proceedings},
185
+ url = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
186
+ }
187
+ ```
188
+
189
+
190
+
191
+
192
+ # Glossary [optional]
193
+ More information needed
194
+
195
+ # More Information [optional]
196
+ More information needed
197
+
198
+
199
+ # Model Card Authors [optional]
200
+
201
+ Oliver Guhr in collaboration with Ezi Ozoani and the Hugging Face team
202
+
203
+
204
+ # Model Card Contact
205
+
206
+ More information needed
207
+
208
+ # How to Get Started with the Model
209
+
210
+ Use the code below to get started with the model.
211
+
212
+ <details>
213
+ <summary> Click to expand </summary>
214
+
215
+ ```python
216
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
217
+
218
+ tokenizer = AutoTokenizer.from_pretrained("oliverguhr/fullstop-punctuation-multilingual-base")
219
+
220
+ model = AutoModelForTokenClassification.from_pretrained("oliverguhr/fullstop-punctuation-multilingual-base")
221
+ ```
222
+ </details>
223
+
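The snippet above only loads the tokenizer and model. To get punctuated text, the per-word predictions still have to be merged back into a string. The sketch below shows that post-processing step only. It assumes one predicted label per word and assumes that the label `"0"` means "no punctuation after this word"; both are assumptions and should be checked against the model's `id2label` configuration. The function `restore_punctuation` and the labels in the demo are hypothetical and hand-written, not model output.

```python
from typing import List

# Assumed label meaning "no punctuation follows this word".
NO_PUNCT = "0"


def restore_punctuation(words: List[str], labels: List[str]) -> str:
    """Append each predicted punctuation mark to its word and join the words."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    pieces = []
    for word, label in zip(words, labels):
        # Keep the word unchanged when no punctuation is predicted.
        pieces.append(word if label == NO_PUNCT else word + label)
    return " ".join(pieces)


if __name__ == "__main__":
    # Hand-written labels for one of the widget examples (not model output).
    words = "Ist das eine Frage Frau Müller".split()
    labels = ["0", "0", "0", ",", "0", "?"]
    print(restore_punctuation(words, labels))
```

In practice the labels would come from the model's token-level predictions, which first need to be aligned from subword tokens back to whole words. That alignment is left out here to keep the sketch self-contained.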