mmarimon committed on
Commit
9eddec6
·
1 Parent(s): 9b33a8f

Update README.md

Files changed (1)
  1. README.md +154 -126
README.md CHANGED
@@ -18,40 +18,128 @@ license: apache-2.0
18
 
19
  # BERTa: RoBERTa-based Catalan language model
20
 
21
- ## BibTeX citation
22
-
23
- If you use any of these resources (datasets or models) in your work, please cite our latest paper:
24
-
25
- ```bibtex
26
- @inproceedings{armengol-estape-etal-2021-multilingual,
27
- title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
28
- author = "Armengol-Estap{\'e}, Jordi and
29
- Carrino, Casimiro Pio and
30
- Rodriguez-Penagos, Carlos and
31
- de Gibert Bonet, Ona and
32
- Armentano-Oller, Carme and
33
- Gonzalez-Agirre, Aitor and
34
- Melero, Maite and
35
- Villegas, Marta",
36
- booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
37
- month = aug,
38
- year = "2021",
39
- address = "Online",
40
- publisher = "Association for Computational Linguistics",
41
- url = "https://aclanthology.org/2021.findings-acl.437",
42
- doi = "10.18653/v1/2021.findings-acl.437",
43
- pages = "4933--4946",
44
- }
45
- ```
46
 
47
 
48
  ## Model description
49
-
50
  BERTa is a transformer-based masked language model for the Catalan language.
51
  It is based on the [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
52
  and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.
53
 
54
- ## Training corpora and preprocessing
55
 
56
  The training corpus consists of several corpora gathered from web crawling and public corpora.
57
 
@@ -83,17 +171,19 @@ Finally, the corpora are concatenated and further global deduplication among the
83
  The final training corpus consists of about 1.8B tokens.
84
 
85
 
86
- ## Tokenization and pretraining
87
 
88
  The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
89
  used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens.
 
90
  The BERTa pretraining consists of a masked language model training that follows the approach employed for the RoBERTa base model
91
  with the same hyperparameters as in the original work.
 
92
  The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.
93
 
94
  ## Evaluation
95
 
96
- ## CLUB benchmark
97
 
98
  The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
99
  that has been created along with the model.
@@ -137,7 +227,7 @@ Here are the train/dev/test splits of the datasets:
137
 
138
  _Fine-tuning on the downstream tasks was performed with the Hugging Face [**Transformers**](https://github.com/huggingface/transformers) library._
139
 
140
- ## Results
141
 
142
  Below are the evaluation results on the CLUB tasks, compared with the multilingual mBERT and XLM-RoBERTa models and
143
  the Catalan WikiBERT-ca model:
@@ -151,112 +241,50 @@ the Catalan WikiBERT-ca model
151
  | WikiBERT-ca | 77.66 | 97.60 | 77.18 | 73.22 | 85.45/70.75 | 65.21/36.60 |
152
 
153
 
154
- ## Intended uses & limitations
155
- The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)
156
- However, the is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification or Named Entity Recognition.
157
-
158
- ---
159
-
160
- ## Using BERTa
161
- ## Load model and tokenizer
162
-
163
- ``` python
164
- from transformers import AutoTokenizer, AutoModelForMaskedLM
165
-
166
- tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-ca-cased")
167
-
168
- model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-base-ca-cased")
169
- ```
170
-
171
- ## Fill Mask task
172
-
173
- Below, an example of how to use the masked language modelling task with a pipeline.
174
-
175
- ```python
176
- >>> from transformers import pipeline
177
- >>> unmasker = pipeline('fill-mask', model='PlanTL-GOB-ES/roberta-base-ca-cased')
178
- >>> unmasker("Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
179
- "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
180
- "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
181
- "i pel nord-oest per la serralada de Collserola "
182
- "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
183
- "la línia de costa encaixant la ciutat en un perímetre molt definit.")
184
-
185
- [
186
- {
187
- "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
188
- "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
189
- "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
190
- "i pel nord-oest per la serralada de Collserola "
191
- "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
192
- "la línia de costa encaixant la ciutat en un perímetre molt definit.",
193
- "score": 0.4177263379096985,
194
- "token": 734,
195
- "token_str": " Barcelona"
196
- },
197
- {
198
- "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
199
- "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
200
- "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
201
- "i pel nord-oest per la serralada de Collserola "
202
- "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
203
- "la línia de costa encaixant la ciutat en un perímetre molt definit.",
204
- "score": 0.10696165263652802,
205
- "token": 3849,
206
- "token_str": " Badalona"
207
- },
208
- {
209
- "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
210
- "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
211
- "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
212
- "i pel nord-oest per la serralada de Collserola "
213
- "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
214
- "la línia de costa encaixant la ciutat en un perímetre molt definit.",
215
- "score": 0.08135009557008743,
216
- "token": 19349,
217
- "token_str": " Collserola"
218
- },
219
- {
220
- "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
221
- "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
222
- "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
223
- "i pel nord-oest per la serralada de Collserola "
224
- "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
225
- "la línia de costa encaixant la ciutat en un perímetre molt definit.",
226
- "score": 0.07330769300460815,
227
- "token": 4974,
228
- "token_str": " Terrassa"
229
- },
230
- {
231
- "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
232
- "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
233
- "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
234
- "i pel nord-oest per la serralada de Collserola "
235
- "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
236
- "la línia de costa encaixant la ciutat en un perímetre molt definit.",
237
- "score": 0.03317456692457199,
238
- "token": 14333,
239
- "token_str": " Gavà"
240
- }
241
- ]
242
- ```
243
 
244
- This model was originally published as [bsc/roberta-base-ca-cased](https://huggingface.co/bsc/roberta-base-ca-cased).
 
245
 
246
- ## Copyright
 
247
 
 
248
  Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
249
 
250
- ## Licensing information
251
-
252
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
253
 
254
- ## Funding
255
-
256
  This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
257
 
258
- ## Disclaimer
259
 
 
260
  The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
261
 
262
  When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.
 
18
 
19
  # BERTa: RoBERTa-based Catalan language model
20
 
21
+ ## Table of contents
22
+ <details>
23
+ <summary>Click to expand</summary>
24
+
25
+ - [Model description](#model-description)
26
+ - [Intended uses and limitations](#intended-uses-and-limitations)
27
+ - [How to use](#how-to-use)
28
+ - [Limitations and bias](#limitations-and-bias)
29
+ - [Training](#training)
30
+ - [Evaluation](#evaluation)
31
+ - [Additional information](#additional-information)
32
+ - [Author](#author)
33
+ - [Contact information](#contact-information)
34
+ - [Copyright](#copyright)
35
+ - [Licensing information](#licensing-information)
36
+ - [Funding](#funding)
37
+ - [Citing information](#citing-information)
38
+ - [Disclaimer](#disclaimer)
39
+
40
+ </details>
41
 
42
 
43
  ## Model description
 
44
  BERTa is a transformer-based masked language model for the Catalan language.
45
  It is based on the [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) base model
46
  and has been trained on a medium-size corpus collected from publicly available corpora and crawlers.
47
 
48
+ This model was originally published as [bsc/roberta-base-ca-cased](https://huggingface.co/bsc/roberta-base-ca-cased).
49
+
50
+ ## Intended uses and limitations
51
+ The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section).
52
+ However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification or Named Entity Recognition.
53
+
54
+
55
+ ## How to use
56
+
57
+ ### Load model and tokenizer
58
+
59
+ ``` python
60
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
61
+ tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-ca-cased")
62
+ model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-base-ca-cased")
63
+ ```
64
+
65
+ ### Fill Mask task
66
+
67
+ Below is an example of how to use the model for masked language modelling with a pipeline.
68
+
69
+ ```python
70
+ >>> from transformers import pipeline
71
+ >>> unmasker = pipeline('fill-mask', model='PlanTL-GOB-ES/roberta-base-ca-cased')
72
+ >>> unmasker("Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
73
+ "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
74
+ "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
75
+ "i pel nord-oest per la serralada de Collserola "
76
+ "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
77
+ "la línia de costa encaixant la ciutat en un perímetre molt definit.")
78
+
79
+ [
80
+ {
81
+ "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
82
+ "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
83
+ "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
84
+ "i pel nord-oest per la serralada de Collserola "
85
+ "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
86
+ "la línia de costa encaixant la ciutat en un perímetre molt definit.",
87
+ "score": 0.4177263379096985,
88
+ "token": 734,
89
+ "token_str": " Barcelona"
90
+ },
91
+ {
92
+ "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
93
+ "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
94
+ "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
95
+ "i pel nord-oest per la serralada de Collserola "
96
+ "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
97
+ "la línia de costa encaixant la ciutat en un perímetre molt definit.",
98
+ "score": 0.10696165263652802,
99
+ "token": 3849,
100
+ "token_str": " Badalona"
101
+ },
102
+ {
103
+ "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
104
+ "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
105
+ "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
106
+ "i pel nord-oest per la serralada de Collserola "
107
+ "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
108
+ "la línia de costa encaixant la ciutat en un perímetre molt definit.",
109
+ "score": 0.08135009557008743,
110
+ "token": 19349,
111
+ "token_str": " Collserola"
112
+ },
113
+ {
114
+ "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
115
+ "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
116
+ "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
117
+ "i pel nord-oest per la serralada de Collserola "
118
+ "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
119
+ "la línia de costa encaixant la ciutat en un perímetre molt definit.",
120
+ "score": 0.07330769300460815,
121
+ "token": 4974,
122
+ "token_str": " Terrassa"
123
+ },
124
+ {
125
+ "sequence": " Situada a la costa de la mar Mediterrània, <mask> s'assenta en una plana formada "
126
+ "entre els deltes de les desembocadures dels rius Llobregat, al sud-oest, "
127
+ "i Besòs, al nord-est, i limitada pel sud-est per la línia de costa,"
128
+ "i pel nord-oest per la serralada de Collserola "
129
+ "(amb el cim del Tibidabo, 516,2 m, com a punt més alt) que segueix paral·lela "
130
+ "la línia de costa encaixant la ciutat en un perímetre molt definit.",
131
+ "score": 0.03317456692457199,
132
+ "token": 14333,
133
+ "token_str": " Gavà"
134
+ }
135
+ ]
136
+ ```
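The pipeline returns a list of dictionaries, so the candidates can be filtered and ranked with plain Python. The following is a minimal sketch of that post-processing step. The helper name `top_predictions` and the `min_score` threshold are hypothetical and not part of the Transformers API; the sample values are copied from the output above, with the long `sequence` field left out.

```python
def top_predictions(results, min_score=0.05):
    """Return (token, score) pairs with score >= min_score, best first.

    `results` is expected to look like the fill-mask output shown above:
    a list of dicts with at least "token_str" and "score" keys.
    """
    kept = [
        (r["token_str"].strip(), r["score"])  # strip the leading BPE space
        for r in results
        if r["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Scores and tokens copied from the example output; "sequence" omitted.
sample = [
    {"score": 0.4177263379096985, "token": 734, "token_str": " Barcelona"},
    {"score": 0.10696165263652802, "token": 3849, "token_str": " Badalona"},
    {"score": 0.08135009557008743, "token": 19349, "token_str": " Collserola"},
    {"score": 0.07330769300460815, "token": 4974, "token_str": " Terrassa"},
    {"score": 0.03317456692457199, "token": 14333, "token_str": " Gavà"},
]

for token, score in top_predictions(sample):
    print(f"{token}: {score:.3f}")
```

With the default threshold, the lowest-scoring candidate (`Gavà`) is dropped; pass `min_score=0.0` to keep every candidate.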
137
+
138
+
139
+ ## Limitations and bias
140
+
141
+ ## Training
142
+ ### Training corpora and preprocessing
143
 
144
  The training corpus consists of several corpora gathered from web crawling and public corpora.
145
 
 
171
  The final training corpus consists of about 1.8B tokens.
172
 
173
 
174
+ ### Tokenization and pretraining
175
 
176
  The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
177
  used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens.
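To illustrate what "byte-level" means here, the following toy sketch learns a few BPE merges directly over UTF-8 bytes. It is an illustration only and not the tokenizer used for BERTa: the function name `learn_bpe_merges` is hypothetical, and the real GPT-2 implementation additionally splits text into words before merging and stops at a fixed vocabulary size (52,000 in this case).

```python
from collections import Counter


def learn_bpe_merges(text, num_merges):
    """Learn up to `num_merges` byte-pair merges from `text` (toy version)."""
    # Start from single bytes, so characters such as "·" or "à" need no
    # special handling and there is never an unknown symbol.
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append((left, right))
        # Rewrite the sequence, fusing each occurrence of the chosen pair.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq


merges, tokens = learn_bpe_merges("paral·lela paral·lela", num_merges=5)
print(len(merges), "merges,", len(tokens), "tokens")
```

Because merging only concatenates adjacent byte strings, joining the resulting tokens always reproduces the original bytes.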
178
+
179
  The BERTa pretraining consists of a masked language model training that follows the approach employed for the RoBERTa base model
180
  with the same hyperparameters as in the original work.
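As a rough illustration of the masked language modelling objective, the sketch below applies the masking scheme commonly described for BERT and RoBERTa: about 15% of the positions are selected, and of those 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. These rates are assumed from the original RoBERTa setup rather than stated in this README, and the function name, the `mask_id` value, and the special-token ids are hypothetical.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_prob=0.15, rng=None):
    """Return (inputs, labels) for one masked-LM training example."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    # -100 marks positions ignored by the loss (the PyTorch default ignore_index).
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mlm_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Hypothetical ids: 0 and 2 as sentence delimiters, 4 as the mask token.
ids = [0, 734, 3849, 19349, 4974, 14333, 2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52000,
                             special_ids={0, 2}, rng=random.Random(0))
print(inputs)
print(labels)
```

Calling the function again on the same sequence with a fresh random state yields a different mask, which is the idea behind RoBERTa's dynamic masking.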
181
+
182
  The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.
183
 
184
  ## Evaluation
185
 
186
+ ### CLUB benchmark
187
 
188
  The BERTa model has been fine-tuned on the downstream tasks of the Catalan Language Understanding Evaluation benchmark (CLUB),
189
  that has been created along with the model.
 
227
 
228
  _Fine-tuning on the downstream tasks was performed with the Hugging Face [**Transformers**](https://github.com/huggingface/transformers) library._
229
 
230
+ ### Results
231
 
232
  Below are the evaluation results on the CLUB tasks, compared with the multilingual mBERT and XLM-RoBERTa models and
233
  the Catalan WikiBERT-ca model:
 
241
  | WikiBERT-ca | 77.66 | 97.60 | 77.18 | 73.22 | 85.45/70.75 | 65.21/36.60 |
242
 
243
 
244
+ ## Additional information
245
 
246
+ ### Author
247
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
248
 
249
+ ### Contact information
250
+ For further information, send an email to <plantl-gob-es@bsc.es>
251
 
252
+ ### Copyright
253
  Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
254
 
255
+ ### Licensing information
 
256
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
257
 
258
+ ### Funding
 
259
  This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
260
 
261
+ ### Citing information
262
+
263
+ If you use this model, please cite our latest paper:
264
+
265
+ ```bibtex
266
+ @inproceedings{armengol-estape-etal-2021-multilingual,
267
+ title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
268
+ author = "Armengol-Estap{\'e}, Jordi and
269
+ Carrino, Casimiro Pio and
270
+ Rodriguez-Penagos, Carlos and
271
+ de Gibert Bonet, Ona and
272
+ Armentano-Oller, Carme and
273
+ Gonzalez-Agirre, Aitor and
274
+ Melero, Maite and
275
+ Villegas, Marta",
276
+ booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
277
+ month = aug,
278
+ year = "2021",
279
+ address = "Online",
280
+ publisher = "Association for Computational Linguistics",
281
+ url = "https://aclanthology.org/2021.findings-acl.437",
282
+ doi = "10.18653/v1/2021.findings-acl.437",
283
+ pages = "4933--4946",
284
+ }
285
+ ```
286
 
287
+ ### Disclaimer
288
  The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
289
 
290
  When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.