mmarimon committed
Commit a9bffd7
Parent: 96bf243

Update README.md

Files changed (1): README.md (+22 -34)
README.md CHANGED
@@ -33,22 +33,21 @@ widget:
  <summary>Click to expand</summary>

  - [Overview](#overview)
- - [Model Description](#model-description)
- - [Intended Uses and Limitations](#intended-uses-and-limitations)
- - [How to Use](#how-to-use)
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-uses-and-limitations)
+ - [How to use](#how-to-use)
  - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
- - [Training Data](#training-data)
- - [Training Procedure](#training-procedure)
+ - [Training data](#training-data)
+ - [Training procedure](#training-procedure)
  - [Evaluation](#evaluation)
- - [Evaluation Results](#evaluation-results)
- - [Additional Information](#additional-information)
- - [Contact Information](#contact-information)
+ - [Additional information](#additional-information)
+ - [Author](#author)
+ - [Contact information](#contact-information)
  - [Copyright](#copyright)
- - [Licensing Information](#licensing-information)
+ - [Licensing information](#licensing-information)
  - [Funding](#funding)
  - [Citation Information](#citation-information)
- - [Contributions](#contributions)
  - [Disclaimer](#disclaimer)

  </details>
@@ -59,18 +58,13 @@ widget:
  - **Task:** fill-mask
  - **Data:** BNE

- ## Model Description
+ ## Model description
  RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

-
- ## Intended Uses and Limitations
-
+ ## Intended uses and limitations
  You can use the raw model for fill mask or fine-tune it to a downstream task.

-
-
-
- ## How to Use
+ ## How to use
  You can use this model directly with a pipeline for fill mask. Since the generation relies on some randomness, we set a seed for reproducibility:

  ```python
@@ -167,12 +161,11 @@ At the time of submission, no measures have been taken to estimate the bias and
  'sequence': 'Mohammed está pensando en ello.',
  'token': 1577,
  'token_str': ' ello'}]
-
  ```

  ## Training

- ### Training Data
+ ### Training data

  The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.

@@ -184,7 +177,8 @@ Some of the statistics of the corpus:
  |---------|---------------------|------------------|-----------|
  | BNE | 201,080,084 | 135,733,450,668 | 570GB |

- ### Training Procedure
+
+ ### Training procedure
  The configuration of the **RoBERTa-base-bne** model is as follows:
  - RoBERTa-b: 12-layer, 768-hidden, 12-heads, 125M parameters.

@@ -194,8 +188,8 @@ The RoBERTa-base-bne pre-training consists of a masked language model training t
 
  ## Evaluation

- ### Evaluation Results
  When fine-tuned on downstream tasks, this model achieves the following results:
+
  | Dataset | Metric | [**RoBERTa-b**](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) |
  |--------------|----------|------------|
  | MLDoc | F1 | 0.9664 |
@@ -210,25 +204,24 @@ When fine-tuned on downstream tasks, this model achieves the following results:
 
  For more evaluation details visit our [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish) or [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405).

- ## Additional Information
+ ## Additional information

- ### Contact Information
+ ### Author
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

+ ### Contact information
  For further information, send an email to <plantl-gob-es@bsc.es>

  ### Copyright
-
  Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

- ### Licensing Information
-
+ ### Licensing information
  This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

  ### Funding
-
  This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

- ### Citation Information
+ ### Citation information
  If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
  ```
  @article{,
@@ -246,13 +239,8 @@ Intelligence (SEDIA) within the framework of the Plan-TL.},
  url = {https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley},
  year = {2022},
  }
-
  ```

- ### Contributions
-
- [N/A]
-
  ### Disclaimer

  <details>
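The README mentions fine-tuning the raw model on downstream tasks, and the evaluation hunks report results for benchmarks such as MLDoc. A rough sketch along those lines, assuming the `transformers` and `datasets` libraries, with placeholder data files and hyperparameters rather than the paper's actual setup:

```python
# Hedged sketch of "fine-tune it to a downstream task" from the README.
# The CSV files, label count, and hyperparameters are placeholders, not the
# configuration used for the evaluation table (MLDoc, etc.).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "PlanTL-GOB-ES/roberta-base-bne"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical dataset with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                              max_length=512), batched=True)

args = TrainingArguments(output_dir="roberta-base-bne-finetuned",
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"],
                  tokenizer=tokenizer)  # tokenizer enables dynamic padding
trainer.train()
```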
 