mmarimon committed
Commit c1ee3e0
1 Parent(s): 56cc8af

Update README.md

Files changed (1)
  1. README.md +19 -27
README.md CHANGED
@@ -32,22 +32,20 @@ widget:
 <summary>Click to expand</summary>
 
 - [Overview](#overview)
- - [Model Description](#model-description)
- - [Intended Uses and Limitations](#intended-uses-and-limitations)
- - [How to Use](#how-to-use)
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-uses-and-limitations)
+ - [How to use](#how-to-use)
 - [Limitations and bias](#limitations-and-bias)
 - [Training](#training)
- - [Training Data](#training-data)
- - [Training Procedure](#training-procedure)
+ - [Training data](#training-data)
+ - [Training procedure](#training-procedure)
 - [Evaluation](#evaluation)
- - [Evaluation Results](#evaluation-results)
- - [Additional Information](#additional-information)
- - [Contact Information](#contact-information)
+ - [Additional information](#additional-information)
+ - [Contact information](#contact-information)
 - [Copyright](#copyright)
- - [Licensing Information](#licensing-information)
+ - [Licensing information](#licensing-information)
 - [Funding](#funding)
 - [Citation Information](#citation-information)
- - [Contributions](#contributions)
 - [Disclaimer](#disclaimer)
 
 </details>
@@ -58,13 +56,13 @@ widget:
 - **Task:** fill-mask
 - **Data:** BNE
 
- ## Model Description
+ ## Model description
 RoBERTa-large-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) large model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.
 
- ## Intended Uses and Limitations
+ ## Intended uses and limitations
 You can use the raw model for fill mask or fine-tune it to a downstream task.
 
- ## How to Use
+ ## How to use
 You can use this model directly with a pipeline for fill mask. Since the generation relies on some randomness, we set a seed for reproducibility:
 
 ```python
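The Python block opened at the end of this hunk falls outside the diff context, but the usage it introduces, a fill-mask pipeline with a fixed seed, looks roughly like the sketch below. It assumes the standard transformers pipeline API; the model id is the one linked in the evaluation table later in the file, and the example sentence is purely illustrative.

```python
# Sketch only, not the README's own snippet: fill-mask usage as described above.
from transformers import pipeline, set_seed

set_seed(42)  # the README fixes a seed for reproducibility
unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-large-bne")

# RoBERTa checkpoints use <mask> as the mask token.
for pred in unmasker("El español es una lengua que se habla en gran parte del <mask>."):
    print(pred["token_str"], round(pred["score"], 4))
```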
@@ -112,7 +110,7 @@ At the time of submission, no measures have been taken to estimate the bias and
 
 ## Training
 
- ### Training Data
+ ### Training data
 
 The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.
 
@@ -124,7 +122,7 @@ Some of the statistics of the corpus:
 |---------|---------------------|------------------|-----------|
 | BNE | 201,080,084 | 135,733,450,668 | 570GB |
 
- ### Training Procedure
+ ### Training procedure
 The configuration of the **RoBERTa-large-bne** model is as follows:
 - RoBERTa-l: 24-layer, 1024-hidden, 16-heads, 355M parameters.
 
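The architecture figures quoted in the hunk above (24 layers, hidden size 1024, 16 attention heads, roughly 355M parameters) can be cross-checked against the published checkpoint's configuration. A minimal sketch, assuming the standard transformers AutoConfig/RobertaConfig field names:

```python
# Sketch: read the RoBERTa-large-bne architecture from its published config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")
print(config.num_hidden_layers)    # 24-layer
print(config.hidden_size)          # 1024-hidden
print(config.num_attention_heads)  # 16-heads

# The ~355M parameter count can be verified by loading the weights, e.g.:
#   from transformers import AutoModel
#   model = AutoModel.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")
#   print(sum(p.numel() for p in model.parameters()))
```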
@@ -134,7 +132,6 @@ The RoBERTa-large-bne pre-training consists of a masked language model training
 
 ## Evaluation
 
- ### Evaluation Results
 When fine-tuned on downstream tasks, this model achieves the following results:
 | Dataset | Metric | [**RoBERTa-l**](https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne) |
 |--------------|----------|------------|
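The results table (its data rows lie outside this hunk's context) reports scores after fine-tuning on downstream tasks. A rough, self-contained sketch of such a fine-tune with the transformers Trainer follows; the tiny in-memory dataset, label count, and hyperparameters are placeholders, not the setup behind the reported numbers.

```python
# Placeholder fine-tuning sketch; data and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "PlanTL-GOB-ES/roberta-large-bne"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny in-memory dataset so the sketch runs end to end.
data = Dataset.from_dict({
    "text": ["Una película estupenda.", "Un servicio pésimo."],
    "label": [1, 0],
})
encoded = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

args = TrainingArguments(output_dir="roberta-large-bne-finetuned",
                         per_device_train_batch_size=2,
                         num_train_epochs=1,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=encoded).train()
```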
@@ -150,25 +147,24 @@ When fine-tuned on downstream tasks, this model achieves the following results:
 
 For more evaluation details visit our [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish) or [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405).
 
- ## Additional Information
+ ## Additional information
 
- ### Contact Information
+ ### Author
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
 
+ ### Contact information
 For further information, send an email to <plantl-gob-es@bsc.es>
 
 ### Copyright
-
 Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
 
- ### Licensing Information
-
+ ### Licensing information
 This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
 ### Funding
-
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
 
- ### Citation Information
+ ### Citation information
 
 If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
 ```
@@ -189,10 +185,6 @@ Intelligence (SEDIA) within the framework of the Plan-TL.},
 }
 ```
 
- ### Contributions
-
- [N/A]
-
 ### Disclaimer
 
 <details>
 