Update README.md
README.md
CHANGED
@@ -33,22 +33,21 @@ widget:
<summary>Click to expand</summary>

- [Overview](#overview)
- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Training data](#training-data)
- [Training procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
- [Author](#author)
- [Contact information](#contact-information)
- [Copyright](#copyright)
- [Licensing information](#licensing-information)
- [Funding](#funding)
- [Citation Information](#citation-information)
- [Disclaimer](#disclaimer)

</details>
@@ -59,18 +58,13 @@ widget:
- **Task:** fill-mask
- **Data:** BNE

## Model description
RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

## Intended uses and limitations
You can use the raw model for fill-mask or fine-tune it on a downstream task.

## How to use
You can use this model directly with a fill-mask pipeline. Since the generation relies on some randomness, we set a seed for reproducibility:

```python
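# NOTE: the body of the original snippet is elided between the diff hunks; the
# lines below are only a sketch of typical fill-mask pipeline usage with this
# checkpoint, and the prompt is an arbitrary example, not the one from the card.
from transformers import pipeline, set_seed
from pprint import pprint

set_seed(42)  # the card notes a seed is set so the completions are reproducible
unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")
pprint(unmasker("El objetivo de la vida es <mask>."))  # <mask> is the model's mask token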
@@ -167,12 +161,11 @@ At the time of submission, no measures have been taken to estimate the bias and
'sequence': 'Mohammed está pensando en ello.',
'token': 1577,
'token_str': ' ello'}]
```

## Training

### Training data

The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.
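The cleaning and deduplication pipeline that turns these crawls into the training corpus is described in the paper and repository. Purely as an illustration of what the raw WARC crawl data looks like, the sketch below reads records with the third-party `warcio` package; the package choice and the file name are assumptions for demonstration, not details given in the card:

```python
# Hedged illustration only: the card says the corpus starts from 59TB of WARC
# crawl files but does not name the tooling; warcio and the file name below
# are assumptions.
from warcio.archiveiterator import ArchiveIterator

with open("bne-crawl-2019.warc.gz", "rb") as stream:  # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # actual fetched pages
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()  # raw HTML bytes to be cleaned
```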
@@ -184,7 +177,8 @@ Some of the statistics of the corpus:
|---------|---------------------|------------------|-------|
| BNE     | 201,080,084         | 135,733,450,668  | 570GB |

### Training procedure
The configuration of the **RoBERTa-base-bne** model is as follows:
- RoBERTa-b: 12-layer, 768-hidden, 12-heads, 125M parameters.
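As a quick sanity check (a sketch using standard `transformers` loading code, not something taken from the card), these architecture figures can be read back from the released checkpoint's configuration:

```python
# Hedged sketch: verify the 12-layer / 768-hidden / 12-head / ~125M-parameter
# figures against the published checkpoint.
from transformers import AutoConfig, AutoModelForMaskedLM

name = "PlanTL-GOB-ES/roberta-base-bne"
config = AutoConfig.from_pretrained(name)
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

model = AutoModelForMaskedLM.from_pretrained(name)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```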
@@ -194,8 +188,8 @@ The RoBERTa-base-bne pre-training consists of a masked language model training t
## Evaluation

When fine-tuned on downstream tasks, this model achieves the following results:

| Dataset | Metric | [**RoBERTa-b**](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) |
|---------|--------|-------------------------------------------------------------------------|
| MLDoc   | F1     | 0.9664                                                                  |
@@ -210,25 +204,24 @@ When fine-tuned on downstream tasks, this model achieves the following results:
For more evaluation details, visit our [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish) or the [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405).
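The numbers above come from fine-tuning this checkpoint on each downstream dataset. The snippet below is only a sketch of how such a fine-tune would typically start; the `num_labels=4` value is an assumption for an MLDoc-style 4-class setup and the training loop itself is left to the repository and paper:

```python
# Hedged sketch of loading the checkpoint for downstream fine-tuning; the
# actual training setup is described in the PlanTL-GOB-ES/lm-spanish repo
# and the paper, not here.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "PlanTL-GOB-ES/roberta-base-bne"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=4)  # assumed label count
# ... tokenize the task dataset and train with transformers.Trainer or a custom loop
```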
## Additional information

### Author
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

### Contact information
For further information, send an email to <plantl-gob-es@bsc.es>.

### Copyright
Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing information
This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding
This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

### Citation information
If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
```
@article{,
@@ -246,13 +239,8 @@ Intelligence (SEDIA) within the framework of the Plan-TL.},
url = {https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley},
year = {2022},
}
```

### Disclaimer

<details>