gonzalez-agirre committed
Commit eefba94
1 Parent(s): a9bffd7

Update README.md

Files changed (1):
  1. README.md +17 -12

README.md CHANGED
@@ -7,15 +7,21 @@ language:
 license: apache-2.0
 
 tags:
+
 - "national library of spain"
+
 - "spanish"
+
 - "bne"
+
 - "roberta-base-bne"
 
 datasets:
+
 - "bne"
 
 metrics:
+
 - "ppl"
 
 widget:
@@ -59,13 +65,15 @@ widget:
 - **Data:** BNE
 
 ## Model description
- RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.
+ **roberta-base-bne** is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text, processed for this work and compiled from the web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.
 
 ## Intended uses and limitations
+ The **roberta-base-bne** model is ready to use only for masked language modeling, i.e. the Fill Mask task (try the inference API or read the next section).
+ However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification, or Named Entity Recognition.
 You can use the raw model for fill mask or fine-tune it to a downstream task.
 
 ## How to use
- You can use this model directly with a pipeline for fill mask. Since the generation relies on some randomness, we set a seed for reproducibility:
+ Here is how to use this model:
 
 ```python
 >>> from transformers import pipeline
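The `How to use` snippet added above is cut off at the hunk boundary. Purely as a non-authoritative sketch of the fill-mask pipeline it introduces, assuming the public model ID `PlanTL-GOB-ES/roberta-base-bne` taken from the evaluation-table link further down (the seed value and the example sentence are illustrative, not from the card):

```python
from transformers import pipeline, set_seed

# Hedged sketch, not the card's own snippet (that one is truncated by the diff).
# Model ID assumed from the evaluation-table link; the example sentence is illustrative.
set_seed(42)  # the removed wording mentioned setting a seed for reproducibility
unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")

# RoBERTa tokenizers use "<mask>" as the mask token.
for prediction in unmasker("Madrid es la <mask> de España."):
    print(prediction["token_str"], round(prediction["score"], 4))
```

Each prediction is a dict with `sequence`, `score`, `token`, and `token_str` fields; fine-tuning for the downstream tasks mentioned in the added lines would instead typically go through the `AutoModelFor*` classes.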
@@ -169,7 +177,7 @@ At the time of submission, no measures have been taken to estimate the bias and
 
 The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.
 
- To obtain a high-quality training corpus, the corpus has been preprocessed with a pipeline of operations, including among the others, sentence splitting, language detection, filtering of bad-formed sentences and deduplication of repetitive contents. During the process document boundaries are kept. This resulted into 2TB of Spanish clean corpus. Further global deduplication among the corpus is applied, resulting into 570GB of text.
+ To obtain a high-quality training corpus, the corpus has been preprocessed with a pipeline of operations, including, among others, sentence splitting, language detection, filtering of badly formed sentences, and deduplication of repetitive content. Document boundaries are kept during the process. This resulted in 2TB of clean Spanish corpus. A further global deduplication step across the corpus is then applied, resulting in 570GB of text.
 
 Some of the statistics of the corpus:
 
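The added paragraph lists the preprocessing steps (sentence splitting, language detection, filtering of badly formed sentences, deduplication) but does not spell out the tooling. As a toy illustration of the exact-match global deduplication idea only, not the project's actual pipeline:

```python
import hashlib

def deduplicate(documents):
    """Toy global deduplication: keep only the first occurrence of each exact document.

    Illustrative only; the card does not specify the real method, which has to work
    at terabyte scale and also handle near-duplicate content.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha1(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Hola mundo.", "Hola mundo.", "Otra frase distinta."]
print(deduplicate(corpus))  # ['Hola mundo.', 'Otra frase distinta.']
```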
 
@@ -179,18 +187,15 @@ Some of the statistics of the corpus:
 
 
 ### Training procedure
- The configuration of the **RoBERTa-base-bne** model is as follows:
- - RoBERTa-b: 12-layer, 768-hidden, 12-heads, 125M parameters.
-
- The pretraining objective used for this architecture is masked language modeling without next sentence prediction.
- The training corpus has been tokenized using a byte version of Byte-Pair Encoding (BPE) used in the original [RoBERTA](https://arxiv.org/abs/1907.11692) model with a vocabulary size of 50,262 tokens.
- The RoBERTa-base-bne pre-training consists of a masked language model training that follows the approach employed for the RoBERTa base. The training lasted a total of 48 hours with 16 computing nodes each one with 4 NVIDIA V100 GPUs of 16GB VRAM.
+ The training corpus has been tokenized using a byte-level version of Byte-Pair Encoding (BPE), as used in the original [RoBERTa](https://arxiv.org/abs/1907.11692) model, with a vocabulary size of 50,262 tokens.
+
+ The **roberta-base-bne** pre-training consists of a masked language model training that follows the approach employed for RoBERTa base. The training lasted a total of 48 hours on 16 computing nodes, each one with 4 NVIDIA V100 GPUs of 16GB VRAM.
 
 ## Evaluation
 
 When fine-tuned on downstream tasks, this model achieves the following results:
 
- | Dataset | Metric | [**RoBERTa-b**](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) |
+ | Dataset | Metric | [**RoBERTa-base**](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) |
 |--------------|----------|------------|
 | MLDoc | F1 | 0.9664 |
 | CoNLL-NERC | F1 | 0.8851 |
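The added training-procedure text mentions a byte-level BPE tokenizer with a 50,262-token vocabulary; this can be checked directly against the published checkpoint. A minimal sketch, again assuming the model ID from the evaluation-table link:

```python
from transformers import AutoTokenizer

# Hedged sketch: load the tokenizer described above and check the stated vocabulary size.
# Model ID assumed from the evaluation-table link in this card.
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")

print(tokenizer.vocab_size)  # expected to be 50262 per the training-procedure note
print(tokenizer.tokenize("El modelo se preentrenó con texto en español."))  # byte-level BPE subwords
```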
@@ -213,13 +218,13 @@ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
 For further information, send an email to <plantl-gob-es@bsc.es>
 
 ### Copyright
- Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
+ Copyright by the [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) (2022)
 
 ### Licensing information
 This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
 ### Funding
- This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
+ This work was funded by the [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the Plan-TL.
 
 ### Citation information
 If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):