joanllop committed
Commit 9db5bf8
Parent: 8582afb

Update README.md

Files changed (1): README.md (+140 -18)

README.md CHANGED
---
language:
- es
license: apache-2.0
tags:
- "national library of spain"
- "spanish"
- "bne"
- "roberta-large-bne"
datasets:
- "bne"
metrics:
- "ppl"
widget:
- text: "Este año las campanadas de La Sexta las <mask> Pedroche y Chicote."
- text: "El artista Antonio Orozco es un colaborador de La <mask>."
---
# RoBERTa large trained with data from National Library of Spain (BNE)

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Overview](#overview)
- [Model Description](#model-description)
- [How to Use](#how-to-use)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Evaluation Results](#evaluation-results)
- [Additional Information](#additional-information)
  - [Authors](#authors)
  - [Citation Information](#citation-information)
  - [Contact Information](#contact-information)
  - [Funding](#funding)
  - [Licensing Information](#licensing-information)
  - [Copyright](#copyright)
  - [Disclaimer](#disclaimer)

</details>

## Overview
- **Architecture:** roberta-large
- **Language:** Spanish
- **Task:** fill-mask
- **Data:** BNE

## Model Description
RoBERTa-large-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) large model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text, processed for this work and compiled from the web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

## How to Use
You can use this model directly with a fill-mask pipeline:

```python
>>> from transformers import pipeline
>>> from pprint import pprint
>>> unmasker = pipeline('fill-mask', model='PlanTL-GOB-ES/roberta-large-bne')
>>> pprint(unmasker("Gracias a los datos de la BNE se ha podido <mask> este modelo del lenguaje."))
[{'score': 0.0664491355419159,
  'sequence': ' Gracias a los datos de la BNE se ha podido conocer este modelo del lenguaje.',
  'token': 1910,
  'token_str': ' conocer'},
 {'score': 0.0492338091135025,
  'sequence': ' Gracias a los datos de la BNE se ha podido realizar este modelo del lenguaje.',
  'token': 2178,
  'token_str': ' realizar'},
 {'score': 0.03890657424926758,
  'sequence': ' Gracias a los datos de la BNE se ha podido reconstruir este modelo del lenguaje.',
  'token': 23368,
  'token_str': ' reconstruir'},
 {'score': 0.03662774711847305,
  'sequence': ' Gracias a los datos de la BNE se ha podido desarrollar este modelo del lenguaje.',
  'token': 3815,
  'token_str': ' desarrollar'},
 {'score': 0.030557377263903618,
  'sequence': ' Gracias a los datos de la BNE se ha podido estudiar este modelo del lenguaje.',
  'token': 6361,
  'token_str': ' estudiar'}]
```
Here is how to use this model to get the features of a given text in PyTorch:

```python
>>> from transformers import RobertaTokenizer, RobertaModel
>>> tokenizer = RobertaTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
>>> model = RobertaModel.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
>>> text = "Gracias a los datos de la BNE se ha podido desarrollar este modelo del lenguaje."
>>> encoded_input = tokenizer(text, return_tensors='pt')
>>> output = model(**encoded_input)
>>> print(output.last_hidden_state.shape)
torch.Size([1, 19, 1024])
```

## Intended Uses and Limitations
You can use the raw model for fill-mask or fine-tune it on a downstream task (a fine-tuning sketch is shown below).

The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral.

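The downstream route is the usual Transformers fine-tuning loop. The sketch below is only an illustration and is not part of the original card: the CSV file names, the two-label setup and the hyperparameters are placeholder assumptions you would replace with your own task.

```python
# Illustrative fine-tuning sketch, NOT the authors' recipe.
# Assumptions: a binary classification task stored in train.csv/dev.csv
# with "text" and "label" columns; hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "PlanTL-GOB-ES/roberta-large-bne"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    # Truncate to the model's 512-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-large-bne-finetuned",
    per_device_train_batch_size=8,
    learning_rate=1e-5,   # large models are usually fine-tuned with small learning rates
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic per-batch padding
)
trainer.train()
```

Because the backbone has 355M parameters, a single 16GB GPU may also need a smaller batch size with gradient accumulation.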
## Training

### Training Data

The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.

To obtain a high-quality training corpus, the data was preprocessed with a pipeline of operations including, among others, sentence splitting, language detection, filtering of badly-formed sentences and deduplication of repetitive content (a toy sketch of these steps follows the corpus statistics below). Document boundaries were kept during the process. This resulted in 2TB of clean Spanish corpus. Further global deduplication was then applied, yielding 570GB of text.

Some of the statistics of the corpus:

| Corpora | Number of documents | Number of tokens | Size  |
|---------|---------------------|------------------|-------|
| BNE     | 201,080,084         | 135,733,450,668  | 570GB |

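For illustration only, here is a toy sketch of the kind of cleaning steps listed above (sentence splitting, language detection, filtering of badly-formed sentences and deduplication while keeping document boundaries). It is not the project's actual pipeline; the `detect_language` callable is an assumed placeholder for whatever language identifier you use.

```python
import hashlib
import re
from typing import Callable, Iterable, List

def clean_corpus(documents: Iterable[str],
                 detect_language: Callable[[str], str]) -> List[str]:
    """Toy version of the described cleaning steps; not the real BNE pipeline."""
    seen_hashes = set()
    cleaned_docs = []
    for doc in documents:
        # Naive sentence splitting on terminal punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
        kept = []
        for sentence in sentences:
            # Drop badly-formed / very short fragments.
            if len(sentence.split()) < 3:
                continue
            # Keep Spanish sentences only.
            if detect_language(sentence) != "es":
                continue
            # Deduplicate repetitive content.
            digest = hashlib.md5(sentence.lower().encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            kept.append(sentence)
        # Document boundaries are preserved: one output string per input document.
        if kept:
            cleaned_docs.append(" ".join(kept))
    return cleaned_docs
```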
### Training Procedure
The configuration of the **RoBERTa-large-bne** model is as follows:

- RoBERTa-l: 24 layers, 1024 hidden units, 16 attention heads, 355M parameters.

The pretraining objective used for this architecture is masked language modeling without next-sentence prediction.

The training corpus was tokenized using a byte-level version of the Byte-Pair Encoding (BPE) used in the original [RoBERTa](https://arxiv.org/abs/1907.11692) model, with a vocabulary size of 50,262 tokens (a quick tokenizer check is shown below).

The RoBERTa-large-bne pre-training consists of masked language model training that follows the approach employed for RoBERTa large. The training lasted a total of 96 hours on 32 computing nodes, each with 4 NVIDIA V100 GPUs with 16GB of VRAM.

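As a quick sanity check of the byte-level BPE described above (not from the original card), you can load the released tokenizer and inspect its vocabulary and the subword splits it produces; the example sentence is arbitrary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")

# The reported BPE vocabulary has 50,262 tokens; len() also counts any added special tokens.
print(tokenizer.vocab_size, len(tokenizer))

# Byte-level BPE splits out-of-vocabulary words into subword pieces;
# the 'Ġ' prefix marks a token that starts with a space.
print(tokenizer.tokenize("La Biblioteca Nacional de España custodia el patrimonio bibliográfico."))
```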
## Evaluation

### Evaluation Results
When fine-tuned on downstream tasks, this model achieves the following results:

| Dataset      | Metric   | [**RoBERTa-l**](https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne) |
|--------------|----------|--------|
| MLDoc        | F1       | 0.9702 |
| CoNLL-NERC   | F1       | 0.8823 |
| CAPITEL-NERC | F1       | 0.9051 |
| PAWS-X       | F1       | 0.9150 |
| UD-POS       | F1       | 0.9904 |
| CAPITEL-POS  | F1       | 0.9856 |
| SQAC         | F1       | 0.8202 |
| STS          | Combined | 0.8411 |
| XNLI         | Accuracy | 0.8263 |

For more evaluation details visit our [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish) or [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405).
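The model metadata lists perplexity (`ppl`) as the intrinsic metric. A rough way to probe this on your own text is a pseudo-perplexity score, masking one token at a time and averaging its negative log-likelihood; this is only an illustrative sketch under that assumption, not the evaluation protocol behind the results above.

```python
# Illustrative pseudo-perplexity sketch for the masked LM; not the official evaluation.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "PlanTL-GOB-ES/roberta-large-bne"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)
model.eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each token in turn and accumulate its negative log-likelihood."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total_nll, count = 0.0, 0
    # Skip the <s> and </s> special tokens at the ends of the sequence.
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total_nll -= log_probs[input_ids[i]].item()
        count += 1
    return float(torch.exp(torch.tensor(total_nll / count)))

print(pseudo_perplexity("La Biblioteca Nacional de España rastrea los dominios .es una vez al año."))
```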

## Additional Information

### Authors

The Text Mining Unit from the Barcelona Supercomputing Center.

### Citation Information

If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
```
@article{,
    url = {https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley},
    year = {2022},
}
```

### Contact Information

For further information, send an email to <plantl-gob-es@bsc.es>

### Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

### Licensing Information

This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Copyright

Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022).

### Disclaimer

<details>
<summary>Click to expand</summary>

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner of the models (SEDIA – State Secretariat for Digitalization and Artificial Intelligence) nor the creator (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.


Los modelos publicados en este repositorio tienen una finalidad generalista y están a disposición de terceros. Estos modelos pueden tener sesgos y/u otro tipo de distorsiones indeseables.

Cuando terceros desplieguen o proporcionen sistemas y/o servicios a otras partes usando alguno de estos modelos (o utilizando sistemas basados en estos modelos) o se conviertan en usuarios de los modelos, deben tener en cuenta que es su responsabilidad mitigar los riesgos derivados de su uso y, en todo caso, cumplir con la normativa aplicable, incluyendo la normativa en materia de uso de inteligencia artificial.

En ningún caso el propietario de los modelos (SEDIA – Secretaría de Estado de Digitalización e Inteligencia Artificial) ni el creador (BSC – Barcelona Supercomputing Center) serán responsables de los resultados derivados del uso que hagan terceros de estos modelos.
</details>