mmarimon committed on
Commit 54daa54
1 Parent(s): 7926f83

Update README.md

Files changed (1)
  1. README.md +33 -31
README.md CHANGED
@@ -33,22 +33,22 @@ widget:

 - [Overview](#overview)
 - [Model Description](#model-description)
- - [How to Use](#how-to-use)
 - [Intended Uses and Limitations](#intended-uses-and-limitations)
+ - [How to Use](#how-to-use)
 - [Training](#training)
 - [Training Data](#training-data)
 - [Training Procedure](#training-procedure)
 - [Evaluation](#evaluation)
 - [Evaluation Results](#evaluation-results)
 - [Additional Information](#additional-information)
- - [Authors](#authors)
- - [Citation Information](#citation-information)
- - [Contact Information](#contact-information)
- - [Funding](#funding)
- - [Licensing Information](#licensing-information)
- - [Copyright](#copyright)
- - [Disclaimer](#disclaimer)
-
+ - [Contact Information](#contact-information)
+ - [Copyright](#copyright)
+ - [Licensing Information](#licensing-information)
+ - [Funding](#funding)
+ - [Citation Information](#citation-information)
+ - [Contributions](#contributions)
+ - [Disclaimer](#disclaimer)
+
 </details>

 ## Overview
@@ -60,6 +60,13 @@ widget:
 ## Model Description
 RoBERTa-large-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) large model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

+
+ ## Intended Uses and Limitations
+ You can use the raw model for fill mask or fine-tune it to a downstream task.
+
+ The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
+ unfiltered content from the internet, which is far from neutral. At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+
 ## How to Use
 You can use this model directly with a pipeline for fill mask. Since the generation relies on some randomness, we set a seed for reproducibility:
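The "How to Use" text kept as context in this hunk refers to the standard transformers fill-mask pipeline. The actual snippet, seed value, and example sentence from the model card are not part of this diff; the following is only a minimal sketch, assuming the model is published under the repository id `PlanTL-GOB-ES/roberta-large-bne` (inferred from the model name and the PlanTL contact address elsewhere in the card):

```python
# Sketch only: the repository id, seed value, and example sentence below are
# assumptions, not copied from the model card.
from transformers import pipeline, set_seed

set_seed(42)  # the card notes a seed is set for reproducibility

fill_mask = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-large-bne")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
for prediction in fill_mask("La capital de España es <mask>."):
    print(prediction["token_str"], round(prediction["score"], 4))
```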
 
@@ -102,12 +109,6 @@ Here is how to use this model to get the features of a given text in PyTorch:
 torch.Size([1, 19, 1024])
 ```

- ## Intended Uses and Limitations
- You can use the raw model for fill mask or fine-tune it to a downstream task.
-
- The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
- unfiltered content from the internet, which is far from neutral. At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
-
 ## Training

 ### Training Data
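The hunk above only shows the tail of the card's PyTorch feature-extraction example (`torch.Size([1, 19, 1024])`). The full snippet is not included in this diff; a hedged sketch of what such an extraction typically looks like with the transformers API, under the same assumed repository id and an illustrative input sentence, is:

```python
# Sketch only: the repository id and input sentence are assumptions; only the
# hidden size (1024 for the large model) comes from the output quoted in the card.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "PlanTL-GOB-ES/roberta-large-bne"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

encoded_input = tokenizer("Gracias a los datos de la BNE se ha podido desarrollar este modelo.", return_tensors="pt")
with torch.no_grad():
    output = model(**encoded_input)

# Shape is (batch, sequence_length, hidden_size); the sequence length depends on the
# input, while hidden_size=1024 matches the torch.Size([1, 19, 1024]) shown above.
print(output.last_hidden_state.shape)
```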
@@ -150,11 +151,24 @@ For more evaluation details visit our [GitHub repository](https://github.com/Pla

 ## Additional Information

- ### Authors
+ ### Contact Information

- The Text Mining Unit from Barcelona Supercomputing Center.
+ For further information, send an email to <plantl-gob-es@bsc.es>
+
+ ### Copyright
+
+ Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
+
+ ### Licensing Information
+
+ This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ### Funding
+
+ This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

 ### Citation Information
+
 If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
 ```
 @article{,
@@ -174,21 +188,9 @@ Intelligence (SEDIA) within the framework of the Plan-TL.},
 }
 ```

- ### Contact Information
-
- For further information, send an email to <plantl-gob-es@bsc.es>
-
- ### Funding
-
- This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
-
- ### Licensing Information
+ ### Contributions

- This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
-
- ### Copyright
-
- Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
+ [N/A]

 ### Disclaimer
 