Update README.md
README.md
<details>
<summary>Click to expand</summary>

- [Overview](#overview)
- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
  - [Training data](#training-data)
  - [Training procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
  - [Author](#author)
  - [Contact information](#contact-information)
  - [Copyright](#copyright)
  - [Licensing information](#licensing-information)
  - [Funding](#funding)
  - [Citation information](#citation-information)
  - [Disclaimer](#disclaimer)

</details>

## Overview

- **Task:** fill-mask
- **Data:** BNE

## Model description

RoBERTa-large-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) large model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

## Intended uses and limitations

You can use the raw model for fill mask or fine-tune it on a downstream task.
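
Loading the raw checkpoint for masked language modelling takes two `Auto` classes; a minimal sketch (the fill-mask pipeline in the next section wraps the same components):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Download (or load from cache) the tokenizer and the pretrained weights.
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")
model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")
```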
## How to use

You can use this model directly with a pipeline for fill mask. Since the generation relies on some randomness, we set a seed for reproducibility. The snippet below is a minimal sketch of that usage, and the example sentence is illustrative:

```python
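from transformers import pipeline, set_seed
from pprint import pprint

# Fix the random seed so the sampled mask predictions are reproducible.
set_seed(42)

# Load the checkpoint from the Hugging Face Hub into a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-large-bne")

# Illustrative sentence with a single <mask> token for the model to complete;
# the pipeline returns the top-scoring completions with their probabilities.
pprint(unmasker("Gracias a los datos de la BNE se ha podido <mask> este modelo del lenguaje."))
```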
## Training

### Training data

The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.

Some of the statistics of the corpus:

| Corpus | Number of documents | Number of tokens | Size |
|---------|---------------------|------------------|-----------|
| BNE     | 201,080,084         | 135,733,450,668  | 570GB     |

### Training procedure

The configuration of the **RoBERTa-large-bne** model is as follows:

- RoBERTa-l: 24-layer, 1024-hidden, 16-heads, 355M parameters.
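
As a quick sanity check, the published configuration can be loaded with the `transformers` `AutoConfig` API; this is a minimal sketch, and the printed fields simply mirror the layer, hidden-size, and attention-head counts listed above:

```python
from transformers import AutoConfig

# Fetch the configuration that ships with the checkpoint.
config = AutoConfig.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")

# Mirrors the 24-layer, 1024-hidden, 16-head description above.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```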
## Evaluation
When fine-tuned on downstream tasks, this model achieves the following results:

| Dataset | Metric | [**RoBERTa-l**](https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne) |
|--------------|----------|------------|

For more evaluation details, visit our [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish) or see our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405).
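
Fine-tuning for such downstream tasks typically starts from this checkpoint with a freshly initialized task head; a minimal sketch, where the label count is a placeholder for whatever task you choose:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder label count; set this to match your downstream task.
num_labels = 4

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-large-bne")
model = AutoModelForSequenceClassification.from_pretrained(
    "PlanTL-GOB-ES/roberta-large-bne", num_labels=num_labels
)

# The classification head is freshly initialized; fine-tune it on task data,
# for example with transformers' Trainer or a custom training loop.
```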
## Additional information

### Author

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
### Contact information

For further information, send an email to <plantl-gob-es@bsc.es>
### Copyright

Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
### Licensing information

This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
### Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
### Citation information

If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405).
### Disclaimer
<details>