Update README.md
README.md
CHANGED
@@ -33,22 +33,21 @@ widget:
<summary>Click to expand</summary>

- [Overview](#overview)
- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Training data](#training-data)
- [Training procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
- [Author](#author)
- [Contact information](#contact-information)
- [Copyright](#copyright)
- [Licensing information](#licensing-information)
- [Funding](#funding)
- [Citation Information](#citation-information)
- [Disclaimer](#disclaimer)

</details>
@@ -59,18 +58,13 @@ widget:
- **Task:** fill-mask
- **Data:** BNE

## Model description
RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the [RoBERTa](https://arxiv.org/abs/1907.11692) base model and has been pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawls performed by the [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) from 2009 to 2019.

## Intended uses and limitations
You can use the raw model for fill-mask or fine-tune it on a downstream task.

## How to use
You can use this model directly with a fill-mask pipeline. Since the generation relies on some randomness, we set a seed for reproducibility:

```python
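# NOTE: the body of the original snippet is elided between the diff hunks; the
# lines below are only a sketch of typical fill-mask pipeline usage with this
# checkpoint, and the prompt is an arbitrary example, not the one from the card.
from transformers import pipeline, set_seed
from pprint import pprint

set_seed(42)  # the card notes a seed is set so the completions are reproducible
unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")
pprint(unmasker("El objetivo de la vida es <mask>."))  # <mask> is the model's mask token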
@@ -167,12 +161,11 @@ At the time of submission, no measures have been taken to estimate the bias and
'sequence': 'Mohammed está pensando en ello.',
'token': 1577,
'token_str': ' ello'}]
```

## Training

### Training data

The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.
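The cleaning and deduplication pipeline that turns these crawls into the training corpus is described in the paper and repository. Purely as an illustration of what the raw WARC crawl data looks like, the sketch below reads records with the third-party `warcio` package; the package choice and the file name are assumptions for demonstration, not details given in the card:

```python
# Hedged illustration only: the card says the corpus starts from 59TB of WARC
# crawl files but does not name the tooling; warcio and the file name below
# are assumptions.
from warcio.archiveiterator import ArchiveIterator

with open("bne-crawl-2019.warc.gz", "rb") as stream:  # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # actual fetched pages
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()  # raw HTML bytes to be cleaned
```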
@@ -184,7 +177,8 @@ Some of the statistics of the corpus:
|---------|---------------------|------------------|-------|
| BNE     | 201,080,084         | 135,733,450,668  | 570GB |

### Training procedure
The configuration of the **RoBERTa-base-bne** model is as follows:
- RoBERTa-b: 12-layer, 768-hidden, 12-heads, 125M parameters.
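As a quick sanity check (a sketch using standard `transformers` loading code, not something taken from the card), these architecture figures can be read back from the released checkpoint's configuration:

```python
# Hedged sketch: verify the 12-layer / 768-hidden / 12-head / ~125M-parameter
# figures against the published checkpoint.
from transformers import AutoConfig, AutoModelForMaskedLM

name = "PlanTL-GOB-ES/roberta-base-bne"
config = AutoConfig.from_pretrained(name)
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

model = AutoModelForMaskedLM.from_pretrained(name)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```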
@@ -194,8 +188,8 @@ The RoBERTa-base-bne pre-training consists of a masked language model training t
## Evaluation

When fine-tuned on downstream tasks, this model achieves the following results:

| Dataset | Metric | [**RoBERTa-b**](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) |
|---------|--------|-------------------------------------------------------------------------|
| MLDoc   | F1     | 0.9664                                                                  |
@@ -210,25 +204,24 @@ When fine-tuned on downstream tasks, this model achieves the following results:
For more evaluation details, visit our [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish) or the [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405).
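The numbers above come from fine-tuning this checkpoint on each downstream dataset. The snippet below is only a sketch of how such a fine-tune would typically start; the `num_labels=4` value is an assumption for an MLDoc-style 4-class setup and the training loop itself is left to the repository and paper:

```python
# Hedged sketch of loading the checkpoint for downstream fine-tuning; the
# actual training setup is described in the PlanTL-GOB-ES/lm-spanish repo
# and the paper, not here.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "PlanTL-GOB-ES/roberta-base-bne"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=4)  # assumed label count
# ... tokenize the task dataset and train with transformers.Trainer or a custom loop
```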
## Additional information

### Author
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

### Contact information
For further information, send an email to <plantl-gob-es@bsc.es>.

### Copyright
Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing information
This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding
This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

### Citation information
If you use this model, please cite our [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
```
@article{,
@@ -246,13 +239,8 @@ Intelligence (SEDIA) within the framework of the Plan-TL.},
url = {https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley},
year = {2022},
}
```

### Disclaimer

<details>