Update README.md

---

# Biomedical language model for Spanish

## Table of contents

<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-use)
- [How to Use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Additional Information](#additional-information)
- [Contact Information](#contact-information)
- [Copyright](#copyright)
- [Licensing Information](#licensing-information)
- [Funding](#funding)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
- [Disclaimer](#disclaimer)

</details>

## Model description

Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).

## Intended uses & limitations

The model is ready to use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section).

However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.

## How to Use
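
A minimal usage sketch for the Fill Mask task with the `transformers` pipeline is shown below. The model identifier `PlanTL-GOB-ES/bsc-bio-es` (taken from the model name used in the evaluation table further down) and the example sentence are assumptions made for illustration; adjust them to the actual repository name.

```python
from transformers import pipeline

# Fill-mask pipeline; the model identifier is an assumption based on the
# "bsc-bio-es" name used in the evaluation table of this card.
unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/bsc-bio-es")

# RoBERTa-style models use "<mask>" as the mask token.
predictions = unmasker("El paciente presenta dolor <mask> agudo.")

for pred in predictions:
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```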

## Limitations and bias

## Training

### Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
**biomedical** corpus in Spanish collected from several sources (see next section).
The training corpus has been tokenized using a byte version of Byte-Pair Encoding (BPE)
used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens. The pretraining consists of a masked language model training at the subword level following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
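
As a rough illustration of the tokenization and model setup described above (not the authors' actual pretraining scripts, which live in the linked repository), the sketch below trains a byte-level BPE tokenizer with a 52,000-token vocabulary and instantiates a RoBERTa-base-sized masked language model. The corpus file name and the `min_frequency` cutoff are assumptions.

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# Train a byte-level BPE tokenizer with the 52,000-token vocabulary mentioned above.
# "corpus.txt" is a placeholder for the cleaned Spanish biomedical corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumption; the card does not state a frequency cutoff
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt

# RoBERTa-base-sized masked language model over the same vocabulary
# (the remaining hyperparameters are left at the RoBERTa base defaults).
config = RobertaConfig(vocab_size=52_000)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```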

### Training corpora and preprocessing

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers.
To obtain a high-quality training corpus, a cleaning pipeline has been applied; the cleaning operations and the composition of the corpus are detailed in the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).

## Evaluation

The model has been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets: PharmaCoNER, CANTEMIST and ICTUSnet.

- ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.

We addressed the NER task as a token classification problem using a standard linear layer along with the BIO tagging schema. We compared our models with the general-domain Spanish [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne), the general-domain multilingual model that supports Spanish [mBERT](https://huggingface.co/bert-base-multilingual-cased), the domain-specific English model [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2), and three domain-specific models based on continual pre-training, [mBERT-Galén](https://ieeexplore.ieee.org/document/9430499), [XLM-R-Galén](https://ieeexplore.ieee.org/document/9430499) and [BETO-Galén](https://ieeexplore.ieee.org/document/9430499).
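
As a sketch of this token-classification setup (not the authors' fine-tuning code, which is available in the GitHub repository linked below), the snippet loads the model with a linear classification head over BIO tags. The model identifier and the single-entity label set are assumptions made for the example.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO tagging schema with a single, hypothetical entity type.
labels = ["O", "B-ENT", "I-ENT"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

# Model identifier is an assumption based on the "bsc-bio-es" name in the table below.
checkpoint = "PlanTL-GOB-ES/bsc-bio-es"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),  # adds a linear layer over each token representation
    id2label=id2label,
    label2id=label2id,
)

# During fine-tuning, word-level BIO labels must be aligned to sub-word tokens
# (e.g. label only the first piece of each word and ignore the rest in the loss).
enc = tokenizer("El paciente presenta hipertensión arterial", return_tensors="pt")
pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()
print([id2label[i] for i in pred_ids])
```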

The table below shows the F1 scores obtained:

| Tasks/Models | bsc-bio-es | XLM-R-Galén | BETO-Galén | mBERT-Galén | mBERT | BioBERT | roberta-base-bne |
|--------------|------------|-------------|------------|-------------|-------|---------|------------------|

The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).

## Additional information

### Contact Information

For further information, send an email to <plantl-gob-es@bsc.es>.

### Copyright

Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

### Cite

If you use these models, please cite our work:

```bibtex
```

---

### Contributions

[N/A]

### Disclaimer