mmarimon committed
Commit: c1f93d4
Parent: e01cca2

Update README.md

Files changed (1): README.md (+16, -31)
README.md CHANGED
@@ -19,21 +19,20 @@ widget:
  <details>
  <summary>Click to expand</summary>

- - [Model Description](#model-description)
- - [Intended Uses and Limitations](#intended-use)
- - [How to Use](#how-to-use)
+ - [Model description](#model-description)
+ - [Intended uses and limitations](#intended-use)
+ - [How to use](#how-to-use)
  - [Limitations and bias](#limitations-and-bias)
  - [Training](#training)
  - [Tokenization and model pretraining](#Tokenization-pretraining)
  - [Training corpora and preprocessing](#training-corpora-preprocessing)
- - [Evaluation and results](#evaluation)
- - [Additional Information](#additional-information)
- - [Contact Information](#contact-information)
+ - [Evaluation](#evaluation)
+ - [Additional information](#additional-information)
+ - [Author](#author)
+ - [Contact information](#contact-information)
  - [Copyright](#copyright)
- - [Licensing Information](#licensing-information)
+ - [Licensing information](#licensing-information)
  - [Funding](#funding)
- - [Citation Information](#citation-information)
- - [Contributions](#contributions)
  - [Disclaimer](#disclaimer)

  </details>
@@ -41,26 +40,18 @@ widget:
  ## Model description
  Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570) "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._".

- ## Intended uses & limitations
-
- The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)
-
- However, the is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
+ ## Intended uses and limitations
+ The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section). However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.


  ## How to use

  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
-
  tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
-
  model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
-
  from transformers import pipeline
-
  unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")
-
  unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
  ```
  ```
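As a complement to the `fill-mask` pipeline shown in the snippet above, here is a minimal sketch of querying the masked-LM head directly; it assumes PyTorch and `transformers` are installed and reuses the same example sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

text = "El único antecedente personal a reseñar era la <mask> arterial."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the five highest-scoring fillers.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print([tokenizer.decode(int(i)).strip() for i in top_ids])
```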
@@ -105,6 +96,7 @@ unmasker("El único antecedente personal a reseñar era la <mask> arterial.")

  This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
  **biomedical** corpus in Spanish collected from several sources (see next section).
+
  The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
  used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens. The pretraining consists of a masked language model training at the subword level following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM, using Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.

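As a quick illustration of the byte-level BPE tokenization described in the hunk above, the following sketch prints the vocabulary size (about 52,000 entries) and the subword split of an arbitrary clinical-style sentence; the sentence is an invented example, not taken from the training corpus.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# Vocabulary size of the byte-level BPE vocabulary.
print(len(tokenizer))

# Subword split of a sample clinical-style sentence.
print(tokenizer.tokenize("El paciente presenta hipertensión arterial y diabetes mellitus tipo 2."))
```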
@@ -139,8 +131,7 @@ The result is a medium-size biomedical corpus for Spanish composed of about 963M



- ## Evaluation and results
-
+ ## Evaluation
  The model has been evaluated on the Named Entity Recognition (NER) using the following datasets:

  - [PharmaCoNER](https://zenodo.org/record/4270158): is a track on chemical and drug mention recognition from Spanish medical texts (for more info see: https://temu.bsc.es/pharmaconer/).
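Since the card states that the model is intended to be fine-tuned on downstream tasks such as NER, here is a minimal sketch of attaching a token-classification head to this checkpoint; the label set is hypothetical (the real tags depend on the target dataset, e.g. PharmaCoNER) and the forward pass only checks output shapes, since the head is randomly initialized until fine-tuning.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label set for illustration only.
labels = ["O", "B-CHEM", "I-CHEM"]

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForTokenClassification.from_pretrained(
    "BSC-TeMU/roberta-base-biomedical-es",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Forward pass on one sentence: one logit vector per subword token.
inputs = tokenizer("Se administró paracetamol y omeprazol al paciente.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, sequence_length, 3) -- head is untrained until fine-tuned on labelled data
```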
@@ -161,24 +152,23 @@ The evaluation results are compared against the [mBERT](https://huggingface.co/b

  ## Additional information

- ### Contact Information
+ ### Author
+ Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

+ ### Contact information
  For further information, send an email to <plantl-gob-es@bsc.es>

  ### Copyright
-
  Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

  ### Licensing information
-
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

  ### Funding
-
  This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.


- ## Citation Information
+ ## Citation information
  If you use our models, please cite our latest preprint:

  ```bibtex
@@ -210,11 +200,6 @@ If you use our Medical Crawler corpus, please cite the preprint:
  ```


- ### Contributions
-
- [N/A]
-
-
  ### Disclaimer

  <details>
 