ccasimiro committed
Commit 078ca48
1 Parent(s): 3501e72

Update README.md

Files changed (1)
  1. README.md +11 -61
README.md CHANGED
@@ -56,23 +56,26 @@ Eventually, the clinical corpus is concatenated to the cleaned biomedical corpus
 
 ## Evaluation and results
 
- The model has been evaluated on Named Entity Recognition (NER) using the following datasets:
+
+ The model has been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:
 
  - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
 
  - [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task focusing on named entity recognition of tumor morphology in Spanish (for more information, see https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
 
  - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.
 
- The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:
-
- | F1 - Precision - Recall | roberta-base-biomedical-clinical-es | mBERT                 | BETO                  |
- |-------------------------|-------------------------------------|-----------------------|-----------------------|
- | PharmaCoNER             | **90.04** - **88.92** - **91.18**   | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
- | CANTEMIST               | **83.34** - **81.48** - **85.30**   | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
- | ICTUSnet                | **88.08** - **84.92** - **91.50**   | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
+
+ We addressed the NER task as a token classification problem using a standard linear layer along with the BIO tagging schema. We compared our models with the general-domain Spanish [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne), the general-domain multilingual model that supports Spanish [mBERT](https://huggingface.co/bert-base-multilingual-cased), the domain-specific English model [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2), and three domain-specific models based on continual pre-training: [mBERT-Galén](https://ieeexplore.ieee.org/document/9430499), [XLM-R-Galén](https://ieeexplore.ieee.org/document/9430499) and [BETO-Galén](https://ieeexplore.ieee.org/document/9430499).
+ The table below shows the F1 scores obtained:
+
+ | Tasks/Models | bsc-bio-ehr-es | XLM-R-Galén | BETO-Galén | mBERT-Galén | mBERT  | BioBERT | roberta-base-bne |
+ |--------------|----------------|-------------|------------|-------------|--------|---------|------------------|
+ | PharmaCoNER  | **0.8913**     | 0.8754      | 0.8537     | 0.8594      | 0.8671 | 0.8545  | 0.8474           |
+ | CANTEMIST    | **0.8340**     | 0.8078      | 0.8153     | 0.8168      | 0.8116 | 0.8070  | 0.7875           |
+ | ICTUSnet     | **0.8756**     | 0.8716      | 0.8498     | 0.8509      | 0.8631 | 0.8521  | 0.8677           |
+
+ The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 
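To make the added evaluation setup concrete, here is a minimal sketch of NER fine-tuning as token classification: a linear classification layer on top of the encoder, predicting BIO tags, as the added lines above describe. The checkpoint name is the one used elsewhere in this card, but the label set and example sentence are invented for illustration; the actual fine-tuning scripts are those in the linked repository.

```python
# Minimal illustrative sketch, NOT the project's fine-tuning script:
# token classification = a linear layer over the encoder's per-token
# hidden states, with BIO labels. The label set below is invented.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-DRUG", "I-DRUG"]  # hypothetical BIO tag set
model_name = "PlanTL-GOB-ES/roberta-base-biomedical-es"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),  # attaches a freshly initialised linear head
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# One forward pass: each sub-word token gets one score per BIO label.
# "Se administró ibuprofeno al paciente." = "Ibuprofen was administered to the patient."
encoding = tokenizer("Se administró ibuprofeno al paciente.", return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # shape: (1, sequence_length, num_labels)
predicted = [model.config.id2label[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]), predicted)))
```

Until the head is fine-tuned on task data, these predictions are random; entity-level F1 scores of the kind reported in the table are conventionally computed over the predicted BIO sequences with a library such as seqeval.
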
 ## Intended uses & limitations
 
 The model is ready to use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section).
@@ -84,59 +87,6 @@ To be announced soon!
 
 ---
 
- ---
-
- ## How to use
-
- ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM
-
- tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-
- model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-
- from transformers import pipeline
-
- unmasker = pipeline('fill-mask', model="PlanTL-GOB-ES/roberta-base-biomedical-es")
-
- unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
- ```
- ```
- # Output
- [
-   {
-     "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
-     "score": 0.9855039715766907,
-     "token": 3529,
-     "token_str": " hipertensión"
-   },
-   {
-     "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
-     "score": 0.0039140828885138035,
-     "token": 1945,
-     "token_str": " diabetes"
-   },
-   {
-     "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
-     "score": 0.002484665485098958,
-     "token": 11483,
-     "token_str": " hipotensión"
-   },
-   {
-     "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
-     "score": 0.0023484621196985245,
-     "token": 12238,
-     "token_str": " Hipertensión"
-   },
-   {
-     "sequence": " El único antecedente personal a reseñar era la presión arterial.",
-     "score": 0.0008009297889657319,
-     "token": 2267,
-     "token_str": " presión"
-   }
- ]
- ```
-
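As an aside on the snippet removed above: the fill-mask pipeline returns five candidates because its standard `top_k` argument defaults to 5. A minimal sketch narrowing the output to the single best completion:

```python
# Keep only the single best completion instead of the default five.
# top_k is a standard argument of the transformers fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-biomedical-es")
print(unmasker("El único antecedente personal a reseñar era la <mask> arterial.", top_k=1))
```
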
 ## Funding
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
 
92