ccasimiro committed
Commit 078ca48
1 Parent(s): 3501e72

Update README.md

Files changed (1)
  1. README.md +11 -61
README.md CHANGED
@@ -56,23 +56,26 @@ Eventually, the clinical corpus is concatenated to the cleaned biomedical corpus
 
 ## Evaluation and results
 
- The model has been evaluated on Named Entity Recognition (NER) using the following datasets:
+
+ The model has been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:
 
  - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
 
  - [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task focusing on named entity recognition of tumor morphology in Spanish (for more information, see https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
 
  - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.
 
- The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:
-
- | F1 - Precision - Recall | roberta-base-biomedical-clinical-es | mBERT                 | BETO                  |
- |-------------------------|-------------------------------------|-----------------------|-----------------------|
- | PharmaCoNER             | **90.04** - **88.92** - **91.18**   | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
- | CANTEMIST               | **83.34** - **81.48** - **85.30**   | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
- | ICTUSnet                | **88.08** - **84.92** - **91.50**   | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
+
+ We addressed the NER task as a token classification problem using a standard linear layer along with the BIO tagging schema. We compared our models with the general-domain Spanish [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne), the general-domain multilingual model that supports Spanish [mBERT](https://huggingface.co/bert-base-multilingual-cased), the domain-specific English model [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2), and three domain-specific models based on continual pre-training: [mBERT-Galén](https://ieeexplore.ieee.org/document/9430499), [XLM-R-Galén](https://ieeexplore.ieee.org/document/9430499) and [BETO-Galén](https://ieeexplore.ieee.org/document/9430499).
+ The table below shows the F1 scores obtained:
+
+ | Tasks/Models | bsc-bio-ehr-es | XLM-R-Galén | BETO-Galén | mBERT-Galén | mBERT  | BioBERT | roberta-base-bne |
+ |--------------|----------------|-------------|------------|-------------|--------|---------|------------------|
+ | PharmaCoNER  | **0.8913**     | 0.8754      | 0.8537     | 0.8594      | 0.8671 | 0.8545  | 0.8474           |
+ | CANTEMIST    | **0.8340**     | 0.8078      | 0.8153     | 0.8168      | 0.8116 | 0.8070  | 0.7875           |
+ | ICTUSnet     | **0.8756**     | 0.8716      | 0.8498     | 0.8509      | 0.8631 | 0.8521  | 0.8677           |
+
+ The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 
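To make the added evaluation setup concrete, here is a minimal sketch of NER fine-tuning as token classification: a linear classification layer on top of the encoder, predicting BIO tags, as the added lines above describe. The checkpoint name is the one used elsewhere in this card, but the label set and example sentence are invented for illustration; the actual fine-tuning scripts are those in the linked repository.

```python
# Minimal illustrative sketch, NOT the project's fine-tuning script:
# token classification = a linear layer over the encoder's per-token
# hidden states, with BIO labels. The label set below is invented.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-DRUG", "I-DRUG"]  # hypothetical BIO tag set
model_name = "PlanTL-GOB-ES/roberta-base-biomedical-es"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),  # attaches a freshly initialised linear head
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# One forward pass: each sub-word token gets one score per BIO label.
# "Se administró ibuprofeno al paciente." = "Ibuprofen was administered to the patient."
encoding = tokenizer("Se administró ibuprofeno al paciente.", return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # shape: (1, sequence_length, num_labels)
predicted = [model.config.id2label[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]), predicted)))
```

Until the head is fine-tuned on task data, these predictions are random; entity-level F1 scores of the kind reported in the table are conventionally computed over the predicted BIO sequences with a library such as seqeval.
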
 ## Intended uses & limitations
 
 The model is ready to use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section).
@@ -84,59 +87,6 @@ To be announced soon!
 
 ---
 
- ---
-
- ## How to use
-
- ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM
-
- tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-
- model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-
- from transformers import pipeline
-
- unmasker = pipeline('fill-mask', model="PlanTL-GOB-ES/roberta-base-biomedical-es")
-
- unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
- ```
- ```
- # Output
- [
-   {
-     "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
-     "score": 0.9855039715766907,
-     "token": 3529,
-     "token_str": " hipertensión"
-   },
-   {
-     "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
-     "score": 0.0039140828885138035,
-     "token": 1945,
-     "token_str": " diabetes"
-   },
-   {
-     "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
-     "score": 0.002484665485098958,
-     "token": 11483,
-     "token_str": " hipotensión"
-   },
-   {
-     "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
-     "score": 0.0023484621196985245,
-     "token": 12238,
-     "token_str": " Hipertensión"
-   },
-   {
-     "sequence": " El único antecedente personal a reseñar era la presión arterial.",
-     "score": 0.0008009297889657319,
-     "token": 2267,
-     "token_str": " presión"
-   }
- ]
- ```
-
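As an aside on the snippet removed above: the fill-mask pipeline returns five candidates because its standard `top_k` argument defaults to 5. A minimal sketch narrowing the output to the single best completion:

```python
# Keep only the single best completion instead of the default five.
# top_k is a standard argument of the transformers fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-biomedical-es")
print(unmasker("El único antecedente personal a reseñar era la <mask> arterial.", top_k=1))
```
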
 ## Funding
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
 
92