ccasimiro committed on
Commit
5a46ae0
1 Parent(s): 7ab93ed

update README.md

Files changed (1)
  1. README.md +125 -0
README.md CHANGED
@@ -13,3 +13,128 @@ widget:
13
  - text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
14
  ---
15
 
16
+ # Biomedical language model for Spanish
17
+
18
+ ## BibTeX citation
19
+
20
+ If you use any of these resources (datasets or models) in your work, please cite our latest paper:
21
+
22
+ ```bibtex
23
+ @misc{carrino2021biomedical,
24
+ title={Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario},
25
+ author={Casimiro Pio Carrino and Jordi Armengol-Estapé and Asier Gutiérrez-Fandiño and Joan Llop-Palao and Marc Pàmies and Aitor Gonzalez-Agirre and Marta Villegas},
26
+ year={2021},
27
+ eprint={2109.03570},
28
+ archivePrefix={arXiv},
29
+ primaryClass={cs.CL}
30
+ }
31
+ ```
32
+
33
+ ## Model and tokenization
34
+ This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
35
+ biomedical corpus collected from several sources (see next section).
36
+
37
+ ## Training corpora and preprocessing
38
+
39
+ The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers:
40
+
41
+ | Name | No. tokens | Description |
42
+ |-----------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
43
+ | [Medical crawler](https://zenodo.org/record/4561971#.YTtwM32xXbQ) | 745,705,946 | Crawl of more than 3,000 URLs belonging to the Spanish health domain |
44
+ | Scielo | 60,007,289 | Collection of biomedical literature in Spanish crawled from the Scielo repository in 2019 |
45
+ | [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines |
46
+ | Wikipedia_life_sciences | 13,890,501 | Wikipedia articles belonging to the Life Sciences category crawled on 04/01/2021 |
47
+ | Patents | 13,463,387 | Google Patents in the medical domain for Spain (Spanish). The accepted codes (medical domain) for the JSON files of patents are: "A61B", "A61C", "A61F", "A61H", "A61K", "A61L", "A61M", "A61P" |
48
+ | [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from a parallel corpus made out of PDF documents from the European Medicines Agency |
49
+ | [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side documents extracted from a collection of Spanish-English parallel corpora consisting of biomedical scientific literature. The collection of parallel resources is aggregated from the IBECS, SciELO, Pubmed and MedlinePlus sources. |
50
+ | PubMed | 1,858,966 | Collection of biomedical literature in Spanish crawled from the PubMed repository in 2019 |
51
+
52
+ To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:
53
+
54
+ - data parsing in different formats
55
+ - sentence splitting
56
+ - language detection
57
+ - filtering of ill-formed sentences
58
+ - deduplication of repeated content
59
+ - preservation of the original document boundaries
60
+
61
+ Finally, the corpora are concatenated and a further global deduplication across the corpora has been applied.
62
+ The result is a medium-size biomedical corpus for Spanish composed of about 860M tokens.
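
The deduplication steps listed above can be sketched as follows. This is a minimal illustration and not the pipeline actually used for the corpus: the function names `_fingerprint` and `deduplicate`, the whitespace and case normalization, and the choice of SHA-1 hashing are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, List


def _fingerprint(sentence: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace before hashing.
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents: Iterable[List[str]]) -> List[List[str]]:
    """Drop repeated sentences across all documents, keeping document boundaries."""
    seen = set()
    cleaned = []
    for sentences in documents:
        kept = []
        for sentence in sentences:
            key = _fingerprint(sentence)
            if key in seen:
                continue  # already seen in this or an earlier document
            seen.add(key)
            kept.append(sentence)
        if kept:  # documents left empty after deduplication are dropped
            cleaned.append(kept)
    return cleaned


docs = [
    ["La paciente ingresó.", "La paciente  ingresó."],
    ["La paciente ingresó.", "Se pautó tratamiento."],
    ["Se pautó tratamiento."],
]
print(deduplicate(docs))
```

Each document is represented as a list of sentences, so the output keeps one list per surviving document and the original boundaries stay intact.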
63
+
64
+ ## Evaluation and results
65
+
66
+ The model has been evaluated on Named Entity Recognition (NER) using the following datasets:
67
+
68
+ - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition in Spanish medical texts (for more info see: https://temu.bsc.es/pharmaconer/).
69
+
70
+ - [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task focusing on named entity recognition of tumor morphology in Spanish.
71
+
72
+ - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.
73
+
74
+ The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:
75
+
76
+ | F1 - Precision - Recall | roberta-base-biomedical-es | mBERT | BETO |
77
+ |---------------------------|----------------------------|-------------------------------|-------------------------|
78
+ | PharmaCoNER | **89.48** - **87.85** - **91.18** | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
79
+ | CANTEMIST | **83.87** - **81.70** - **86.17** | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
80
+ | ICTUSnet | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
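
As a reading aid for the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 from the rounded precision and recall values reported for roberta-base-biomedical-es; the function name `f1_score` is chosen for this example only, and because the inputs are already rounded, the recomputed value can differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the roberta-base-biomedical-es column.
reported = {
    "PharmaCoNER": (87.85, 91.18),
    "CANTEMIST": (81.70, 86.17),
    "ICTUSnet": (85.56, 90.83),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```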
81
+
82
+
83
+ ## Intended uses & limitations
84
+
85
+ The model is ready to use only for masked language modelling, that is, to perform the Fill Mask task (try the inference API or read the next section).
86
+
87
+ However, it is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification or Named Entity Recognition.
88
+
89
+ ---
90
+
91
+ ## How to use
92
+
93
+ ```python
94
+ from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
95
+
96
+ tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
97
+
98
+ model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
99
+
100
+ # Build the fill-mask pipeline from the tokenizer and model loaded above
101
+
102
+ unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
103
+
104
+ unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
105
+ ```
106
+ ```
107
+ # Output
108
+ [
109
+ {
110
+ "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
111
+ "score": 0.9855039715766907,
112
+ "token": 3529,
113
+ "token_str": " hipertensión"
114
+ },
115
+ {
116
+ "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
117
+ "score": 0.0039140828885138035,
118
+ "token": 1945,
119
+ "token_str": " diabetes"
120
+ },
121
+ {
122
+ "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
123
+ "score": 0.002484665485098958,
124
+ "token": 11483,
125
+ "token_str": " hipotensión"
126
+ },
127
+ {
128
+ "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
129
+ "score": 0.0023484621196985245,
130
+ "token": 12238,
131
+ "token_str": " Hipertensión"
132
+ },
133
+ {
134
+ "sequence": " El único antecedente personal a reseñar era la presión arterial.",
135
+ "score": 0.0008009297889657319,
136
+ "token": 2267,
137
+ "token_str": " presión"
138
+ }
139
+ ]
140
+ ```
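
The output above is a list of dictionaries, one per candidate token. A minimal sketch of post-processing it is given below, assuming the pipeline returns the `score` and `token_str` keys shown in the output; the helper name `confident_fills` and the `min_score` threshold are hypothetical and not part of the `transformers` API.

```python
from typing import Dict, List, Tuple


def confident_fills(predictions: List[Dict], min_score: float = 0.01) -> List[Tuple[str, float]]:
    """Return (token, score) pairs at or above min_score, best first."""
    fills = [
        (p["token_str"].strip(), p["score"])  # strip the leading space of the token
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(fills, key=lambda item: item[1], reverse=True)


# Two entries copied from the output shown above.
sample = [
    {"score": 0.9855039715766907, "token": 3529, "token_str": " hipertensión"},
    {"score": 0.0039140828885138035, "token": 1945, "token_str": " diabetes"},
]
print(confident_fills(sample))
```

In practice the list returned by `unmasker(...)` would be passed in place of `sample`.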