IDelaIglesia committed on
Commit
8daedb8
1 Parent(s): 60218d5

Model Upload

README.md CHANGED
@@ -1,3 +1,205 @@
  ---
  license: apache-2.0
+ language:
+ - en
+ - es
+ datasets:
+ - HiTZ/Multilingual-Medical-Corpus
+ tags:
+ - biomedical
+ - medical
+ - clinical
+ - spanish
+ - multilingual
+ widget:
+ - text: "Las radiologías óseas de cuerpo entero no detectan alteraciones <mask>, ni alteraciones vertebrales."
+ - text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
+ - text: "Percutaneous transluminal coronary <mask> of both the LAD and Cx was done."
+ - text: "Clinical examination showed a non-pulsatile, painless, axial, irreducible exophthalmia with no sign of <mask> or keratitis, and right monocular blindness, right ptosis."
  ---
+
+ <br>
+ <div style="text-align: center;">
+ <img src="https://huggingface.co/HiTZ/EriBERTa-base/resolve/main/eriberta_icon.png" style="height: 175px;display: block;margin-left: auto;margin-right: auto;">
+ </div>
+
+ <h1 style="text-align: center;">
+ <b>EriBERTa</b>
+ <br>
+ A Bilingual Pre-Trained Language Model
+ <br>
+ for Clinical Natural Language Processing
+ </h1>
+
+ <br>
+
+ <p style="text-align: justify;">
+ We introduce EriBERTa, a bilingual domain-specific language model pre-trained on extensive medical and clinical corpora. We demonstrate that EriBERTa outperforms previous Spanish language models in the clinical domain, showcasing its superior capabilities in understanding medical texts and extracting meaningful information.
+ Moreover, EriBERTa exhibits promising transfer learning abilities, allowing for knowledge transfer from one language to another. This aspect is particularly beneficial given the scarcity of Spanish clinical data.
+ </p>
+
+ - 📖 Paper: [EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing](https://arxiv.org/abs/2306.07373)
+
+
+ ## How to Get Started with the Model
+
+ You can load the model using:
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # Load the tokenizer and the masked-language-model head from the Hub.
+ tokenizer = AutoTokenizer.from_pretrained("HiTZ/EriBERTa-base")
+ model = AutoModelForMaskedLM.from_pretrained("HiTZ/EriBERTa-base")
+ ```
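Because the checkpoint is a masked language model, a quick way to try it is the `fill-mask` pipeline from `transformers`. The sketch below is illustrative only: `check_single_mask` and `fill_mask` are hypothetical helper names, not part of the model card, and the call downloads the weights on first use.

```python
MASK_TOKEN = "<mask>"  # mask token declared in special_tokens_map.json


def check_single_mask(text: str) -> str:
    """Return `text` unchanged if it contains exactly one mask token."""
    found = text.count(MASK_TOKEN)
    if found != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN}, found {found}")
    return text


def fill_mask(text: str, top_k: int = 5):
    """Return (token, score) pairs for the single masked position in `text`."""
    # Imported lazily so the validation helper works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="HiTZ/EriBERTa-base")
    predictions = unmasker(check_single_mask(text), top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

Calling `fill_mask("Percutaneous transluminal coronary <mask> of both the LAD and Cx was done.")` would return the top five candidate tokens with their scores. The single-mask check is a design choice for this sketch: with several masks the pipeline returns one list per mask, which the list comprehension above does not handle.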
+
+
+ # Model Description
+
+ - **Developed by**: Iker De la Iglesia, Aitziber Atutxa, Koldo Gojenola, and Ander Barrena
+ - **Contact**: [Iker De la Iglesia](mailto:iker.delaiglesia@ehu.eus) and [Ander Barrena](mailto:ander.barrena@ehu.eus)
+ - **Language(s) (NLP)**: English, Spanish
+ - **License**: apache-2.0
+ - **Funding**:
+   - The Spanish Ministry of Science and Innovation, MCIN/AEI/ 10.13039/501100011033/FEDER projects:
+     - Proyectos de Generación de Conocimiento 2022 (EDHIA PID2022-136522OB-C22)
+     - DOTT-HEALTH/PAT-MED PID2019-543106942RB-C3.
+   - EU NextGeneration EU/PRTR (DeepR3 TED2021-130295B-C31, ANTIDOTE PCI2020-120717-2 EU ERA-Net CHIST-ERA).
+   - Basque Government:
+     - IXA IT1570-22.
+
+
+ ## Model Details
+
+ <table style="border: 1px; border-collapse: collapse;">
+ <caption>Pre-Training settings for EriBERTa-base.</caption>
+ <tbody>
+ <tr>
+ <td>No. of parameters</td>
+ <td>~125M</td>
+ </tr>
+ <tr>
+ <td>Vocabulary size</td>
+ <td>64k</td>
+ </tr>
+ <tr>
+ <td>Sequence Length</td>
+ <td>512</td>
+ </tr>
+ <tr>
+ <td>Tokens/step</td>
+ <td>2M</td>
+ </tr>
+ <tr>
+ <td>Steps</td>
+ <td>125k</td>
+ </tr>
+ <tr>
+ <td>Total Tokens</td>
+ <td>4.5B</td>
+ </tr>
+ <tr>
+ <td>Scheduler</td>
+ <td>Linear with Warm-up</td>
+ </tr>
+ <tr>
+ <td>Peak LR</td>
+ <td>2.683e-4</td>
+ </tr>
+ <tr>
+ <td>Warm-up Steps</td>
+ <td>7.5k</td>
+ </tr>
+ </tbody>
+ </table>
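The figures in the table can be related to each other with a little arithmetic. The sketch below rests on assumptions the model card does not confirm: that "2M" means 2 × 10^6 tokens per optimizer step, that "Total Tokens" (4.5B) is the size of the pre-training corpus, and that "Linear with Warm-up" decays linearly to zero at the final step.

```python
# Values copied from the pre-training settings table.
TOKENS_PER_STEP = 2_000_000      # "2M" (assumed decimal, not 2**21)
STEPS = 125_000                  # "125k"
CORPUS_TOKENS = 4_500_000_000    # "4.5B" (assumed to be the corpus size)
SEQ_LEN = 512
WARMUP_STEPS = 7_500             # "7.5k"
PEAK_LR = 2.683e-4

# Tokens seen over the whole run, and how many corpus passes that implies.
tokens_processed = TOKENS_PER_STEP * STEPS
approx_epochs = tokens_processed / CORPUS_TOKENS
sequences_per_step = TOKENS_PER_STEP / SEQ_LEN


def linear_warmup_lr(step: int) -> float:
    """Learning rate at `step`: linear ramp to PEAK_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Decay-to-zero is an assumption; the table only says "Linear with Warm-up".
    return PEAK_LR * max(0.0, (STEPS - step) / (STEPS - WARMUP_STEPS))


print(f"tokens processed: {tokens_processed:,}")
print(f"approx. passes over the corpus: {approx_epochs:.1f}")
print(f"full-length sequences per step: {sequences_per_step:.0f}")
```

Under these readings the run processes far more tokens than the corpus contains, so each document would be seen many times during pre-training.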
+
+
+ ## Training Data
+
+ <table style="border: 1px; border-collapse: collapse;">
+ <caption>Data sources and word counts by language.</caption>
+ <thead>
+ <tr>
+ <th>Language</th>
+ <th>Source</th>
+ <th>Words</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td rowspan="4">English</td>
+ <td>ClinicalTrials</td>
+ <td>127.4M</td>
+ </tr>
+ <tr>
+ <td>EMEA</td>
+ <td>12M</td>
+ </tr>
+ <tr>
+ <td>PubMed</td>
+ <td>968.4M</td>
+ </tr>
+ <tr>
+ <td>MIMIC-III</td>
+ <td>206M</td>
+ </tr>
+ <tr>
+ <td rowspan="6">Spanish</td>
+ <td>EMEA</td>
+ <td>13.6M</td>
+ </tr>
+ <tr>
+ <td>PubMed</td>
+ <td>8.4M</td>
+ </tr>
+ <tr>
+ <td>Medical Crawler</td>
+ <td>918M</td>
+ </tr>
+ <tr>
+ <td>SPACC</td>
+ <td>350K</td>
+ </tr>
+ <tr>
+ <td>UFAL</td>
+ <td>10.5M</td>
+ </tr>
+ <tr>
+ <td>WikiMed</td>
+ <td>5.2M</td>
+ </tr>
+ </tbody>
+ </table>
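To see how the corpus splits between the two languages, the word counts in the table can simply be summed. This is a small sketch using only the numbers listed above, with "350K" written as 0.35 million; the variable names are illustrative.

```python
# Word counts in millions, copied from the training-data table.
WORDS_M = {
    "English": {
        "ClinicalTrials": 127.4,
        "EMEA": 12.0,
        "PubMed": 968.4,
        "MIMIC-III": 206.0,
    },
    "Spanish": {
        "EMEA": 13.6,
        "PubMed": 8.4,
        "Medical Crawler": 918.0,
        "SPACC": 0.35,  # listed as 350K
        "UFAL": 10.5,
        "WikiMed": 5.2,
    },
}

# Per-language totals and each language's share of all words.
totals = {lang: sum(sources.values()) for lang, sources in WORDS_M.items()}
grand_total = sum(totals.values())
shares = {lang: total / grand_total for lang, total in totals.items()}

# Fraction of the Spanish data that comes from the web crawl.
crawler_share_es = WORDS_M["Spanish"]["Medical Crawler"] / totals["Spanish"]

for lang in WORDS_M:
    print(f"{lang}: {totals[lang]:.2f}M words ({shares[lang]:.1%})")
print(f"Spanish words from Medical Crawler: {crawler_share_es:.1%}")
```

The crawler share is worth computing because the limitations section of the card singles out web-crawled text as a source of bias.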
+
+ # Limitations and Bias
+
+ <p>EriBERTa is currently optimized for masked language modeling to perform the Fill Mask task. While its potential for fine-tuning on downstream tasks such as Named Entity Recognition (NER) and Text Classification has been evaluated, it is recommended to validate and test the model for specific applications before deploying it in production to ensure its effectiveness and reliability.</p>
+
+ <p>Due to the scarcity of medical-clinical corpora, the EriBERTa model has been trained on a corpus gathered from multiple sources, including web crawling. Thus, the employed corpora may not encompass all possible linguistic and contextual variations present in clinical language. Consequently, the model may exhibit limitations when applied to specific clinical subdomains or rare medical conditions not well-represented in the training data.</p>
+
+ ## Biases
+
+ <ul>
+ <li><strong>Data Collection Bias:</strong> The training data for EriBERTa was collected from various sources, some of them using web crawling techniques. This method may introduce biases related to the prevalence of certain types of content, perspectives, and language usage patterns. Consequently, the model might reflect and propagate these biases in its predictions.</li>
+ <li><strong>Demographic and Linguistic Bias:</strong> Given that the web-sourced corpus may not equally represent all demographic groups or linguistic nuances, the model may perform disproportionately well for certain populations while underperforming for others. This could lead to disparities in the quality of clinical data processing and information retrieval across different patient groups.</li>
+ <li><strong>Unexamined Ethical Considerations:</strong> As of now, no comprehensive measures have been taken to systematically evaluate the ethical implications and biases embedded in EriBERTa. While we are committed to addressing these issues, the current version of the model may inadvertently perpetuate existing biases and ethical concerns inherent in the data.</li>
+ </ul>
+
+ ## Disclaimer
+ <p>EriBERTa has not been designed or developed to be used as a medical device. Any output should be verified by a Healthcare Professional, and no direct diagnosis should be claimed. The model's output may not always be completely reliable. Due to the nature of language models, predictions may be incorrect or biased.</p>
+ <p>We do not take any liability for the use of this model, and it should ideally be fine-tuned and tested before application. It must not be used as a medical tool or for any critical decision-making processes without thorough validation and supervision by qualified professionals.</p>
+
+
+ # Citing information
+
+ ```bibtex
+ @misc{delaiglesia2023eriberta,
+   title={{EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing}},
+   author={Iker De la Iglesia and Aitziber Atutxa and Koldo Gojenola and Ander Barrena},
+   year={2023},
+   eprint={2306.07373},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "architectures": [
+     "RobertaForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.5.1",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 64000
+ }
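The parameter count can be estimated directly from these config values. The sketch below assumes the usual RoBERTa masked-LM layout (word, position and token-type embeddings with a LayerNorm, 12 encoder layers with biased linear projections, no pooler, and an LM head whose output matrix is tied to the word embeddings); `count_parameters` is a hypothetical helper, not a `transformers` function.

```python
# Only the config fields that affect the parameter count.
config = {
    "vocab_size": 64000,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
    "max_position_embeddings": 514,
    "type_vocab_size": 1,
}


def count_parameters(cfg: dict) -> int:
    """Estimate trainable parameters of a RoBERTa-style masked LM."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h  # weight + bias

    # Word, position and token-type embedding tables plus their LayerNorm.
    rows = cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    embeddings = rows * h + layer_norm

    # Query, key, value and output projections (with biases) plus LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Two feed-forward projections (with biases) plus LayerNorm.
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)

    # LM head: dense + LayerNorm + output bias; output weights assumed tied.
    lm_head = (h * h + h) + layer_norm + cfg["vocab_size"]
    return embeddings + encoder + lm_head


total = count_parameters(config)
print(f"estimated parameters: {total:,}")
print(f"estimated float32 size: {total * 4:,} bytes")
```

Under these assumptions the estimate comes to roughly 135M parameters, somewhat above the "~125M" quoted in the model card; the 64k vocabulary is a plausible reason, since the embedding table alone accounts for about 49M. At 4 bytes per parameter the estimate is also within about 0.1 MB of the 541,124,523-byte `pytorch_model.bin` listed in this commit, which suggests float32 weights.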
eriberta_icon.png ADDED
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dd131e0c2f50bd62d655308807047ec71a6f373ee1686f54ad371e8695f485a4
+ size 541124523
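The three lines above are a Git LFS pointer rather than the weights themselves: they record the hash algorithm, the digest, and the byte size of the real file. A minimal sketch of how a downloaded `pytorch_model.bin` could be checked against the pointer follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for illustration, not part of Git LFS or `huggingface_hub`.

```python
import hashlib

# Pointer contents copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:dd131e0c2f50bd62d655308807047ec71a6f373ee1686f54ad371e8695f485a4
size 541124523
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split 'key value' lines into a dict and unpack the oid field."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Stream the file at `path` and compare its size and digest to the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
```

After downloading the weights, `matches_pointer("pytorch_model.bin", pointer)` would report whether the local copy is complete and uncorrupted. The file is read in chunks so that a file of this size never has to sit in memory at once.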
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
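Read together with `bos_token_id`, `pad_token_id` and `eos_token_id` in config.json, this map implies `<s>` = 0, `<pad>` = 1 and `</s>` = 2, with `<s>` and `</s>` doubling as the classifier and separator tokens. The sketch below shows how a single sequence would be framed and padded under that convention; `frame_and_pad` is a hypothetical helper and the content ids in the usage note are made up, since the real tokenizer does this work itself.

```python
# Special-token ids taken from config.json.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def frame_and_pad(content_ids, max_len: int):
    """Wrap ids as <s> ... </s>, truncate to max_len, pad with <pad>.

    Returns (input_ids, attention_mask), both of length max_len.
    """
    if max_len < 2:
        raise ValueError("max_len must leave room for <s> and </s>")
    # Keep at most max_len - 2 content tokens so both markers always fit.
    ids = [BOS_ID] + list(content_ids)[: max_len - 2] + [EOS_ID]
    attention_mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, attention_mask + [0] * padding
```

For example, `frame_and_pad([10, 11, 12], 8)` yields eight ids beginning with 0 and ending in three padding 1s, with the attention mask set to 0 over the padded positions.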
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": false, "errors": "replace", "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "special_tokens_map_file": null, "name_or_path": "eri/roBeRTa-osaki+bsc"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff