Update README.md

README.md (CHANGED)

@@ -20,45 +20,40 @@ widget:

<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Tokenization and model pretraining](#tokenization-and-model-pretraining)
- [Training corpora and preprocessing](#training-corpora-and-preprocessing)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
- [Contact information](#contact-information)
- [Copyright](#copyright)
- [Licensing information](#licensing-information)
- [Funding](#funding)
- [Citation information](#citation-information)
- [Disclaimer](#disclaimer)

</details>

## Model description

Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).

## Intended uses and limitations

The model is ready to use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section). However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.

## How to use

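A minimal sketch of querying the model for the Fill Mask task with the Hugging Face `transformers` pipeline; the Hub identifier and the example sentence below are illustrative assumptions rather than values taken from this card:

```python
from transformers import pipeline

# Assumed Hub identifier -- substitute the identifier of this model card.
model_id = "PlanTL-GOB-ES/roberta-base-biomedical-es"

# RoBERTa-style checkpoints use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model=model_id)

# Illustrative clinical Spanish sentence, not taken from the training corpora.
for pred in fill_mask("El único antecedente a reseñar era la <mask> arterial."):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```

Each prediction is a candidate token for the masked position together with its score.
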
## Limitations and bias

At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training

### Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
@@ -96,8 +91,7 @@ The result is a medium-size biomedical corpus for Spanish composed of about 963M

| PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |

## Evaluation

The model has been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:

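Below is a rough, self-contained sketch of what token-classification fine-tuning of such a checkpoint can look like with Hugging Face `transformers`. The Hub identifier, label scheme, and toy sentences are illustrative assumptions only; they are not the clinical datasets or the official fine-tuning scripts referenced in this card:

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

model_id = "PlanTL-GOB-ES/roberta-base-biomedical-es"  # assumed identifier
label_list = ["O", "B-DISEASE", "I-DISEASE"]           # assumed label scheme

# RoBERTa byte-level BPE needs add_prefix_space=True for pre-split words.
tokenizer = AutoTokenizer.from_pretrained(model_id, add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, num_labels=len(label_list)
)

# Two toy sentences with word-level tags, standing in for a real NER corpus.
raw = Dataset.from_dict({
    "tokens": [
        ["El", "paciente", "presenta", "neumonía", "."],
        ["Sin", "antecedentes", "de", "diabetes", "mellitus", "."],
    ],
    "ner_tags": [
        [0, 0, 0, 1, 0],
        [0, 0, 0, 1, 2, 0],
    ],
})

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        labels, previous = [], None
        for wid in word_ids:
            # Label only the first sub-token of each word; mask the rest (-100).
            labels.append(-100 if wid is None or wid == previous else tags[wid])
            previous = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

train_ds = raw.map(tokenize_and_align, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-sketch", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

The fine-tuning scripts linked in the official repository remain the authoritative reference; this block only illustrates the general workflow.
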
@@ -122,23 +116,22 @@ The fine-tuning scripts can be found in the official GitHub [repository](https:/

## Additional information

### Author

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

### Contact information

For further information, send an email to <plantl-gob-es@bsc.es>

### Copyright

Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

### Citation information

If you use these models, please cite our work:

```bibtex
@@ -164,13 +157,6 @@ If you use these models, please cite our work:
  abstract = "This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. As main results, our models are superior across the NER tasks, rendering them more convenient for clinical NLP applications. Furthermore, our findings indicate that when enough data is available, pre-training from scratch is better than continual pre-training when tested on clinical tasks, raising an exciting research question about which approach is optimal. Our models and fine-tuning scripts are publicly available at HuggingFace and GitHub.",
}
```

### Disclaimer