EriBERTa
A Bilingual Pre-Trained Language Model
for Clinical Natural Language Processing


We introduce EriBERTa, a bilingual domain-specific language model pre-trained on extensive medical and clinical corpora. EriBERTa outperforms previous Spanish language models in the clinical domain at understanding medical texts and extracting meaningful information from them. It also shows promising transfer learning abilities, allowing knowledge to transfer from one language to the other, which is particularly valuable given the scarcity of Spanish clinical data.

How to Get Started with the Model

You can load the model using:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("HiTZ/EriBERTa-base")
model = AutoModelForMaskedLM.from_pretrained("HiTZ/EriBERTa-base")
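
Since the model is pre-trained for masked language modeling, a quick way to try it is the `fill-mask` pipeline from `transformers`. The snippet below is a minimal sketch, not part of the original card: the helper names `insert_mask` and `fill_mask`, the `[MASK]` placeholder convention, and the Spanish example sentence are all illustrative assumptions. The mask token is read from the tokenizer rather than hard-coded, and the sentence is expected to contain exactly one placeholder.

```python
def insert_mask(template: str, mask_token: str, placeholder: str = "[MASK]") -> str:
    """Swap a human-readable placeholder for the tokenizer's real mask token."""
    return template.replace(placeholder, mask_token)


def fill_mask(template: str, model_id: str = "HiTZ/EriBERTa-base", top_k: int = 5):
    """Return (token, score) pairs for a sentence with exactly one placeholder."""
    # Imported lazily so insert_mask stays usable without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of assuming its spelling.
    text = insert_mask(template, unmasker.tokenizer.mask_token)
    predictions = unmasker(text, top_k=top_k)
    return [(p["token_str"].strip(), p["score"]) for p in predictions]


# Example call (hypothetical sentence; downloads the model weights on first use):
# for token, score in fill_mask("El paciente fue diagnosticado con [MASK]."):
#     print(f"{token}\t{score:.3f}")
```

The predictions are raw language-model outputs and should be treated as a sanity check of the checkpoint, not as clinical suggestions.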

Model Description

  • Developed by: Iker De la Iglesia, Aitziber Atutxa, Koldo Gojenola, and Ander Barrena
  • Contact: Iker De la Iglesia and Ander Barrena
  • Language(s) (NLP): English, Spanish
  • License: apache-2.0
  • Funding:
    • The Spanish Ministry of Science and Innovation, MCIN/AEI/10.13039/501100011033/FEDER projects:
      • Proyectos de Generación de Conocimiento 2022 (EDHIA PID2022-136522OB-C22)
      • DOTT-HEALTH/PAT-MED PID2019-543106942RB-C31.
      • EU NextGeneration EU/PRTR (DeepR3 TED2021-130295B-C31, ANTIDOTE PCI2020-120717-2 EU ERA-Net CHIST-ERA).
    • Basque Government:
      • IXA IT1570-22.

Model Details

Pre-training settings for EriBERTa-base.

Setting            Value
Parameters         ~125M
Vocabulary size    64k
Sequence length    512
Tokens per step    2M
Steps              125k
Total tokens       4.5B
Scheduler          Linear with warm-up
Peak LR            2.683e-4
Warm-up steps      7.5k
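
To make the scheduler settings above concrete, the following sketch computes the learning rate at a given step for a linear schedule with warm-up. It uses the peak LR, warm-up steps, and total steps from the settings; the function name `linear_warmup_lr` is hypothetical, and the decay to exactly zero at the last step is an assumption (it mirrors the common behaviour of `get_linear_schedule_with_warmup` in `transformers`), since the card does not state the final learning rate.

```python
PEAK_LR = 2.683e-4      # from the pre-training settings
WARMUP_STEPS = 7_500    # from the pre-training settings
TOTAL_STEPS = 125_000   # from the pre-training settings


def linear_warmup_lr(step: int,
                     peak_lr: float = PEAK_LR,
                     warmup_steps: int = WARMUP_STEPS,
                     total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at `step` for linear warm-up followed by linear decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak during warm-up.
        return peak_lr * step / warmup_steps
    # Assumption: decay linearly from the peak down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for s in (0, 3_750, 7_500, 66_250, 125_000):
        print(s, linear_warmup_lr(s))
```

With these settings the warm-up covers 7.5k of 125k steps, i.e. 6% of training.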

Training Data

Data sources and word counts by language.

Language   Source            Words
English    ClinicalTrials    127.4M
English    EMEA              12M
English    PubMed            968.4M
English    MIMIC-III         206M
Spanish    EMEA              13.6M
Spanish    PubMed            8.4M
Spanish    Medical Crawler   918M
Spanish    SPACC             350K
Spanish    UFAL              10.5M
Spanish    WikiMed           5.2M
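
The per-language balance of the corpus can be checked by summing the word counts listed above. This is a small illustrative sketch: the dictionary simply transcribes those figures in millions of words (350K is written as 0.35), and the variable names are hypothetical.

```python
# Word counts in millions, transcribed from the data sources listed above.
WORDS_MILLIONS = {
    "English": {
        "ClinicalTrials": 127.4,
        "EMEA": 12.0,
        "PubMed": 968.4,
        "MIMIC-III": 206.0,
    },
    "Spanish": {
        "EMEA": 13.6,
        "PubMed": 8.4,
        "Medical Crawler": 918.0,
        "SPACC": 0.35,
        "UFAL": 10.5,
        "WikiMed": 5.2,
    },
}

# Total words per language and overall.
totals = {lang: sum(sources.values()) for lang, sources in WORDS_MILLIONS.items()}
grand_total = sum(totals.values())

# Share of the Spanish data that comes from the Medical Crawler source.
crawler_share = WORDS_MILLIONS["Spanish"]["Medical Crawler"] / totals["Spanish"]

if __name__ == "__main__":
    for lang, words in totals.items():
        print(f"{lang}: {words:.2f}M words ({words / grand_total:.1%} of corpus)")
    print(f"Medical Crawler share of Spanish data: {crawler_share:.1%}")
```

By these figures the English portion is somewhat larger than the Spanish one, and roughly 96% of the Spanish words come from the Medical Crawler source.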

Limitations and Bias

EriBERTa is currently optimized for masked language modeling to perform the Fill Mask task. While its potential for fine-tuning on downstream tasks such as Named Entity Recognition (NER) and Text Classification has been evaluated, it is recommended to validate and test the model for specific applications before deploying it in production to ensure its effectiveness and reliability.

Due to the scarcity of medical-clinical corpora, the EriBERTa model has been trained on a corpus gathered from multiple sources, including web crawling. Thus, the employed corpora may not encompass all possible linguistic and contextual variations present in clinical language. Consequently, the model may exhibit limitations when applied to specific clinical subdomains or rare medical conditions not well-represented in the training data.

Biases

  • Data Collection Bias: The training data for EriBERTa was collected from various sources, some of them using web crawling techniques. This method may introduce biases related to the prevalence of certain types of content, perspectives, and language usage patterns. Consequently, the model might reflect and propagate these biases in its predictions.
  • Demographic and Linguistic Bias: Given that the web-sourced corpus may not equally represent all demographic groups or linguistic nuances, the model may perform disproportionately well for certain populations while underperforming for others. This could lead to disparities in the quality of clinical data processing and information retrieval across different patient groups.
  • Unexamined Ethical Considerations: As of now, no comprehensive measures have been taken to systematically evaluate the ethical implications and biases embedded in EriBERTa. While we are committed to addressing these issues, the current version of the model may inadvertently perpetuate existing biases and ethical concerns inherent in the data.

Disclaimer

EriBERTa has not been designed or developed to be used as a medical device. Any output should be verified by a healthcare professional, and no direct diagnosis should be claimed. Because of the nature of language models, its predictions may be incorrect or biased and are not always reliable.

We accept no liability for the use of this model. It should ideally be fine-tuned and tested before application, and it must not be used as a medical tool or for any critical decision-making process without thorough validation and supervision by qualified professionals.

Citing information

@misc{delaiglesia2023eriberta,
      title={{EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing}}, 
      author={Iker De la Iglesia and Aitziber Atutxa and Koldo Gojenola and Ander Barrena},
      year={2023},
      eprint={2306.07373},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}