DiLBERT (Disease Language BERT)

The objective of this work was to obtain a specialized disease-related language model, trained from scratch.
We created a pre-training corpus starting from ICD-11 entities, and enriched it with documents from PubMed and Wikipedia related to the same entities.
Fine-tuning results show that DiLBERT achieves accuracy comparable to or higher than other general-purpose or in-domain models (e.g., BioClinicalBERT, RoBERTa, XLNet) on various classification tasks.

Model released with the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP".
To summarize the practical implications of our work: we pre-trained and fine-tuned a domain-specific BERT model on a small corpus, obtaining performance comparable to or better than state-of-the-art models.
This approach may also simplify the development of models for languages other than English, because of the small amount of data needed for training.

Composition of the pretraining corpus

Source	Documents	Words
ICD-11 descriptions	34,676	1.0 million
PubMed Title and Abstracts	852,550	184.6 million
Wikipedia pages	37,074	6.1 million
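
To put the size of the corpus in perspective, the short sketch below sums the figures from the table and computes each source's share of the total word count. The variable names are illustrative only, and the word counts are the rounded values reported above, so the results are approximate.

```python
# Figures copied from the table above: (documents, words in millions).
# Word counts are rounded in the source, so totals are approximate.
corpus = {
    "ICD-11": (34_676, 1.0),
    "PubMed": (852_550, 184.6),
    "Wikipedia": (37_074, 6.1),
}

total_docs = sum(docs for docs, _ in corpus.values())
total_words = sum(words for _, words in corpus.values())

# Fraction of all words contributed by each source.
word_share = {name: words / total_words for name, (_, words) in corpus.items()}

print(f"Total documents: {total_docs:,}")
print(f"Total words: {total_words:.1f} million")
for name, share in word_share.items():
    print(f"{name}: {share:.1%} of words")
```

With these figures the corpus contains 924,300 documents and roughly 191.7 million words, about 96% of which come from PubMed.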

Main repository

For more details, see the main repository: https://github.com/KevinRoitero/dilbert

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Download the tokenizer and the masked-language-model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForMaskedLM.from_pretrained("beatrice-portelli/DiLBERT")
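
Because the checkpoint is loaded as a masked language model, a quick way to try it is the `fill-mask` pipeline from `transformers`. The following is a minimal sketch, not part of the original release: the helper names `fill_mask` and `format_predictions` and the example sentence are hypothetical. The mask token is read from the tokenizer rather than hard-coded, since its exact string is an assumption otherwise.

```python
def format_predictions(predictions):
    """Turn fill-mask pipeline results into 'token<TAB>score' strings."""
    return [f"{p['token_str']}\t{p['score']:.3f}" for p in predictions]


def fill_mask(template, top_k=5, model_name="beatrice-portelli/DiLBERT"):
    """Fill the single {mask} placeholder in `template` with the model's top guesses.

    Hypothetical helper: the import is done lazily so that the formatting
    function above can be used without loading the library.
    """
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)
    # Use the tokenizer's own mask token instead of assuming "[MASK]".
    text = template.format(mask=unmasker.tokenizer.mask_token)
    # For a single mask, the pipeline returns a list of dicts with
    # "score", "token", "token_str" and "sequence" keys.
    return unmasker(text, top_k=top_k)
```

A possible call would be `print("\n".join(format_predictions(fill_mask("Asthma is a chronic disease of the {mask}."))))`; the first call downloads the model weights.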

How to cite

@article{roitero2021dilbert,
  title={{DiLBERT}: Cheap Embeddings for Disease Related Medical NLP},
  author={Roitero, Kevin and Portelli, Beatrice and Popescu, Mihai Horia and Della Mea, Vincenzo},
  journal={IEEE Access},
  volume={},
  pages={},
  year={2021},
  publisher={IEEE},
  note = {In Press}
}