DiLBERT (Disease Language BERT)

The objective of this model was to obtain a specialized disease-related language, trained from scratch.
We created a pre-training corpora starting from ICD-11 entities, and enriched it with documents from PubMed and Wikipedia related to the same entities.
Results of finetuning show that DiLBERT leads to comparable or higher accuracy scores on various classification tasks compared with other general-purpose or in-domain models (e.g., BioClinicalBERT, RoBERTa, XLNet).

Model released with the paper "DiLBERT: Cheap Embeddings for Disease Related Medical NLP".
To summarize the practical implications of our work: we pre-trained and fine-tuned a domain specific BERT model on a small corpora, with comparable or better performance than state-of-the-art models.
This approach may also simplify the development of models for languages different from English, due to the minor quantity of data needed for training.

Composition of the pretraining corpus

Source Documents Words
ICD-11 descriptions 34,676 1.0 million
PubMed Title and Abstracts 852,550 184.6 million
Wikipedia pages 37,074 6.1 million

Main repository

For more details check the main repo https://github.com/KevinRoitero/dilbert

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForMaskedLM.from_pretrained("beatrice-portelli/DiLBERT")

How to cite

@article{roitero2021dilbert,
  title={{DilBERT}: Cheap Embeddings for Disease Related Medical NLP},
  author={Roitero, Kevin and Portelli, Beatrice and Popescu, Mihai Horia and Della Mea, Vincenzo},
  journal={IEEE Access},
  volume={},
  pages={},
  year={2021},
  publisher={IEEE},
  note = {In Press}
}
Downloads last month
22
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.