beatrice-portelli committed
Commit f59e110
1 Parent(s): fad8bd8

Update README.md

Files changed (1):
  1. README.md +0 -43

README.md CHANGED

---
language:
- en
tags:
- medical
- disease
- classification
---

# DiLBERT (Disease Language BERT)

The objective of this model was to obtain a specialized disease-related language model, trained **from scratch**. <br>
We created a pre-training corpus starting from **ICD-11** entities and enriched it with documents from **PubMed** and **Wikipedia** related to the same entities. <br>
Fine-tuning results show that DiLBERT achieves comparable or higher accuracy on various classification tasks than other general-purpose or in-domain models (e.g., BioClinicalBERT, RoBERTa, XLNet).

Model released with the paper "**DiLBERT: Cheap Embeddings for Disease Related Medical NLP**". <br>
To summarize the practical implications of our work: we pre-trained and fine-tuned a domain-specific BERT model on a small corpus, with comparable or better performance than state-of-the-art models.
This approach may also simplify the development of models for languages other than English, thanks to the smaller amount of data needed for training.
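
As a minimal sketch of how such a fine-tuning setup could look (not the exact configuration or datasets used in the paper: the label count, example text, and classification head below are placeholders for illustration):

```python
# Sketch: fine-tuning-style usage of DiLBERT for sequence classification.
# Assumptions: 2 labels and a toy input; not the paper's experimental setup.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "beatrice-portelli/DiLBERT",
    num_labels=2,  # placeholder: set to the number of classes in your task
)

# Tokenize a toy clinical sentence and get class logits.
inputs = tokenizer(
    "Patient presents with persistent dry cough and fever.",
    return_tensors="pt",
    truncation=True,
)
logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
```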

### Composition of the pretraining corpus

| Source | Documents | Words |
|---|---:|---:|
| ICD-11 descriptions | 34,676 | 1.0 million |
| PubMed titles and abstracts | 852,550 | 184.6 million |
| Wikipedia pages | 37,074 | 6.1 million |

### Main repository

For more details, check the main repository: https://github.com/KevinRoitero/dilbert

# Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForMaskedLM.from_pretrained("beatrice-portelli/DiLBERT")
```
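
As a quick follow-up usage example (a sketch, not part of the original README: the example sentence is invented and assumes the standard BERT `[MASK]` token), the masked-LM checkpoint can be exercised through the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load DiLBERT in a fill-mask pipeline (same checkpoint as above).
fill_mask = pipeline(
    "fill-mask",
    model="beatrice-portelli/DiLBERT",
    tokenizer="beatrice-portelli/DiLBERT",
)

# Toy example: predict the masked disease-related token.
for prediction in fill_mask("The patient was diagnosed with [MASK] diabetes."):
    print(prediction["token_str"], round(prediction["score"], 3))
```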