---
language:
- en
tags:
- medical
- disease
- classification
---

# DiLBERT (Disease Language BERT)

The objective of this model was to obtain a specialized disease-related language model, trained **from scratch**. <br>
We created a pre-training corpus starting from **ICD-11** entities and enriched it with documents from **PubMed** and **Wikipedia** related to the same entities. <br>
Fine-tuning results show that DiLBERT achieves comparable or higher accuracy scores than other general-purpose or in-domain models (e.g., BioClinicalBERT, RoBERTa, XLNet) on various classification tasks.

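For classification tasks like the ones mentioned above, DiLBERT can be loaded with a sequence-classification head via the standard Transformers API. This is a generic sketch, not the exact fine-tuning setup from the paper (`num_labels` is a placeholder):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder label count; each downstream task defines its own label set
num_labels = 2

tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "beatrice-portelli/DiLBERT",
    num_labels=num_labels,
)
# `model` now has a randomly initialized classification head on top of DiLBERT
# and can be fine-tuned with a standard training loop or transformers.Trainer.
```
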
Model released with the paper "**DiLBERT: Cheap Embeddings for Disease Related Medical NLP**". <br>
To summarize the practical implications of our work: we pre-trained and fine-tuned a domain-specific BERT model on a small corpus, with comparable or better performance than state-of-the-art models.
This approach may also simplify the development of models for languages other than English, thanks to the smaller quantity of data needed for training.

### Composition of the pretraining corpus

| Source | Documents | Words |
|---|---:|---:|
| ICD-11 descriptions | 34,676 | 1.0 million |
| PubMed titles and abstracts | 852,550 | 184.6 million |
| Wikipedia pages | 37,074 | 6.1 million |

### Main repository

For more details, check the main repository: https://github.com/KevinRoitero/dilbert

# Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the DiLBERT tokenizer and masked-language model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForMaskedLM.from_pretrained("beatrice-portelli/DiLBERT")
```

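To sanity-check the loaded model, one can ask it for masked-token predictions. The following is a minimal sketch building on the snippet above (the example sentence and the top-5 choice are ours, not from the model card):

```python
import torch

# Hypothetical example sentence; any text containing one [MASK] token works
text = f"Diabetes is a chronic {tokenizer.mask_token} characterized by high blood sugar."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the masked token and list the 5 most likely replacements
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```
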
# How to cite

```
@article{roitero2021dilbert,
  title={{DiLBERT}: Cheap Embeddings for Disease Related Medical NLP},
  author={Roitero, Kevin and Portelli, Beatrice and Popescu, Mihai Horia and Della Mea, Vincenzo},
  journal={IEEE Access},
  volume={},
  pages={},
  year={2021},
  publisher={IEEE},
  note={In Press}
}
```