beatrice-portelli committed
Commit f59e110
1 Parent(s): fad8bd8

Update README.md

Files changed (1):
  1. README.md +0 -43

README.md CHANGED

---
language:
- en
tags:
- medical
- disease
- classification
---

# DiLBERT (Disease Language BERT)

The objective of this model was to obtain a specialized disease-related language model, trained **from scratch**. <br>
We created a pre-training corpus starting from **ICD-11** entities and enriched it with documents from **PubMed** and **Wikipedia** related to the same entities. <br>
Fine-tuning results show that DiLBERT achieves comparable or higher accuracy on various classification tasks than other general-purpose or in-domain models (e.g., BioClinicalBERT, RoBERTa, XLNet).

Model released with the paper "**DiLBERT: Cheap Embeddings for Disease Related Medical NLP**". <br>
To summarize the practical implications of our work: we pre-trained and fine-tuned a domain-specific BERT model on a small corpus, with comparable or better performance than state-of-the-art models.
This approach may also simplify the development of models for languages other than English, thanks to the smaller amount of data needed for training.
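
As a minimal sketch of how such a fine-tuning setup could look (not the exact configuration or datasets used in the paper: the label count, example text, and classification head below are placeholders for illustration):

```python
# Sketch: fine-tuning-style usage of DiLBERT for sequence classification.
# Assumptions: 2 labels and a toy input; not the paper's experimental setup.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "beatrice-portelli/DiLBERT",
    num_labels=2,  # placeholder: set to the number of classes in your task
)

# Tokenize a toy clinical sentence and get class logits.
inputs = tokenizer(
    "Patient presents with persistent dry cough and fever.",
    return_tensors="pt",
    truncation=True,
)
logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
```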

### Composition of the pretraining corpus

| Source | Documents | Words |
|---|---:|---:|
| ICD-11 descriptions | 34,676 | 1.0 million |
| PubMed titles and abstracts | 852,550 | 184.6 million |
| Wikipedia pages | 37,074 | 6.1 million |

### Main repository

For more details, check the main repository: https://github.com/KevinRoitero/dilbert

# Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beatrice-portelli/DiLBERT")
model = AutoModelForMaskedLM.from_pretrained("beatrice-portelli/DiLBERT")
```
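
As a quick follow-up usage example (a sketch, not part of the original README: the example sentence is invented and assumes the standard BERT `[MASK]` token), the masked-LM checkpoint can be exercised through the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load DiLBERT in a fill-mask pipeline (same checkpoint as above).
fill_mask = pipeline(
    "fill-mask",
    model="beatrice-portelli/DiLBERT",
    tokenizer="beatrice-portelli/DiLBERT",
)

# Toy example: predict the masked disease-related token.
for prediction in fill_mask("The patient was diagnosed with [MASK] diabetes."):
    print(prediction["token_str"], round(prediction["score"], 3))
```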