amindada committed on
Commit
96f630d
1 Parent(s): 78e104f

Create README.md

  1. README.md +77 -0
README.md ADDED
@@ -0,0 +1,77 @@
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ {}
+ ---
+
+ # GeBERTa
+
+ <!-- Provide a quick summary of what the model is/does. -->
+ GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM.
+ The models range in size from 122M to 750M parameters.
+
+
+ ## Model details
+
+ The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and large models use a 50k-token vocabulary,
+ while the xlarge model uses a 128k-token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps
+ and have a maximum sequence length of 512 tokens.
+
+
+ ## Dataset
+
+ The pre-training dataset consists of documents from different domains:
+
+ | Domain | Dataset | Data Size | #Docs | #Tokens |
+ | -------- | ----------- | --------- | ------ | ------- |
+ | Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
+ | Formal | News | 28GB | 12,305,326 | 6.1B |
+ | Formal | GC4 | 90GB | 31,669,772 | 19.4B |
+ | Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
+ | Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
+ | Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
+ | Medical | Smaller public datasets | 253MB | 179,776 | 50M |
+ | Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
+ | Medical | Medical Dissertations | 1.4GB | 14,496 | 295M |
+ | Medical | PubMed abstracts | 8.5GB | 21,044,382 | 1.7B |
+ | Medical | MIMIC-III | 2.6GB | 24,221,834 | 695M |
+ | Medical | PMC-Patients-ReCDS | 2.1GB | 1,743,344 | 414M |
+ | Literature | German Fiction | 1.1GB | 3,219 | 243M |
+ | Literature | English books | 7.1GB | 11,038 | 1.6B |
+ | - | Total | 167GB | 116,079,769 | 35.8B |
+
+
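To make the domain mix in the table easier to read, the following minimal sketch sums the token counts per domain and prints each domain's share. The figures are copied from the table (converted to billions); the names `TOKENS_B` and `domain_shares` are illustrative and not part of the GeBERTa release.

```python
# Token counts in billions, copied from the #Tokens column of the table.
# Grouping and variable names are illustrative assumptions.
TOKENS_B = {
    "Formal": [1.9, 6.1, 19.4],
    "Informal": [1.3, 0.428],
    "Legal": [1.0],
    "Medical": [0.05, 0.682, 0.295, 1.7, 0.695, 0.414],
    "Literature": [0.243, 1.6],
}


def domain_shares(tokens_by_domain):
    """Return {domain: (tokens_in_billions, share_of_total)}."""
    totals = {domain: sum(values) for domain, values in tokens_by_domain.items()}
    grand_total = sum(totals.values())
    return {domain: (total, total / grand_total) for domain, total in totals.items()}


if __name__ == "__main__":
    for domain, (tokens, share) in domain_shares(TOKENS_B).items():
        print(f"{domain:<10} {tokens:6.2f}B  {share:6.1%}")
```

Because the per-dataset counts in the table are rounded, the shares are approximate. On these figures, formal text accounts for roughly three quarters of all tokens and medical text for about a tenth.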
+ ## Benchmark
+
+ In a comprehensive benchmark, we evaluated existing German models and our own. The benchmark included a variety of task types, such as question answering,
+ classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets.
+ When a dataset provided training, development, and test sets, we used them as given.
+
+
+
+ Where such sets were not available, we randomly split the data into 80% for training, 10% for validation, and 10% for testing.
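A random 80/10/10 split of this kind could be sketched as follows. This is an illustration only: the function name `split_80_10_10`, the fixed seed, and the absence of stratification are assumptions, since the text does not say how the split was implemented.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle a copy of `examples` and split it into train/validation/test.

    The seed value is an arbitrary assumption chosen for reproducibility.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)

    # Integer arithmetic avoids floating-point surprises in the cut points.
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, so no example is dropped
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_80_10_10(range(1000))
    print(len(train), len(val), len(test))
```

Assigning the remainder to the test set means every example lands in exactly one split, even when the dataset size is not divisible by ten.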
+ The following table presents the F1 scores:
+
+
+
+ | Model | [GE14](https://huggingface.co/datasets/germeval_14) | [GQuAD](https://huggingface.co/datasets/deepset/germanquad) | [GE18](https://huggingface.co/datasets/philschmid/germeval18) | TS | [GGP](https://github.com/JULIELab/GGPOnc) | GRAS<sup>1</sup> | [JS](https://github.com/JULIELab/jsyncc) | [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) | Avg |
+ |:---------------------:|:--------:|:----------:|:--------:|:--------:|:-------:|:------:|:--------:|:------:|:------:|
+ | GBERT<sub>large</sub> | 88.48±0.23 | 81.51±0.84 | 54.37±1.65 | 73.60±0.61 | **79.17**±0.14 | 69.28±0.80 | 76.32±4.42 | 90.29±0.15 | 76.63±0.63 |
+ | GELECTRA<sub>large</sub> | 88.39±0.13 | 80.51±0.41 | 55.41±1.54 | 73.84±0.86 | 79.09±0.09 | **70.16**±0.92 | 73.73±2.35 | 89.83±0.27 | 76.37±0.69 |
+ | GeBERTa<sub>large</sub> | 88.84±0.18 | 82.52±0.59 | 53.76±1.86 | 75.32±0.53 | 78.35±0.08 | 70.02±1.34 | 82.16±2.36 | 90.39±0.24 | 77.67±0.69 |
+ | GeBERTa<sub>xlarge</sub> | **89.04**±0.26 | **85.05**±0.63 | **55.80**±1.42 | **76.25**±0.70 | 76.71±0.08 | 67.92±1.00 | **82.42**±4.70 | **90.63**±0.21 | **77.98**±0.62 |
+
+ <sup>1</sup>Not yet published, but described in the [MedBERT.de paper](https://arxiv.org/abs/2303.08179).
+
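The Avg column is consistent with an unweighted mean over the eight task scores. That reading is inferred from the numbers rather than stated in the text, and the sketch below simply checks it against the reported values. The dictionary keys and the helper name `macro_average` are illustrative.

```python
from statistics import mean

# Per-task F1 scores in table order: GE14, GQuAD, GE18, TS, GGP, GRAS, JS, DROC.
SCORES = {
    "GBERT-large": [88.48, 81.51, 54.37, 73.60, 79.17, 69.28, 76.32, 90.29],
    "GELECTRA-large": [88.39, 80.51, 55.41, 73.84, 79.09, 70.16, 73.73, 89.83],
    "GeBERTa-large": [88.84, 82.52, 53.76, 75.32, 78.35, 70.02, 82.16, 90.39],
    "GeBERTa-xlarge": [89.04, 85.05, 55.80, 76.25, 76.71, 67.92, 82.42, 90.63],
}

# Values reported in the Avg column of the table.
REPORTED_AVG = {
    "GBERT-large": 76.63,
    "GELECTRA-large": 76.37,
    "GeBERTa-large": 77.67,
    "GeBERTa-xlarge": 77.98,
}


def macro_average(task_scores):
    """Unweighted mean across tasks (assumed definition of the Avg column)."""
    return mean(task_scores)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        diff = abs(macro_average(scores) - REPORTED_AVG[model])
        print(f"{model:<15} mean={macro_average(scores):.4f} |diff|={diff:.4f}")
```

Every difference stays below the rounding precision of the table, so all tasks appear to contribute equally to the average regardless of dataset size.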
+ ## Publication
+
+ A publication will follow soon.
+
+ ## Contact
+
+ <amin.dada@uk-essen.de>
+
+
+
+
+