Create README.md
README.md
ADDED
@@ -0,0 +1,77 @@
---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# GeBERTa

<!-- Provide a quick summary of what the model is/does. -->
GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM.
The models range in size from 122M to 750M parameters.

## Model details

The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and large models use a 50k-token vocabulary,
while the xlarge model uses a 128k-token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps
and have a maximum sequence length of 512 tokens.

## Dataset

The pre-training dataset consists of documents from different domains:

| Domain | Dataset | Data Size | #Docs | #Tokens |
| -------- | ----------- | --------- | ------ | ------- |
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medical Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |

## Benchmark

We evaluated existing German models and our own on a comprehensive benchmark. The benchmark covers a variety of task types, such as question answering,
classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets.
When a dataset provided training, development, and test sets, we used them as given.
Otherwise, we randomly split the data into 80% for training, 10% for validation, and 10% for testing.
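
The random 80/10/10 split described above could be sketched as follows. This is a minimal illustration, not the code used for the benchmark: the function name `split_80_10_10`, the fixed seed, and the assumption that the examples fit in a list are all hypothetical.

```python
import random


def split_80_10_10(examples, seed=0):
    """Shuffle a copy of `examples` and cut it into train/validation/test parts.

    Hypothetical helper: the name, the seed, and the rounding rule are
    assumptions made for illustration, not taken from the model card.
    """
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)

    n_train = len(shuffled) * 8 // 10  # 80% for training
    n_val = len(shuffled) // 10        # 10% for validation

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, roughly 10%
    return train, val, test
```

Fixing the seed means every model in a comparison sees the same partition, which matters when no official split exists for a dataset.
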
The following table presents the F1 scores:

| Model | [GE14](https://huggingface.co/datasets/germeval_14) | [GQuAD](https://huggingface.co/datasets/deepset/germanquad) | [GE18](https://huggingface.co/datasets/philschmid/germeval18) | TS | [GGP](https://github.com/JULIELab/GGPOnc) | GRAS<sup>1</sup> | [JS](https://github.com/JULIELab/jsyncc) | [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) | Avg |
|:---------------------:|:--------:|:----------:|:--------:|:--------:|:-------:|:------:|:--------:|:------:|:------:|
| GBERT<sub>large</sub> | 88.48±0.23 | 81.51±0.84 | 54.37±1.65 | 73.60±0.61 | **79.17**±0.14 | 69.28±0.80 | 76.32±4.42 | 90.29±0.15 | 76.63±0.63 |
| GELECTRA<sub>large</sub> | 88.39±0.13 | 80.51±0.41 | 55.41±1.54 | 73.84±0.86 | 79.09±0.09 | **70.16**±0.92 | 73.73±2.35 | 89.83±0.27 | 76.37±0.69 |
| GeBERTa<sub>large</sub> | 88.84±0.18 | 82.52±0.59 | 53.76±1.86 | 75.32±0.53 | 78.35±0.08 | 70.02±1.34 | 82.16±2.36 | 90.39±0.24 | 77.67±0.69 |
| GeBERTa<sub>xlarge</sub> | **89.04**±0.26 | **85.05**±0.63 | **55.80**±1.42 | **76.25**±0.704 | 76.71±0.08 | 67.92±1.00 | **82.42**±4.70 | **90.63**±0.21 | **77.98**±0.62 |

<sup>1</sup>Not yet published, but described in the [MedBERT.de paper](https://arxiv.org/abs/2303.08179).
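
The Avg column appears to be the unweighted mean of the eight task scores. The card does not state this, so treat it as an inference from the numbers. The short check below recomputes it for the two GeBERTa rows; the dictionary names and the helper `average_f1` are illustrative.

```python
from statistics import mean

# Mean F1 per task, copied from the GeBERTa rows of the table above
# (the ± spreads are omitted).
geberta_large = {
    "GE14": 88.84, "GQuAD": 82.52, "GE18": 53.76, "TS": 75.32,
    "GGP": 78.35, "GRAS": 70.02, "JS": 82.16, "DROC": 90.39,
}
geberta_xlarge = {
    "GE14": 89.04, "GQuAD": 85.05, "GE18": 55.80, "TS": 76.25,
    "GGP": 76.71, "GRAS": 67.92, "JS": 82.42, "DROC": 90.63,
}


def average_f1(scores):
    """Unweighted mean over tasks, rounded to two decimals like the table."""
    return round(mean(scores.values()), 2)


print(average_f1(geberta_large))   # 77.67
print(average_f1(geberta_xlarge))  # 77.98
```

Both values match the Avg column, which supports reading it as a plain per-task average rather than one weighted by dataset size.
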

## Publication

A publication will follow soon.

## Contact

<amin.dada@uk-essen.de>