ikim-uk-essen
/

geberta-large

+---
+# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+# Doc / guide: https://huggingface.co/docs/hub/model-cards
+{}
+---
+# GeBERTa
+GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM.
+The models range in size from 122M to 750M parameters.
+## Model details
+The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and large models use a 50k token vocabulary,
+while the xlarge model uses a 128k token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps
+and have a maximum sequence length of 512 tokens.
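As a rough illustration of these details, the following sketch loads a checkpoint with the Hugging Face `transformers` library and encodes text with truncation at the 512-token limit. The repository name `ikim-uk-essen/geberta-large` is an assumption taken from the page header, the helper `embed` is hypothetical, and the packages `transformers`, `torch`, and `sentencepiece` are assumed to be installed.

```python
# Assumption: repository name taken from the page header of this model card.
MODEL_ID = "ikim-uk-essen/geberta-large"
# Maximum sequence length stated in the model details.
MAX_LENGTH = 512


def embed(texts, model_id=MODEL_ID):
    """Return the encoder's last hidden states for a list of German texts.

    Hypothetical helper for illustration, not part of the released models.
    """
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies or trigger a download.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # Pad to the longest text in the batch and cut inputs at 512 tokens.
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch)
    # Shape: (batch size, sequence length, hidden size).
    return outputs.last_hidden_state
```

Calling `embed(["Der Patient klagt über Kopfschmerzen."])` downloads the checkpoint on first use. `AutoModel` returns the bare encoder, so a task-specific head still has to be fine-tuned for classification, NER, or question answering.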
+## Dataset
+The pre-training dataset consists of documents from different domains:
+| Domain | Dataset | Data Size | #Docs | #Tokens |
+| -------- | ----------- | --------- | ------ | ------- |
+| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
+| Formal | News | 28GB | 12,305,326 | 6.1B |
+| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
+| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
+| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
+| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
+| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
+| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
+| Medical | Medical Dissertations | 1.4GB | 14,496 | 295M |
+| Medical | PubMed abstracts | 8.5GB | 21,044,382 | 1.7B |
+| Medical | MIMIC-III | 2.6GB | 24,221,834 | 695M |
+| Medical | PMC-Patients-ReCDS | 2.1GB | 1,743,344 | 414M |
+| Literature | German Fiction | 1.1GB | 3,219 | 243M |
+| Literature | English books | 7.1GB | 11,038 | 1.6B |
+| - | Total | 167GB | 116,079,769 | 35.8B |
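To make the domain mix in the table above easier to read, the token counts can be aggregated per domain. This is a small sketch using only the numbers from the table; the names `TOKENS_B`, `domain_totals`, and `domain_shares` are illustrative and not part of any released code.

```python
# Token counts in billions, copied from the #Tokens column of the table.
TOKENS_B = {
    "Formal": [1.9, 6.1, 19.4],
    "Informal": [1.3, 0.428],
    "Legal": [1.0],
    "Medical": [0.050, 0.682, 0.295, 1.7, 0.695, 0.414],
    "Literature": [0.243, 1.6],
}


def domain_totals(tokens):
    """Sum the token counts (in billions) of all datasets in each domain."""
    return {domain: sum(values) for domain, values in tokens.items()}


def domain_shares(tokens):
    """Return each domain's fraction of the overall token count."""
    totals = domain_totals(tokens)
    grand_total = sum(totals.values())
    return {domain: total / grand_total for domain, total in totals.items()}


totals = domain_totals(TOKENS_B)
shares = domain_shares(TOKENS_B)
for domain in TOKENS_B:
    print(f"{domain:<10} {totals[domain]:6.2f}B  {shares[domain]:6.1%}")
print(f"{'Total':<10} {sum(totals.values()):6.2f}B")
```

The per-domain sums add up to the 35.8B total reported in the last row, with formal text accounting for roughly three quarters of all tokens.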
+## Benchmark
+In a comprehensive benchmark, we evaluated existing German models and our own. The benchmark included a variety of task types, such as question answering,
+classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets.
+When a dataset provided training, development, and test sets, we used them as given.
+Otherwise, we randomly split the data into 80% for training, 10% for validation, and 10% for testing.
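A random 80/10/10 split of this kind could look like the sketch below. The function name `split_80_10_10` and the fixed seed are assumptions made for illustration; the card does not state which seed was used or whether the split was stratified.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle the examples and split them 80/10/10 into train/val/test.

    The seed is an assumption for reproducibility, not a documented value.
    """
    items = list(examples)
    rng = random.Random(seed)
    rng.shuffle(items)

    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    # The test set takes the remainder, so no example is dropped.
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_80_10_10(range(100))
print(len(train), len(val), len(test))
```

Assigning the remainder to the test set keeps every example in exactly one split even when the dataset size is not divisible by ten.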
+The following table presents the F1 scores:
+|         Model         |   [GE14](https://huggingface.co/datasets/germeval_14)   |  [GQuAD](https://huggingface.co/datasets/deepset/germanquad)  |   [GE18](https://huggingface.co/datasets/philschmid/germeval18)   |    TS    |   [GGP](https://github.com/JULIELab/GGPOnc)   |  GRAS<sup>1</sup>  |    [JS](https://github.com/JULIELab/jsyncc)    |  [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release)  |  Avg   |
+|:---------------------:|:--------:|:----------:|:--------:|:--------:|:-------:|:------:|:--------:|:------:|:------:|
+| GBERT<sub>large</sub>             | 88.48±0.23    | 81.51±0.84    | 54.37±1.65   | 73.60±0.61   | **79.17**±0.14   | 69.28±0.80    | 76.32±4.42    | 90.29±0.15    | 76.63±0.63    |
+| GELECTRA<sub>large</sub>          | 88.39±0.13    | 80.51±0.41    | 55.41±1.54   | 73.84±0.86   | 79.09±0.09   | **70.16**±0.92    | 73.73±2.35    | 89.83±0.27    | 76.37±0.69    |
+| GeBERTa<sub>large</sub>   | 88.84±0.18    | 82.52±0.59    | 53.76±1.86   | 75.32±0.53   | 78.35±0.08   | 70.02±1.34    | 82.16±2.36    | 90.39±0.24    | 77.67±0.69    |
+| GeBERTa<sub>xlarge</sub>  | **89.04**±0.26    | **85.05**±0.63    | **55.80**±1.42   | **76.25**±0.70   | 76.71±0.08   | 67.92±1.00    | **82.42**±4.70    | **90.63**±0.21    | **77.98**±0.62    |
+<sup>1</sup>GRAS has not been published yet but is described in the [MedBERT.de paper](https://arxiv.org/abs/2303.08179).
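The Avg column in the table above is consistent with an unweighted mean over the eight task columns. The sketch below checks this using the mean F1 scores from the table; the card does not state how the average was computed, so treat this as an observation rather than the documented procedure. The names `SCORES`, `REPORTED_AVG`, and `average_f1` are illustrative.

```python
from statistics import mean

# Mean F1 per task, in table order: GE14, GQuAD, GE18, TS, GGP, GRAS, JS, DROC.
SCORES = {
    "GBERT-large": [88.48, 81.51, 54.37, 73.60, 79.17, 69.28, 76.32, 90.29],
    "GELECTRA-large": [88.39, 80.51, 55.41, 73.84, 79.09, 70.16, 73.73, 89.83],
    "GeBERTa-large": [88.84, 82.52, 53.76, 75.32, 78.35, 70.02, 82.16, 90.39],
    "GeBERTa-xlarge": [89.04, 85.05, 55.80, 76.25, 76.71, 67.92, 82.42, 90.63],
}

# Values of the Avg column as reported in the table.
REPORTED_AVG = {
    "GBERT-large": 76.63,
    "GELECTRA-large": 76.37,
    "GeBERTa-large": 77.67,
    "GeBERTa-xlarge": 77.98,
}


def average_f1(scores):
    """Return the unweighted mean of the task scores for each model."""
    return {model: mean(values) for model, values in scores.items()}


averages = average_f1(SCORES)
for model, avg in averages.items():
    print(f"{model:<15} computed {avg:.2f}  reported {REPORTED_AVG[model]:.2f}")
```

Because the average is unweighted, small datasets with high variance such as JS influence the ranking as much as the larger benchmarks.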
+## Publication
+A publication will follow soon.
+## Contact
+<amin.dada@uk-essen.de>