
# GeBERTa

GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM. The models range in size from 122M to 750M parameters.

## Model details

The models follow the architecture of DeBERTa-v2 and use SentencePiece tokenizers. The base and large models use a 50k token vocabulary, while the xlarge model uses a 128k token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps and have a maximum sequence length of 512 tokens.
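The checkpoints can be loaded with the standard Hugging Face `transformers` API. Below is a minimal usage sketch for masked-language-model inference; the repository ID `ikim-uk-essen/geberta-large` is assumed from the Hub page hosting this card, and the example sentence is purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Repository ID assumed from the Hub page hosting this card; adjust if needed.
model_id = "ikim-uk-essen/geberta-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Predict the token behind [MASK] in a German example sentence.
text = "Die Hauptstadt von Deutschland ist [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```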

## Dataset

The pre-training dataset consists of documents from different domains:

| Domain | Dataset | Data Size | #Docs | #Tokens |
|---|---|---|---|---|
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts (translated) | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III (translated) | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS (translated) | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books (translated) | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |

## Benchmark

In a comprehensive benchmark, we evaluated existing German models alongside our own. The benchmark covered a variety of task types, such as question answering, classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets. When a dataset provided official training, development, and test sets, we used them accordingly.

Where no such splits were available, we randomly split the data into 80% for training, 10% for validation, and 10% for testing. The following table presents the F1 scores (a fine-tuning sketch follows the table):

| Model | GE14 | GQuAD | GE18 | TS | GGP | GRAS¹ | JS | DROC | Avg |
|---|---|---|---|---|---|---|---|---|---|
| GBERT<sub>large</sub> | 88.48±0.23 | 81.51±0.84 | 54.37±1.65 | 73.60±0.61 | 79.17±0.14 | 69.28±0.80 | 76.32±4.42 | 90.29±0.15 | 76.63±0.63 |
| GELECTRA<sub>large</sub> | 88.39±0.13 | 80.51±0.41 | 55.41±1.54 | 73.84±0.86 | 79.09±0.09 | 70.16±0.92 | 73.73±2.35 | 89.83±0.27 | 76.37±0.69 |
| GeBERTa<sub>large</sub> | 88.84±0.18 | 82.52±0.59 | 53.76±1.86 | 75.32±0.53 | 78.35±0.08 | 70.02±1.34 | 82.16±2.36 | 90.39±0.24 | 77.67±0.69 |
| GeBERTa<sub>xlarge</sub> | 89.04±0.26 | 85.05±0.63 | 55.80±1.42 | 76.25±0.704 | 76.71±0.08 | 67.92±1.00 | 82.42±4.70 | 90.63±0.21 | 77.98±0.62 |

¹ GRAS is not yet published but is described in the MedBERT.de paper.
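For reference, the sketch below shows how such a fine-tuning run might look for a classification-style task. It is hypothetical: the CSV path, label count, and hyperparameters are placeholders rather than the exact benchmark configuration, and the repository ID is assumed; the 80/10/10 split mirrors the procedure described above for datasets without official partitions.

```python
# Hedged fine-tuning sketch (hypothetical data file and hyperparameters).
# Expects a CSV with "text" and "label" columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "ikim-uk-essen/geberta-large"  # repository ID assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# 80/10/10 split for a dataset without official partitions:
# hold out 20%, then halve the held-out part into validation and test.
raw = load_dataset("csv", data_files="task_data.csv")["train"]
split = raw.train_test_split(test_size=0.2, seed=42)
held_out = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], held_out["train"], held_out["test"]

def tokenize(batch):
    # Truncate to the model's 512-token maximum sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="geberta-large-finetuned",
                         per_device_train_batch_size=16,
                         learning_rate=2e-5,
                         num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  tokenizer=tokenizer)
trainer.train()
```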

## Publication

@inproceedings{dada-etal-2023-impact,
    title = "On the Impact of Cross-Domain Data on {G}erman Language Models",
    author = "Dada, Amin  and
      Chen, Aokun  and
      Peng, Cheng  and
      Smith, Kaleb  and
      Idrissi-Yaghir, Ahmad  and
      Seibold, Constantin  and
      Li, Jianning  and
      Heiliger, Lars  and
      Friedrich, Christoph  and
      Truhn, Daniel  and
      Egger, Jan  and
      Bian, Jiang  and
      Kleesiek, Jens  and
      Wu, Yonghui",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.922",
    doi = "10.18653/v1/2023.findings-emnlp.922",
    pages = "13801--13813",
    abstract = "Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45{\%} over the previous state-of-the-art.",
}

## Contact

amin.dada@uk-essen.de