
# GeBERTa

GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM. The models range in size from 122M to 750M parameters.

## Model details

The models follow the architecture of DeBERTa-v2 and use SentencePiece tokenizers. The base and large models use a 50k token vocabulary, while the xlarge model uses a 128k token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps and have a maximum sequence length of 512 tokens.
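The checkpoints can be loaded with the standard Hugging Face `transformers` API. Below is a minimal usage sketch for masked-language-model inference; the repository ID `ikim-uk-essen/geberta-large` is assumed from the Hub page hosting this card, and the example sentence is purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Repository ID assumed from the Hub page hosting this card; adjust if needed.
model_id = "ikim-uk-essen/geberta-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Predict the token behind [MASK] in a German example sentence.
text = "Die Hauptstadt von Deutschland ist [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```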

## Dataset

The pre-training dataset consists of documents from different domains:

| Domain | Dataset | Data Size | #Docs | #Tokens |
|---|---|---|---|---|
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts (translated) | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III (translated) | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS (translated) | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books (translated) | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |

## Benchmark

In a comprehensive benchmark, we evaluated existing German models alongside our own. The benchmark covered a variety of task types, such as question answering, classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets. When a dataset provided official training, development, and test sets, we used them accordingly.

Where no such splits were available, we randomly split the data into 80% for training, 10% for validation, and 10% for testing. The following table presents the F1 scores (a fine-tuning sketch follows the table):

| Model | GE14 | GQuAD | GE18 | TS | GGP | GRAS¹ | JS | DROC | Avg |
|---|---|---|---|---|---|---|---|---|---|
| GBERT<sub>large</sub> | 88.48±0.23 | 81.51±0.84 | 54.37±1.65 | 73.60±0.61 | 79.17±0.14 | 69.28±0.80 | 76.32±4.42 | 90.29±0.15 | 76.63±0.63 |
| GELECTRA<sub>large</sub> | 88.39±0.13 | 80.51±0.41 | 55.41±1.54 | 73.84±0.86 | 79.09±0.09 | 70.16±0.92 | 73.73±2.35 | 89.83±0.27 | 76.37±0.69 |
| GeBERTa<sub>large</sub> | 88.84±0.18 | 82.52±0.59 | 53.76±1.86 | 75.32±0.53 | 78.35±0.08 | 70.02±1.34 | 82.16±2.36 | 90.39±0.24 | 77.67±0.69 |
| GeBERTa<sub>xlarge</sub> | 89.04±0.26 | 85.05±0.63 | 55.80±1.42 | 76.25±0.704 | 76.71±0.08 | 67.92±1.00 | 82.42±4.70 | 90.63±0.21 | 77.98±0.62 |

¹ GRAS is not yet published but is described in the MedBERT.de paper.
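For reference, the sketch below shows how such a fine-tuning run might look for a classification-style task. It is hypothetical: the CSV path, label count, and hyperparameters are placeholders rather than the exact benchmark configuration, and the repository ID is assumed; the 80/10/10 split mirrors the procedure described above for datasets without official partitions.

```python
# Hedged fine-tuning sketch (hypothetical data file and hyperparameters).
# Expects a CSV with "text" and "label" columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "ikim-uk-essen/geberta-large"  # repository ID assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# 80/10/10 split for a dataset without official partitions:
# hold out 20%, then halve the held-out part into validation and test.
raw = load_dataset("csv", data_files="task_data.csv")["train"]
split = raw.train_test_split(test_size=0.2, seed=42)
held_out = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], held_out["train"], held_out["test"]

def tokenize(batch):
    # Truncate to the model's 512-token maximum sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="geberta-large-finetuned",
                         per_device_train_batch_size=16,
                         learning_rate=2e-5,
                         num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  tokenizer=tokenizer)
trainer.train()
```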

## Publication

@inproceedings{dada-etal-2023-impact,
    title = "On the Impact of Cross-Domain Data on {G}erman Language Models",
    author = "Dada, Amin  and
      Chen, Aokun  and
      Peng, Cheng  and
      Smith, Kaleb  and
      Idrissi-Yaghir, Ahmad  and
      Seibold, Constantin  and
      Li, Jianning  and
      Heiliger, Lars  and
      Friedrich, Christoph  and
      Truhn, Daniel  and
      Egger, Jan  and
      Bian, Jiang  and
      Kleesiek, Jens  and
      Wu, Yonghui",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.922",
    doi = "10.18653/v1/2023.findings-emnlp.922",
    pages = "13801--13813",
    abstract = "Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45{\%} over the previous state-of-the-art.",
}

## Contact

amin.dada@uk-essen.de