amindada committed on
Commit
0bd657b
1 Parent(s): c8128ea

Update README.md

Files changed (1)
  1. README.md +28 -7
README.md CHANGED
@@ -66,13 +66,34 @@ The following table presents the F1 scores:
  ## Publication
 
  ```bibtex
- @misc{dada2023impact,
-       title={On the Impact of Cross-Domain Data on German Language Models},
-       author={Amin Dada and Aokun Chen and Cheng Peng and Kaleb E Smith and Ahmad Idrissi-Yaghir and Constantin Marc Seibold and Jianning Li and Lars Heiliger and Xi Yang and Christoph M. Friedrich and Daniel Truhn and Jan Egger and Jiang Bian and Jens Kleesiek and Yonghui Wu},
-       year={2023},
-       eprint={2310.07321},
-       archivePrefix={arXiv},
-       primaryClass={cs.CL}
+ @inproceedings{dada-etal-2023-impact,
+     title = "On the Impact of Cross-Domain Data on {G}erman Language Models",
+     author = "Dada, Amin and
+       Chen, Aokun and
+       Peng, Cheng and
+       Smith, Kaleb and
+       Idrissi-Yaghir, Ahmad and
+       Seibold, Constantin and
+       Li, Jianning and
+       Heiliger, Lars and
+       Friedrich, Christoph and
+       Truhn, Daniel and
+       Egger, Jan and
+       Bian, Jiang and
+       Kleesiek, Jens and
+       Wu, Yonghui",
+     editor = "Bouamor, Houda and
+       Pino, Juan and
+       Bali, Kalika",
+     booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
+     month = dec,
+     year = "2023",
+     address = "Singapore",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2023.findings-emnlp.922",
+     doi = "10.18653/v1/2023.findings-emnlp.922",
+     pages = "13801--13813",
+     abstract = "Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45{\%} over the previous state-of-the-art.",
  }
  ```
  ## Contact
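
For reference, a minimal LaTeX sketch of how the updated entry could be cited, assuming the BibTeX block above is saved to a bibliography file; the `references.bib` filename and the natbib/`plainnat` setup are illustrative assumptions, not part of the README.

```latex
% Minimal citation sketch. Assumptions: the @inproceedings entry above is stored
% in references.bib, and natbib with the plainnat style is available; neither is
% prescribed by the README.
\documentclass{article}
\usepackage{natbib}

\begin{document}
The cross-domain German language models are described by \citet{dada-etal-2023-impact}.

\bibliographystyle{plainnat}
\bibliography{references}
\end{document}
```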