amindada committed on
Commit
0bd657b
1 Parent(s): c8128ea

Update README.md

Files changed (1)
  1. README.md +28 -7
README.md CHANGED
@@ -66,13 +66,34 @@ The following table presents the F1 scores:
  ## Publication
 
  ```bibtex
- @misc{dada2023impact,
-       title={On the Impact of Cross-Domain Data on German Language Models},
-       author={Amin Dada and Aokun Chen and Cheng Peng and Kaleb E Smith and Ahmad Idrissi-Yaghir and Constantin Marc Seibold and Jianning Li and Lars Heiliger and Xi Yang and Christoph M. Friedrich and Daniel Truhn and Jan Egger and Jiang Bian and Jens Kleesiek and Yonghui Wu},
-       year={2023},
-       eprint={2310.07321},
-       archivePrefix={arXiv},
-       primaryClass={cs.CL}
+ @inproceedings{dada-etal-2023-impact,
+     title = "On the Impact of Cross-Domain Data on {G}erman Language Models",
+     author = "Dada, Amin and
+       Chen, Aokun and
+       Peng, Cheng and
+       Smith, Kaleb and
+       Idrissi-Yaghir, Ahmad and
+       Seibold, Constantin and
+       Li, Jianning and
+       Heiliger, Lars and
+       Friedrich, Christoph and
+       Truhn, Daniel and
+       Egger, Jan and
+       Bian, Jiang and
+       Kleesiek, Jens and
+       Wu, Yonghui",
+     editor = "Bouamor, Houda and
+       Pino, Juan and
+       Bali, Kalika",
+     booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
+     month = dec,
+     year = "2023",
+     address = "Singapore",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2023.findings-emnlp.922",
+     doi = "10.18653/v1/2023.findings-emnlp.922",
+     pages = "13801--13813",
+     abstract = "Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45{\%} over the previous state-of-the-art.",
  }
  ```
  ## Contact
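
For reference, a minimal LaTeX sketch of how the updated entry could be cited, assuming the BibTeX block above is saved to a bibliography file; the `references.bib` filename and the natbib/`plainnat` setup are illustrative assumptions, not part of the README.

```latex
% Minimal citation sketch. Assumptions: the @inproceedings entry above is stored
% in references.bib, and natbib with the plainnat style is available; neither is
% prescribed by the README.
\documentclass{article}
\usepackage{natbib}

\begin{document}
The cross-domain German language models are described by \citet{dada-etal-2023-impact}.

\bibliographystyle{plainnat}
\bibliography{references}
\end{document}
```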