Language models, Domain adaptation, Question Answering, Neural Search
Team members 6
In order to raise the bar for non-English QA, we are releasing a high-quality, human-labeled German QA dataset consisting of 13 722 questions, incl. a three-way annotated test set. The creation of GermanQuAD is inspired by insights from existing datasets as well as our labeling experience from several industry projects. We combine the strengths ...
We take GermanQuAD as a starting point and add hard negatives from a dump of the full German Wikipedia following the approach of the DPR authors (Karpukhin et al., 2020). The format of the dataset also resembles the one of DPR. GermanDPR comprises 9275 question/answer pairs in the training set and 1025 pairs in the test set. For each pair, there...