arXiv:2306.00168

Measuring the Robustness of Natural Language Processing Models to Domain Shifts

Published on May 31, 2023

Abstract

Existing research on Domain Robustness (DR) suffers from disparate setups, a lack of evaluation task variety, and reliance on challenge sets. In this paper, we pose a fundamental question: What is the state of affairs of the DR challenge in the era of Large Language Models (LLMs)? To this end, we construct a DR benchmark comprising diverse NLP tasks, including sentence- and token-level classification, QA, and generation, where each task consists of several domains. We explore the DR challenge of fine-tuned and few-shot learning models in natural domain shift settings and devise two diagnostic metrics of Out-of-Distribution (OOD) performance degradation: the commonly used Source Drop (SD) and the overlooked Target Drop (TD). Our findings reveal important insights. First, despite their capabilities, zero-to-few-shot LLMs and fine-tuning approaches still fail to reach satisfactory performance in the OOD context. Second, TD approximates the average OOD degradation better than SD does. Third, in a significant proportion of domain shifts, either SD or TD is positive, but not both; disregarding one of them can therefore lead to incorrect conclusions about DR.
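To make the two diagnostics concrete, the following Python sketch shows one plausible way to compute them from a train-domain by eval-domain performance matrix. The abstract does not spell out the formulas, so the definitions used here (SD compares the source-trained model's OOD score against its own in-domain score, while TD compares it against a model trained on the target domain), along with the perf matrix, domain names, and function names, are illustrative assumptions rather than the paper's exact specification.

# Hedged sketch of Source Drop (SD) and Target Drop (TD) for a
# source -> target domain shift. perf[train][eval] is a hypothetical
# matrix of task scores (e.g., accuracy or F1). Assumed definitions:
#   SD(s -> t) = perf(s, s) - perf(s, t)
#   TD(s -> t) = perf(t, t) - perf(s, t)

def source_drop(perf: dict, source: str, target: str) -> float:
    """In-domain score of the source-trained model minus its OOD score on the target."""
    return perf[source][source] - perf[source][target]

def target_drop(perf: dict, source: str, target: str) -> float:
    """In-domain score of a target-trained model minus the source model's OOD score."""
    return perf[target][target] - perf[source][target]

# Toy example: rows are training domains, columns are evaluation domains.
perf = {
    "news":    {"news": 0.90, "reviews": 0.78},
    "reviews": {"news": 0.80, "reviews": 0.85},
}

sd = source_drop(perf, "news", "reviews")  # 0.90 - 0.78 = 0.12
td = target_drop(perf, "news", "reviews")  # 0.85 - 0.78 = 0.07
print(f"SD = {sd:.2f}, TD = {td:.2f}")

Under these assumed definitions, SD and TD can disagree in sign: if the target domain is intrinsically easier, a source-trained model may score higher on the target than in its own domain (negative SD) while still trailing a target-trained model (positive TD). This illustrates the abstract's third finding, that looking at only one of the two metrics can lead to incorrect DR conclusions.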
