arxiv:2202.03799

What are the best systems? New perspectives on NLP Benchmarking

Published on Feb 8, 2022

Abstract

In machine learning, a benchmark is a collection of datasets associated with one or more metrics, together with a way to aggregate the performance of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case in NLP, where large pre-trained models (e.g. GPT, BERT) are expected to generalize well across a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions about state-of-the-art systems than the mean-aggregation procedure while being both more reliable and more robust.
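To make the contrast between averaging scores and aggregating rankings concrete, here is a minimal sketch on invented toy data. It uses the Borda count, a standard social-choice rule, as an assumed stand-in for rank aggregation; the abstract does not say which rule the paper uses. The system names, task scores, and function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical scores: three systems on four tasks. The first three tasks
# use a 0-1 metric; the fourth uses a 0-100 metric (a different scale).
scores: Dict[str, List[float]] = {
    "A": [0.70, 0.60, 0.50, 90.0],
    "B": [0.80, 0.70, 0.60, 40.0],
    "C": [0.75, 0.65, 0.55, 60.0],
}


def mean_aggregate(scores: Dict[str, List[float]]) -> List[str]:
    """Order systems by their average raw score across tasks, best first."""
    means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(means, key=lambda name: (-means[name], name))


def borda_aggregate(scores: Dict[str, List[float]]) -> List[str]:
    """Order systems by Borda count over the per-task rankings, best first.

    On each task a system earns one point per system it outscores.
    Tied scores within a task are not handled specially in this sketch.
    """
    names = sorted(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {name: 0 for name in names}
    for task in range(n_tasks):
        # Sort worst to best on this task; the index equals the points earned.
        worst_to_best = sorted(names, key=lambda name: scores[name][task])
        for position, name in enumerate(worst_to_best):
            points[name] += position
    return sorted(points, key=lambda name: (-points[name], name))


print(mean_aggregate(scores))   # ['A', 'C', 'B']
print(borda_aggregate(scores))  # ['B', 'C', 'A']
```

In this toy example, system B is best on three of the four tasks, yet the mean places it last because the single large-scale task dominates the average. The rank-based ordering depends only on the per-task orderings, so it is unaffected by the scale of any one metric.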
