arxiv:2010.12741

An Evaluation Protocol for Generative Conversational Systems

Published on Oct 24, 2020
Authors:

Abstract

There is a multitude of novel generative models for open-domain conversational systems, yet there is no systematic evaluation across them. Systematic comparison requires consistency in experimental design, evaluation sets, conversational systems and their outputs, and statistical analysis. We lay out a protocol for evaluating conversational models using head-to-head pairwise comparison. We analyze ten recent models that claim state-of-the-art performance using paired head-to-head (win-loss-tie) comparisons on five evaluation datasets. Our findings show that DialoGPT and Blender are the superior systems according to both the Bradley-Terry model and TrueSkill ranking methods. These findings demonstrate the feasibility of our protocol for evaluating conversational agents and evaluation sets. Finally, we make all code and evaluations publicly available so that researchers can compare their models to other state-of-the-art dialog models.
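
The ranking step aggregates pairwise win-loss-tie judgments into a single ordering of systems. As a rough illustration of how such an aggregation works, the sketch below fits a Bradley-Terry model to a pairwise win matrix using the standard minorization-maximization update. This is not the authors' released code; the win counts are hypothetical, and ties are assumed to be split as half a win for each side before fitting.

```python
import numpy as np

def bradley_terry(wins, n_iter=1000, tol=1e-8):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i, j] = number of times system i beat system j
    (split ties as half a win for each side beforehand).
    Returns strengths normalized to sum to 1; higher means stronger.
    """
    n = wins.shape[0]
    comparisons = wins + wins.T      # n_ij: total comparisons between i and j
    total_wins = wins.sum(axis=1)    # w_i: total wins of system i
    p = np.ones(n)                   # initial strength estimates

    for _ in range(n_iter):
        # MM update: p_i <- w_i / sum_j n_ij / (p_i + p_j)
        denom = comparisons / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = total_wins / denom.sum(axis=1)
        p_new /= p_new.sum()         # normalize for identifiability
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Hypothetical win counts for three systems (rows beat columns).
wins = np.array([[0, 12,  9],
                 [8,  0,  7],
                 [6,  5,  0]], dtype=float)
strengths = bradley_terry(wins)
print("strengths:", strengths.round(3))
print("ranking (best first):", np.argsort(-strengths))
```

A TrueSkill-style ranking can be derived from the same pairwise outcomes, for example with the `trueskill` Python package, by treating each judged pair as a two-player match.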
