Papers
arxiv:2305.02606

Re^3Dial: Retrieve, Reorganize and Rescale Dialogue Corpus for Long-Turn Open-Domain Dialogue Pre-training

Published on May 4, 2023
Authors:
,

Abstract

Large-scale open-domain dialogue data crawled from public social media has greatly improved the performance of dialogue models. However, long-turn dialogues are still highly scarce. Specifically, most dialogue sessions in existing corpora have less than three turns. To alleviate this issue, we propose the Retrieve, Reorganize and Rescale framework (Re^3Dial), which can automatically construct a billion-scale long-turn dialogue corpus from existing short-turn dialogue data. Re^3Dial first trains an Unsupervised Dense Session Retriever (UDSR) to capture semantic and discourse relationships within multi-turn dialogues for retrieving relevant and coherent sessions. It then reorganizes the short-turn dialogues into long-turn sessions via recursively retrieving and selecting the consecutive sessions with our proposed diversity sampling strategy. Extensive evaluations on multiple multi-turn dialogue benchmarks demonstrate that Re^3Dial consistently and significantly improves the dialogue model's ability to utilize long-term context for modeling multi-turn dialogues across different pre-training settings. Finally, we build a toolkit for efficiently rescaling dialogue corpus with Re^3Dial, which enables us to construct a corpus containing 1B Chinese dialogue sessions with 11.3 turns on average (5X longer than the original EVA corpus). We will release our UDSR model, toolkit, and data for public use.

Community

Sign up or log in to comment

Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2305.02606 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2305.02606 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.