How to Synthesize Text Data without Model Collapse?
Abstract
Model collapse refers to the phenomenon in which iterative training on self-generated data leads to a gradual decline in model performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem, and future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models on varying proportions of synthetic data and observe a negative correlation between the proportion of synthetic data and model performance. Statistical analysis of the synthetic data further reveals a distributional shift and an over-concentration of n-gram features. Motivated by these findings, we propose token-level editing of human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, because the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical analysis: token-level editing improves data quality and enhances model performance.
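As a purely qualitative illustration of the final claim above, the display below contrasts the two regimes using notation common in linear-regression analyses of model collapse (noise level σ², feature dimension d, per-generation sample size T, generation index n). The symbols and the linear-growth form for fully synthetic recursion are assumptions borrowed from that line of work, not the paper's exact theorem; the paper's contribution is the finite bound sketched in the second line.

```latex
% Qualitative contrast only; exact statements and constants are in the paper.
\begin{align*}
  \text{fully synthetic recursion:} \quad
    & \mathbb{E}\big[\mathrm{Err}_{\mathrm{test}}^{(n)}\big]
      \;\approx\; \frac{\sigma^2 d}{T}\,(n+1)
      && \text{(error accumulates with generation count } n\text{)} \\
  \text{token-level editing:} \quad
    & \mathbb{E}\big[\mathrm{Err}_{\mathrm{test}}^{(n)}\big]
      \;\le\; C \ \ \text{for all } n
      && \text{(finite upper bound, no collapse)}
\end{align*}
```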
Community
As generative artificial intelligence (AI) becomes increasingly prevalent in research and industry, synthetic data will proliferate throughout the web data ecosystem. Consequently, future training of GPT-n on a mixture of synthetic and human-produced data will be inevitable. Thus, model collapse is a critical concern that must be considered when training models on synthetic data.
In this paper, we focus on two questions:
(Q1): What is the impact of synthetic data on language model training?
(Q2): How to synthesize data without model collapse?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training (2024)
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models (2024)
- Analyzing and Improving Model Collapse in Rectified Flow Models (2024)
- Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World (2024)
- Large Language Models Can Self-Improve in Long-context Reasoning (2024)
- Evaluating Language Models as Synthetic Data Generators (2024)
- A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions (2024)
Here are the 15 key points summarizing the paper:
Model collapse occurs when training on synthetic data, leading to performance degradation. This is becoming a critical concern as future AI models will inevitably be trained on mixed synthetic and human-produced data.
The researchers found a negative correlation between the proportion of synthetic data used and model performance: the more synthetic data in the pre-training mixture, the worse the resulting language model performed.
Statistical analysis revealed that synthetic data suffers from "coverage collapse": it covers only a narrow portion of the human data distribution and lacks long-tail features.
Synthetic data also shows an over-concentration of n-gram features compared to human-produced data, making it less diverse and comprehensive (a small measurement sketch follows this list).
Traditional data selection methods cannot effectively mitigate the distribution shift between synthetic and human data.
The researchers propose "token-level editing" as a solution, which selectively modifies tokens in human-produced data rather than generating entirely synthetic data.
They provide theoretical proof that token-level editing prevents model collapse by maintaining a finite upper bound on test error.
The method was validated through experiments across three training scenarios: pre-training from scratch, continual pre-training, and supervised fine-tuning.
Results showed consistent improvements in model performance when using token-edited data compared to both pure synthetic data and original human data.
The token editing process uses a prior distribution (a pre-trained language model) to decide which tokens to modify, based on their conditional probabilities (a minimal code sketch follows this list).
The approach is computationally efficient, requiring only a single forward pass rather than autoregressive generation.
The researchers distinguish between "pure synthetic data" (generated entirely by AI) and "semi-synthetic data" (modified human data), with the latter showing better results.
The work centers on two critical questions: understanding synthetic data's impact on language model training and how to synthesize data without model collapse.
The findings suggest that maintaining distribution coverage and preserving long-tail features are crucial for successful use of synthetic data in training.
The research demonstrates that synthetic data can be beneficial when properly integrated with human data, but pure synthetic data risks model collapse and performance degradation.
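To illustrate the over-concentration point in the list above, the following sketch estimates how much of a corpus's bigram mass is captured by its most frequent bigrams. The corpora, the choice of bigrams, and the top-1k cutoff are assumptions made for illustration, not the paper's exact measurement protocol.

```python
# Illustrative sketch (not the paper's exact protocol): compare how concentrated
# the bigram distribution is in a "human" corpus vs. a "synthetic" corpus.
from collections import Counter
from itertools import islice

def ngrams(tokens, n=2):
    """Yield successive n-grams from a list of tokens."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def topk_mass(texts, n=2, k=1000):
    """Fraction of all n-gram occurrences captured by the k most frequent n-grams."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / max(total, 1)

# Hypothetical corpora: lists of documents (strings).
human_docs = ["..."]      # e.g. a sample of human-written web text
synthetic_docs = ["..."]  # e.g. a sample of LLM-generated text

print("human     top-1k bigram mass:", topk_mass(human_docs))
print("synthetic top-1k bigram mass:", topk_mass(synthetic_docs))
```

A markedly larger top-k mass for the synthetic corpus would indicate that a few n-grams dominate, i.e., the long tail of features present in human text is missing.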
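For the points above about the editing rule and its single-forward-pass cost, here is a minimal sketch of how such token-level editing could look in practice. It is an illustration under stated assumptions (GPT-2 as the prior, a 0.99 probability threshold, and resampling of edited positions from the prior's own distribution), not the authors' released implementation or exact configuration.

```python
# Minimal sketch of token-level editing (not the authors' released code).
# Assumptions: GPT-2 as the prior, threshold p = 0.99, and resampling of
# high-probability ("easy") tokens from the prior's own distribution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice of prior language model
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def token_level_edit(text: str, p: float = 0.99) -> str:
    ids = tok(text, return_tensors="pt").input_ids        # shape (1, L)
    logits = lm(input_ids=ids).logits                      # one forward pass, shape (1, L, V)
    probs = torch.softmax(logits[:, :-1], dim=-1)          # predictions for tokens 1..L-1
    next_ids = ids[:, 1:]
    # Conditional probability the prior assigns to each actual token.
    cond_p = probs.gather(-1, next_ids.unsqueeze(-1)).squeeze(-1)   # shape (1, L-1)
    edited = ids.clone()
    for i in range(cond_p.shape[1]):
        if cond_p[0, i] >= p:   # high-probability token: candidate for editing
            # Resample this position from the prior (assumed replacement rule);
            # low-probability, long-tail tokens are left as humans wrote them.
            edited[0, i + 1] = torch.multinomial(probs[0, i], num_samples=1).item()
    return tok.decode(edited[0], skip_special_tokens=True)

print(token_level_edit("The quick brown fox jumps over the lazy dog."))
```

Because all conditional probabilities come from a single forward pass over the human-written text, the cost is far lower than autoregressive generation, and the tokens the prior finds unlikely (the long tail) remain untouched.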
Excellent summary! I’ve reviewed the content, and it effectively conveys our results and key findings with clarity.
Thank you for your attention to detail! We’re open to further questions and discussions.