arxiv:2406.20094

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Published on Jun 28
Submitted by xywang1 on Jul 1
#1 Paper of the day

Abstract

We propose a novel persona-driven data synthesis methodology that leverages the various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs, and tools (functions) at scale, we demonstrate that persona-driven data synthesis is versatile, scalable, flexible, and easy to use. It has the potential to drive a paradigm shift in synthetic data creation and its applications in practice, with a profound impact on LLM research and development.

Community

Paper author and submitter:

Update: We added the demo code for synthesizing data with Persona Hub at our Github page:

https://github.com/tencent-ailab/persona-hub

Download our released data through Hugging Face: https://huggingface.co/datasets/proj-persona/PersonaHub


An illustration to demonstrate how to use Persona Hub to create synthetic data for various data synthesis scenarios
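The core idea is simple: a persona description is inserted into a data-synthesis prompt, and each of the 1 billion personas steers the LLM toward a different region of its knowledge, yielding diverse outputs from a single template. The sketch below illustrates this prompting pattern; the template wording and the example persona are illustrative assumptions, not the exact prompts from the paper (see the authors' GitHub demo for those).

```python
# Minimal sketch of persona-driven prompt construction.
# The template and persona text are hypothetical examples in the
# style described by the paper, not the authors' actual prompts.

PROMPT_TEMPLATE = (
    "Create a challenging math problem that the following persona "
    "might encounter or pose:\n\n"
    "Persona: {persona}\n\n"
    "Math problem:"
)

def build_persona_prompt(persona: str) -> str:
    """Insert a persona description into the synthesis prompt template."""
    return PROMPT_TEMPLATE.format(persona=persona)

# One hypothetical persona; at scale, this loop would run over the
# 1B personas in Persona Hub, sending each prompt to an LLM.
persona = "a structural engineer who designs earthquake-resistant bridges"
prompt = build_persona_prompt(persona)
print(prompt)
```

Varying only the `{persona}` slot while keeping the template fixed is what makes the approach scalable: diversity comes from the persona collection rather than from hand-crafted prompt variations.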


Persona-driven data synthesis from the perspective of compression


Great work! This may be the first work to address scalable synthetic data creation (with the potential to create trillions of tokens). What about the experiments on knowledge-rich texts (Section 4.4)? Are there any performance gains on widely used benchmarks such as MMLU, BBH, etc.?


Thank you for your interest. The performance gain depends on the performance of the model used for generating synthetic data. As we discussed in Section 5, the more powerful the model, the higher the quality of the generated data, and the greater the potential gain.


Feel free to try the demo to create synthetic data with personas :)

