Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Abstract
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal, read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild data capturing spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we extend Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showing superior performance in capturing the diverse speaker timbre and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and cross-lingual speech generation.
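The abstract describes Emilia-Pipe only at a high level. As a rough illustration of the kind of quality filtering such a preprocessing pipeline might apply to in-the-wild segments, here is a minimal sketch; the segment fields, duration window, and quality threshold are all illustrative assumptions, not the published pipeline.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A candidate training segment cut from an in-the-wild recording.
    Fields and score scale are hypothetical, for illustration only."""
    audio_id: str
    start: float          # seconds
    end: float            # seconds
    transcript: str       # e.g. from an ASR stage
    quality_score: float  # e.g. a DNSMOS-like estimate in [1, 5]

    @property
    def duration(self) -> float:
        return self.end - self.start

def filter_segments(segments, min_dur=3.0, max_dur=30.0, min_quality=3.0):
    """Keep only segments suitable for speech-generation training:
    within a duration window, above a quality floor, and transcribed.
    Thresholds are assumed defaults, not the authors' values."""
    return [
        s for s in segments
        if min_dur <= s.duration <= max_dur
        and s.quality_score >= min_quality
        and s.transcript.strip()
    ]

if __name__ == "__main__":
    raw = [
        Segment("podcast_001", 0.0, 12.5, "welcome back to the show", 3.8),
        Segment("podcast_001", 12.5, 13.1, "uh", 3.5),             # too short
        Segment("podcast_002", 0.0, 20.0, "", 4.1),                # no transcript
        Segment("podcast_002", 20.0, 45.0, "long monologue", 3.9), # too long
    ]
    print(len(filter_segments(raw)))  # 1
```

In a full pipeline this filtering step would sit after earlier stages such as source separation, speaker diarization, and ASR, so that only clean, transcribed single-speaker segments reach the training set.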
Community
Extended version of Emilia, submitted to TASLP. The initial 101k hours version of Emilia has already been open-sourced on HuggingFace: https://huggingface.co/datasets/amphion/Emilia-Dataset.
Now, we are releasing an extended version with over 250k hours of speech data! Coming soon on HuggingFace!
The following similar papers were recommended by the Semantic Scholar API:
- Autoregressive Speech Synthesis with Next-Distribution Prediction (2024)
- Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey (2024)
- OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia (2025)
- Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers (2024)
- OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis (2025)
- TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch (2024)
- Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement (2025)