MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Abstract
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with open-source principles, despite claims to the contrary: no existing SFM has model weights, code, and training data all publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source-compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.
Community
More than 950,000 hours of speech in the 24 EU languages are available under open-source terms on GitHub, and pseudo-labels for more than 440,000 hours are already available on Hugging Face!