Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models • Mar 20 • 40
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale • Paper • 2406.17557 • Published 24 days ago • 75
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents • Paper • 2306.16527 • Published Jun 21, 2023 • 44
XTREME-S: Evaluating Cross-lingual Speech Representations • Paper • 2203.10752 • Published Mar 21, 2022 • 1