Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models • Mar 20
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only • Paper • 2306.01116 • Published Jun 1, 2023