Hugging Face TB Research


AI & ML interests

Exploring synthetic datasets generated by Large Language Models (TB stands for Textbook, inspired by the "Textbooks Are All You Need" paper)



This is the home of synthetic datasets for pre-training, such as Cosmopedia. We aim to scale synthetic data generation by curating diverse prompts that cover a wide range of topics and by running generation efficiently on GPUs with tools like llm-swarm.
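To illustrate the prompt-curation idea, here is a minimal sketch of how crossing a small set of seed topics with output styles and target audiences yields many diverse generation prompts. The topic, style, and audience lists and the template below are illustrative assumptions, not the actual Cosmopedia prompt templates.

```python
from itertools import product

# Hypothetical seed ingredients (assumptions for illustration,
# not the real Cosmopedia curation pipeline).
TOPICS = ["photosynthesis", "binary search", "the French Revolution"]
STYLES = ["textbook chapter", "blog post", "WikiHow article"]
AUDIENCES = ["young children", "high school students", "college students"]

def build_prompts(topics, styles, audiences):
    """Cross topics x styles x audiences into distinct generation prompts."""
    template = (
        "Write a {style} about {topic} aimed at {audience}. "
        "Be clear, factual, and self-contained."
    )
    return [
        template.format(style=style, topic=topic, audience=audience)
        for topic, style, audience in product(topics, styles, audiences)
    ]

prompts = build_prompts(TOPICS, STYLES, AUDIENCES)
print(len(prompts))  # 3 * 3 * 3 = 27 distinct prompts
print(prompts[0])
```

Each prompt would then be sent to an instruction-tuned model (Cosmopedia used Mixtral-8x7B-Instruct-v0.1), with the cross-product structure keeping the outputs diverse even from a small pool of seeds.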

We recently released:

  • Cosmopedia: the largest open synthetic dataset at the time of release, with 25B tokens and more than 30M samples. It contains synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
  • Cosmo-1B: a 1B-parameter model trained on Cosmopedia.

For more details, check out our blog post: