Resources for Cosmopedia dataset
Hugging Face TB Research
AI & ML interests
Exploring synthetic datasets generated by Large Language Models (TB is for Textbook, as inspired by the "Textbooks Are All You Need" paper)
Organization Card
HuggingFaceTB
This is the home of synthetic datasets for pre-training, such as Cosmopedia. We're trying to scale synthetic data generation by curating diverse prompts that cover a wide range of topics and efficiently scaling the generations on GPUs with tools like llm-swarm.
We recently released:
- Cosmopedia: the largest open synthetic dataset, with 25B tokens and more than 30M samples. It contains synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
- Cosmo-1B: a 1B-parameter model trained on Cosmopedia.
For more details, check our blog post: https://huggingface.co/blog/cosmopedia
Collections: 1
Spaces: 1
Datasets: 17
- HuggingFaceTB/cosmopedia_web_textbooks_all_2B
- HuggingFaceTB/cosmopedia_2B_annotated_edu_score
- HuggingFaceTB/cosmopedia
- HuggingFaceTB/wiki_applied_sciences_college_students_1k
- HuggingFaceTB/wiki_natural_sciences_college_high_school_students_1k
- HuggingFaceTB/images
- HuggingFaceTB/bisac-topics
- HuggingFaceTB/web_under_line_mean_100
- HuggingFaceTB/cosmopedia_6M
- HuggingFaceTB/cosmopedia-20k