A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos
Hugging Face TB Research
AI & ML interests
Exploring synthetic datasets generated by Large Language Models (TB stands for Textbook, inspired by the "Textbooks Are All You Need" paper).
Organization Card
HuggingFaceTB
This is the home of synthetic datasets for pre-training, such as Cosmopedia. We're trying to scale synthetic data generation by curating diverse prompts that cover a wide range of topics and efficiently scaling the generations on GPUs with tools like llm-swarm.
We recently released:
- Cosmopedia: the largest open synthetic dataset, with 25B tokens and more than 30M samples. It contains synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
- Cosmo-1B: a 1B model trained on Cosmopedia.
For more details, see our blog post: https://huggingface.co/blog/cosmopedia
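As a rough illustration of what "curating diverse prompts that cover a wide range of topics" can look like, here is a minimal Python sketch. It is not the actual Cosmopedia pipeline: the seed lists, the prompt wording, and the `build_prompts` helper are all hypothetical, and the generation step itself (for example with Mixtral-8x7B-Instruct-v0.1 behind llm-swarm) is left out.

```python
from itertools import product

# Hypothetical seed lists for illustration only; a real pipeline would
# draw topics from much larger curated or web-derived sources.
TOPICS = ["photosynthesis", "binary search", "supply and demand"]
AUDIENCES = ["young children", "high school students", "college students"]
FORMATS = ["textbook section", "blog post", "story", "WikiHow article"]


def build_prompts(topics, audiences, formats):
    """Return one prompt string per (topic, audience, format) combination."""
    prompts = []
    for topic, audience, fmt in product(topics, audiences, formats):
        prompts.append(
            f"Write a {fmt} about {topic} for {audience}. "
            "Keep it self-contained and educational."
        )
    return prompts


prompts = build_prompts(TOPICS, AUDIENCES, FORMATS)
print(len(prompts))   # 3 topics x 3 audiences x 4 formats = 36
print(prompts[0])
```

Crossing topics with audiences and formats is one simple way to widen coverage and reduce near-duplicate generations; each resulting prompt would then be sent to the generation model.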
Models (9)
- HuggingFaceTB/SmolLM-1.7B-Instruct • Text Generation • 26.2k • 60
- HuggingFaceTB/SmolLM-135M-Instruct • Text Generation • 21.8k • 42
- HuggingFaceTB/SmolLM-360M-Instruct • Text Generation • 24.6k • 22
- HuggingFaceTB/SmolLM-135M • Text Generation • 48k • 74
- HuggingFaceTB/SmolLM-1.7B • Text Generation • 29.6k • 109
- HuggingFaceTB/SmolLM-360M • Text Generation • 24.5k • 30
- HuggingFaceTB/python-edu-scorer • Text Classification • 141 • 10
- HuggingFaceTB/cosmo-1b • Text Generation • 1.75k • 123
- HuggingFaceTB/bisac_34M_1k
Datasets (20)
- HuggingFaceTB/smollm-corpus • Viewer • 237M • 2.69k • 121
- HuggingFaceTB/images • Viewer • 32 • 2
- HuggingFaceTB/sample_log_probs • Viewer • 20k • 7 • 2
- HuggingFaceTB/cosmopedia_stanford_openstax_wiki_1k • Viewer • 3k • 44
- HuggingFaceTB/cosmopedia_web_textbooks_all_2B • 1
- HuggingFaceTB/cosmopedia • Viewer • 31.1M • 8.03k • 530
- HuggingFaceTB/wiki_applied_sciences_college_students_1k • Viewer • 1k • 1
- HuggingFaceTB/wiki_natural_sciences_college_high_school_students_1k • Viewer • 1k • 8 • 1
- HuggingFaceTB/bisac-topics • Viewer • 5.5k • 1 • 1
- HuggingFaceTB/web_under_line_mean_100 • Viewer • 1.16k • 5