HuggingFaceFW
Organization Card
🤗 HuggingFace 🍷 FineWeb datasets
Read our technical report!
This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl), released under a permissive license (ODC-By).
The creation of 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone intending to pretrain high-performance large language models.
All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries datatrove, nanotron or lighteval.
Version 1 of the 🍷 FineWeb dataset is available here. Our ablation models can be found here.
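As a minimal sketch of how the hosted data might be inspected, the snippet below streams a few documents with the 🤗 datasets library instead of downloading the full corpus. The repo ID `HuggingFaceFW/fineweb`, the `sample-10BT` configuration name and the `text` field are assumptions not stated in the card text above, and `preview_fineweb` is a hypothetical helper introduced only for illustration.

```python
from itertools import islice

# Assumed identifiers (not given in the card text): the Hub repo ID and the
# name of a small sample configuration of the dataset.
FINEWEB_REPO = "HuggingFaceFW/fineweb"
SAMPLE_CONFIG = "sample-10BT"


def preview_fineweb(n_docs=3, max_chars=200):
    """Stream the first n_docs documents and return truncated text snippets.

    Hypothetical helper for illustration; requires network access and the
    `datasets` package when called.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote files lazily rather than
    # downloading the whole dataset first.
    stream = load_dataset(
        FINEWEB_REPO, name=SAMPLE_CONFIG, split="train", streaming=True
    )
    # Each streamed record is a dict; the document body is assumed to be
    # stored under the "text" key.
    return [doc["text"][:max_chars] for doc in islice(stream, n_docs)]
```

Calling `preview_fineweb()` returns a short list of text snippets. Streaming is used here because the corpus is far too large to download just for a quick look.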
Models (9)
HuggingFaceFW/ablation-model-fineweb-edu • Text Generation • 2.8k downloads • 5 likes
HuggingFaceFW/fineweb-edu-classifier • Text Classification • 1.37M downloads • 57 likes
HuggingFaceFW/ablation-model-redpajama2 • Text Generation • 323 downloads
HuggingFaceFW/ablation-model-the-pile • Text Generation • 451 downloads • 1 like
HuggingFaceFW/ablation-model-slimpajama • Text Generation • 503 downloads • 1 like
HuggingFaceFW/ablation-model-dolma-v1_6 • Text Generation • 80 downloads • 1 like
HuggingFaceFW/ablation-model-c4 • Text Generation • 524 downloads • 2 likes
HuggingFaceFW/ablation-model-refinedweb • Text Generation • 570 downloads • 2 likes
HuggingFaceFW/ablation-model-fineweb-v1 • Text Generation • 3.66k downloads • 13 likes