Share your open ML datasets on Hugging Face Hub!

Published November 12, 2024

Update on GitHub

Upvote

davanstrien Daniel van Strien

cfahlgren1 Caleb Fahlgren

lhoestq Quentin Lhoest

erinys Ann Huang

If you're working on data-intensive research or machine learning projects, you need a reliable way to share and host your datasets. Public datasets such as Common Crawl, ImageNet, Common Voice and more are critical to the open ML ecosystem, yet they can be challenging to host and share.

Hugging Face Hub makes it seamless to host and share datasets, trusted by many leading research institutions, companies, and government agencies, including Nvidia, Google, Stanford, NASA, THUDM and Barcelona Supercomputing Center.

By hosting a dataset on the Hugging Face Hub, you get instant access to features that can maximize your work's impact:

Generous Limits
Dataset Viewer
Third Party Library Support
SQL Console
Security
Reach and Visibility

Generous Limits

Support for large datasets

The Hub can host terabyte-scale datasets, with high per-file and per-repository limits. If you have data to share, the Hugging Face datasets team can help suggest the best format for uploading your data for community usage. The 🤗 Datasets library makes it easy to upload and download your files, or even create a dataset from scratch. 🤗 Datasets also enables dataset streaming , making it possible to work with large datasets without needing to download the entire thing. This can be invaluable to allow researchers with less computational resources to work with your datasets, or to select small portions of a huge dataset for testing, development or prototyping.

Screenshot of the file size information for a dataset
The Hugging Face Hub can host the large datasets often created for machine learning research.

Note: The Xet team is currently working on a backend update that will increase per-file limits from the current 50 GB to 500 GB while also improving storage and transfer efficiency.

Dataset Viewer

Beyond just hosting your data, the Hub provides powerful tools for exploration. With the Datasets Viewer, users can explore and interact with datasets hosted on the Hub directly in their browser. This provides an easy way for others to view and explore your data without downloading it first.

Hugging Face datasets supports many different modalities (audio, images, video, etc.) and file formats (CSV, JSON, Parquet, etc.), and compression formats (Gzip, Zip, etc.). Check out the Datasets File Formats page for more details.

Screenshot of the Datasets Viewer
The Dataset Viewer for the Infinity-Instruct dataset.

The Datasets Viewer also includes a few features which make it easier to explore a dataset.

Full Text Search

Built-in Full Text Search is one of the most powerful features of the Datasets Viewer. Any text columns in a dataset immediately become searchable.

The Arxiver dataset contains 63.4k rows of arXiv research papers converted to Markdown. By using Full Text Search, it's easy to find the papers containing a specific author such as Ilya Sutskever below.

Sorting

The Datasets Viewer allows you to sort the dataset by clicking on the column headers. This makes it easy to find the most relevant examples in a dataset.

Below is an example of a dataset sorted by the helpfulness column in descending order for the HelpSteer2 dataset.

Third Party Library Support

Hugging Face is fortunate to have third party integrations with the leading open source data tools. By hosting a dataset on the Hub, it instantly makes the dataset compatible with the tools users are most familiar with.

Here are some of the libraries Hugging Face supports out of the box:

Library	Description	Monthly PyPi Downloads (2024)
Pandas	Python data analysis toolkit.	258M
Spark	Real-time, large-scale data processing tool in a distributed environment.	29M
Datasets	🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP).	17M
Dask	Parallel and distributed computing library that scales the existing Python and PyData ecosystem.	12M
Polars	A DataFrame library on top of an OLAP query engine.	8.5M
DuckDB	In-process SQL OLAP database management system.	6M
WebDataset	Library to write I/O pipelines for large datasets.	871K
Argilla	Collaboration tool for AI engineers and domain experts that value high quality data.	400k

Most of these libraries enable you to load or stream a dataset in 1 single line of code.

Here are some examples with Pandas, Polars and DuckDB:

# Pandas example
import pandas as pd
df = pd.read_parquet("hf://datasets/neuralwork/arxiver/data/train.parquet")

# Polars example 
import polars as pl
df = pl.read_parquet("hf://datasets/neuralwork/arxiver/data/train.parquet")

# DuckDB example - SQL queries
import duckdb
duckdb.sql("SELECT * FROM 'hf://datasets/neuralwork/arxiver/data/train.parquet' LIMIT 10")

You can find more information about integrated libraries in the Datasets documentation. Along with the libraries listed above, there are many more community supported tools which support the Hugging Face Hub such as Lilac and Spotlight.

SQL Console

The SQL Console provides an interactive SQL editor that runs entirely in your browser, enabling instant data exploration without any setup. Key features include:

One-Click: Open a SQL Console to query a dataset with a single click
Shareable and Embeddable Results: Share and embed interesting query results
Full DuckDB Syntax: Use full SQL syntax with built-in functions for regex, lists, JSON, embeddings, and more

On every public dataset you should see a new SQL Console badge. With just one click you can open a SQL Console to query that dataset.

Querying the Magpie-Ultra dataset for excellent, high quality reasoning instructions.

Security

While making datasets accessible is important, protecting sensitive data is equally crucial. The Hugging Face Hub provides robust security features to help you maintain control over your data while sharing it with the right audiences.

Access Controls

The Hugging Face Hub supports unique access control options for who has access to the dataset.

Public: Anyone can access the dataset.
Private: Only you and those in your organization can access the dataset.
Gated: Control access to your dataset through two options:
- Automatic Approval: Users must provide required information (like name and email) and agree to terms before gaining access
- Manual Approval: You review and manually approve/reject each access request

For more details about gated datasets, see the gated datasets documentation. For more fine-grained controls, there are Enterprise plan features where organizations can create resource security groups, use SSO, and more.

Built-in Security Scanning

Along with access controls, the Hugging Face Hub offers several security scanners:

Feature	Description
Malware Scanning	Scans files for malware and suspicious content at each commit and visit
Secrets Scanning	Blocks datasets with hardcoded secrets and environment variables
Pickle Scanning	Scans pickle files and shows vetted imports for PyTorch weights
ProtectAI	Uses Guardian tech to block datasets with pickle, Keras and other exploits

Security scanner status banner showing various security checks — To learn more about these scanners, see the security scanners documentation.

Reach and Visibility

Having a secure platform with powerful features is valuable, but the true impact of research comes from reaching the right audience. Reach and visibility are crucial for researchers sharing datasets - it helps maximize research impact, enables reproducibility, facilitates collaboration, and ensures valuable data can benefit the broader scientific community.

With over 5M builders actively using the platform, the Hugging Face Hub provides researchers with powerful tools for community engagement and visibility. Here's what you can expect:

Better Community Engagement

Built-in discussion tabs for each dataset for community engagement
Organizations as a centralized place for grouping and collaborating on multiple datasets
Metrics for dataset usage and impact

Wider Reach

Access to a large, active community of researchers, developers, and practitioners
SEO-optimized URLs making your dataset easily discoverable
Integration with the broader ecosystem of models, datasets, and libraries
Clear links between your dataset and related models, papers, and demos

Improved Documentation

Customizable README files for comprehensive documentation
Support for detailed dataset descriptions and proper academic citations
Links to related research papers and publications

Screenshot of a discussion for a dataset on the Hub.
The Hub makes it easy to ask questions and discuss datasets.

How can I host my dataset on the Hugging Face Hub?

Now that you understand the benefits of hosting your dataset on the Hub, you might be wondering how to get started. Here are some comprehensive resources to guide you through the process:

General guidance on creating and sharing datasets on the Hub
Guides for particular modalities:
- Creating an audio dataset
- Creating an image dataset
- Creating a video dataset
Guidance on structuring your repository so a dataset can be automatically loaded from the Hub.

The following pages will be useful if you want to share large datasets:

Repository limitations and recommendations provides general guidance on some of the considerations you'll want to make when sharing large datasets.
The Tips and tricks for large uploads page provides some guidance on how to upload large datasets to the Hub.

If you want any further help uploading a dataset to the Hub or want to upload a particularly large dataset, please contact datasets@huggingface.co.

Evaluating Audio Reasoning with Big Bench Audio

By December 20, 2024 guest • 8

Finally, a Replacement for BERT: Introducing ModernBERT

By December 19, 2024 guest • 289

Upvote