Share a dataset to the Hub

You are viewing version v2.7.0. A newer version, v2.18.0, is available.

The Hub is home to an extensive collection of community-curated and popular research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away!

Start by creating a Hugging Face Hub account if you don’t have one yet.

Upload with the Hub UI

The Hub’s web-based interface allows users without any developer experience to upload a dataset.

Create a repository

A repository hosts all your dataset files along with their revision history, which makes it possible to store more than one version of a dataset.

  1. Click on your profile and select New Dataset to create a new dataset repository.
  2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.

Upload dataset

  1. Once you’ve created a repository, navigate to the Files and versions tab to add a file. Select Add file to upload your dataset files. We currently support the following data formats: CSV, JSON, JSON lines, text, and Parquet. For this tutorial, you can use the following sample CSV files: train.csv, test.csv.
  2. Drag and drop your dataset files and add a brief descriptive commit message.
  3. Once the upload finishes, your dataset files are stored in your dataset repository.
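
Before dragging files into the Hub UI, it can help to check locally that each file is in one of the formats listed in step 1. The following is a minimal sketch: the extension set and the helper name `unsupported_files` are assumptions made for illustration and are not part of the Hub or 🤗 Datasets.

```python
from pathlib import Path

# Assumed mapping of the formats named above (CSV, JSON, JSON lines, text,
# Parquet) to their usual file extensions. Adjust it to your own files.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".jsonl", ".txt", ".parquet"}


def unsupported_files(paths):
    """Return the paths whose extension is not in SUPPORTED_EXTENSIONS."""
    return [p for p in paths if Path(p).suffix.lower() not in SUPPORTED_EXTENSIONS]


# "notes.docx" is a made-up file name used only to show a rejected extension.
candidates = ["train.csv", "test.csv", "notes.docx"]
print(unsupported_files(candidates))  # ['notes.docx']
```

An empty list means every candidate file has one of the assumed extensions.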

Create a Dataset card

Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly.

  1. Click on Create Dataset Card to create a Dataset card. This button creates a README.md file in your repository.
  2. Feel free to copy this Dataset card template to help you fill out all the relevant fields.
  3. The Dataset card uses structured tags to help users discover your dataset on the Hub. Use the Dataset Tagger to help you generate the appropriate tags.
  4. Copy the generated tags, paste them at the top of your Dataset card, and then commit your changes.

For a detailed example of what a good Dataset card should look like, take a look at the CNN DailyMail Dataset card.
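
The tags at the top of a Dataset card form a YAML block enclosed between two `---` lines, followed by the Markdown body of the README.md. The sketch below assembles such a file as a string. The helper `build_dataset_card` and all tag values shown are placeholders chosen for illustration; use the tags generated for your own dataset instead.

```python
def build_dataset_card(tags, body):
    """Return README.md text: a YAML tag block between '---' lines, then the body."""
    lines = ["---"]
    for key, value in tags.items():
        if isinstance(value, list):
            # List-valued tags are written as a YAML sequence.
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body


# Placeholder values; replace them with the tags for your dataset.
example_tags = {
    "language": ["en"],
    "license": "mit",
    "task_categories": ["text-classification"],
    "pretty_name": "Demo",
}
card = build_dataset_card(example_tags, "# Dataset Card for Demo\n")
print(card)
```

The resulting text can be pasted into the README.md editor on the Hub before committing.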

Load dataset

Once your dataset is stored on the Hub, anyone can load it with the load_dataset() function:

>>> from datasets import load_dataset

>>> dataset = load_dataset("stevhliu/demo")
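
When no split is requested, load_dataset() returns a dictionary-like object keyed by split name, and calling len() on a split gives its number of rows. A small sketch of a sanity check after loading follows. The helper `split_sizes` is hypothetical, and a plain dictionary of lists stands in for the loaded dataset so the example runs without network access.

```python
def split_sizes(dataset_dict):
    """Map each split name to its number of rows."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Stand-in for the object returned by load_dataset(); with the real library
# you would pass that object to split_sizes() directly.
stand_in = {
    "train": [{"text": "a"}, {"text": "b"}, {"text": "c"}],
    "test": [{"text": "d"}],
}
print(split_sizes(stand_in))  # {'train': 3, 'test': 1}
```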

Upload with Python

Users who prefer to upload a dataset programmatically can use the huggingface_hub library. This library allows users to interact with the Hub from Python.

  1. Begin by installing the library:
pip install huggingface_hub
  2. To upload a dataset to the Hub from Python, you need to log in to your Hugging Face account:
huggingface-cli login
  3. Use the push_to_hub() function to help you add, commit, and push a file to your repository:
>>> from datasets import load_dataset

>>> dataset = load_dataset("stevhliu/demo")
# dataset = dataset.map(...)  # do all your processing here
>>> dataset.push_to_hub("stevhliu/processed_demo")

To set your dataset as private, set the private parameter to True. This parameter will only work if you are creating a repository for the first time.

>>> dataset.push_to_hub("stevhliu/private_processed_demo", private=True)
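
If you would rather upload raw files such as train.csv and test.csv without loading them first, the huggingface_hub library can also do this directly. The sketch below uses `HfApi.create_repo` and `HfApi.upload_file`. The wrapper `upload_dataset_files` and the repository name in the usage comment are hypothetical, and the code assumes you have already logged in with huggingface-cli login.

```python
from pathlib import Path


def upload_dataset_files(repo_id, local_files, private=False):
    """Create the dataset repository if needed, then upload each file to its root."""
    # Imported lazily so the module loads even where huggingface_hub is absent.
    from huggingface_hub import HfApi

    api = HfApi()
    # exist_ok=True avoids an error when the repository already exists;
    # `private` only matters when the repository is first created.
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=private, exist_ok=True)
    for local_file in local_files:
        api.upload_file(
            path_or_fileobj=str(local_file),
            path_in_repo=Path(local_file).name,
            repo_id=repo_id,
            repo_type="dataset",
        )


# Example call (placeholder repository name, requires network access and login):
# upload_dataset_files("your-username/demo", ["train.csv", "test.csv"])
```

Unlike push_to_hub(), this approach uploads the files exactly as they are on disk.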

Privacy

A private dataset is accessible only to you. If the private dataset belongs to an organization, members of that organization can access it as well.

Load a private dataset by providing your authentication token to the use_auth_token parameter:

>>> from datasets import load_dataset

# Load a private individual dataset
>>> dataset = load_dataset("stevhliu/demo", use_auth_token=True)

# Load a private organization dataset
>>> dataset = load_dataset("organization/dataset_name", use_auth_token=True)
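
The use_auth_token parameter accepts either True, which uses the token saved by huggingface-cli login, or the token itself as a string. The sketch below picks between the two. The environment variable name `HF_TOKEN` and both helper functions are assumptions made for this example.

```python
import os


def resolve_auth_token(environ=os.environ):
    """Return the token from HF_TOKEN if set, otherwise True (use the saved login)."""
    token = environ.get("HF_TOKEN")  # assumed variable name for this sketch
    return token if token else True


def load_private(repo_id):
    """Load a private dataset with whichever credential is available."""
    # Imported lazily so the helper above can be used without datasets installed.
    from datasets import load_dataset

    return load_dataset(repo_id, use_auth_token=resolve_auth_token())


# Example call (requires network access and read permission on the repository):
# dataset = load_private("organization/dataset_name")
```

Reading the token from the environment keeps it out of your source code, which is useful on machines where an interactive login is not practical.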

What's next?

Congratulations, you’ve completed the tutorials! 🥳

From here, you can go on to:

If you have any questions about 🤗 Datasets, feel free to join and ask the community on our forum.