The Hub is home to an extensive collection of community-curated and popular research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away!
Start by creating a Hugging Face Hub account if you don’t have one yet.
The Hub’s web-based interface allows users without any developer experience to upload a dataset.
A repository hosts all your dataset files, including the revision history, which makes it possible to store more than one version of a dataset.
- Click on your profile and select New Dataset to create a new dataset repository.
- Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization.
- Once you’ve created a repository, navigate to the Files and versions tab to add a file. Select Add file to upload your dataset files. We currently support the following data formats: CSV, JSON, JSON lines, text, and Parquet. For this tutorial, you can use the following sample CSV files: train.csv, test.csv.
- Drag and drop your dataset files and add a brief descriptive commit message.
- After you upload your dataset files, they are stored in your dataset repository.
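Before dragging files into the browser, it can help to confirm that each one is in a supported format. The snippet below is a minimal sketch: the `SUPPORTED_EXTENSIONS` mapping and the `unsupported_files` helper are hypothetical names, and the file extensions are an assumed reading of the formats listed above, not an official list from the Hub.

```python
from pathlib import Path

# Assumed mapping from file extension to the formats named in this tutorial.
# The extensions are illustrative; the Hub does not publish them in this form.
SUPPORTED_EXTENSIONS = {
    ".csv": "CSV",
    ".json": "JSON",
    ".jsonl": "JSON lines",
    ".txt": "text",
    ".parquet": "Parquet",
}


def unsupported_files(filenames):
    """Return the filenames whose extension is not in SUPPORTED_EXTENSIONS."""
    return [
        name
        for name in filenames
        if Path(name).suffix.lower() not in SUPPORTED_EXTENSIONS
    ]


print(unsupported_files(["train.csv", "test.csv", "notes.docx"]))
# prints ['notes.docx']
```

An empty list means every file matched one of the assumed extensions.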
Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly.
- Click on Create Dataset Card to create a Dataset card. This button creates a README.md file in your repository.
Feel free to copy this Dataset card template to help you fill out all the relevant fields.
The Dataset card uses structured tags to help users discover your dataset on the Hub. Use the Dataset Tagger to help you generate the appropriate tags.
Copy the generated tags, paste them at the top of your Dataset card, and then commit your changes.
For a detailed example of what a good Dataset card should look like, take a look at the CNN DailyMail Dataset card.
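To make "paste them at the top of your Dataset card" concrete, here is a rough sketch of how a block of tags sits above the card text as YAML metadata between two `---` lines. The `card_with_tags` helper and the example values are hypothetical. In practice the tags should come from the Dataset Tagger rather than being written by hand.

```python
def card_with_tags(tags, body):
    """Prepend a YAML metadata block (delimited by ---) to the card text."""
    lines = ["---"]
    for key, value in tags.items():
        if isinstance(value, (list, tuple)):
            # Lists are written as one "- item" line per entry.
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body


# Hypothetical example values, for illustration only.
example_tags = {
    "language": ["en"],
    "license": "mit",
    "task_categories": ["text-classification"],
    "pretty_name": "Demo",
}

readme = card_with_tags(example_tags, "# Dataset Card for Demo\n")
print(readme)
```

The resulting text is what you would commit as README.md, with the metadata block first and the human-readable card below it.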
Once your dataset is stored on the Hub, anyone can load it with the load_dataset() function:
    from datasets import load_dataset

    dataset = load_dataset("stevhliu/demo")
Users who prefer to upload a dataset programmatically can use the huggingface_hub library. This library allows users to interact with the Hub from Python.
- Begin by installing the library:
pip install huggingface_hub
- To upload a dataset on the Hub in Python, you need to log in to your Hugging Face account:
huggingface-cli login
- Use the push_to_hub() function to help you add, commit, and push a file to your repository:
    from datasets import load_dataset

    dataset = load_dataset("stevhliu/demo")
    # dataset = dataset.map(...)  # do all your processing here
    dataset.push_to_hub("stevhliu/processed_demo")
To set your dataset as private, set the private parameter to True. This parameter will only work if you are creating a repository for the first time.
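As a small illustration of the private parameter, the sketch below wraps the call in a helper. The `push_private` name is hypothetical, and the helper only assumes that the object passed in exposes `push_to_hub(repo_id, private=...)`, as the dataset objects in this tutorial do.

```python
def push_private(dataset, repo_id):
    """Push `dataset` to `repo_id`, asking for the repository to be private."""
    # private=True only takes effect if this call creates the repository.
    return dataset.push_to_hub(repo_id, private=True)


# Usage (requires being logged in and network access):
# from datasets import load_dataset
# dataset = load_dataset("stevhliu/demo")
# push_private(dataset, "stevhliu/processed_demo")
```

If the repository already exists, its visibility is left unchanged by this call.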
A private dataset is only accessible by you. Similarly, if you share a dataset within your organization, then members of the organization can also access the dataset.
Load a private dataset by providing your authentication token to the use_auth_token parameter:
    from datasets import load_dataset

    # Load a private individual dataset
    dataset = load_dataset("stevhliu/demo", use_auth_token=True)

    # Load a private organization dataset
    dataset = load_dataset("organization/dataset_name", use_auth_token=True)
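Depending on which release of 🤗 Datasets you have installed, the authentication argument may have a different name. The sketch below assumes that use_auth_token was deprecated in favor of token starting with version 2.14. That cutoff is an assumption drawn from the library's deprecation notes, so check the documentation for your installed version. The `auth_kwargs` helper is hypothetical.

```python
def auth_kwargs(datasets_version):
    """Pick the authentication keyword for a given datasets version string."""
    # Only the major and minor components are compared, e.g. "2.14.0" -> (2, 14).
    major, minor = (int(part) for part in datasets_version.split(".")[:2])
    if (major, minor) >= (2, 14):  # assumed cutoff, see the note above
        return {"token": True}
    return {"use_auth_token": True}


print(auth_kwargs("2.10.1"))  # prints {'use_auth_token': True}
print(auth_kwargs("2.14.0"))  # prints {'token': True}
```

You could then call `load_dataset("stevhliu/demo", **auth_kwargs(datasets.__version__))` so the same script works across versions.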
Congratulations, you’ve completed the tutorials! 🥳
From here, you can go on to:
- Learn more about how to use other 🤗 Datasets functions to process your dataset.
- Stream large datasets and avoid waiting for the entire dataset to download.
- Write a dataset loading script and share your dataset with the community.
If you have any questions about 🤗 Datasets, feel free to join and ask the community on our forum.