Share¶
At Hugging Face, we are on a mission to democratize NLP and we believe in the value of open source. That’s why we designed 🤗 Datasets so that anyone can share a dataset with the greater NLP community. There are currently over 900 datasets in over 100 languages on the Hugging Face Hub, and the Hugging Face team always welcomes new contributions!
This guide will show you how to share a dataset that can be easily accessed by anyone.
There are two options to share a new dataset:
Directly upload it to the Hub as a community-provided dataset.
Add it as a canonical dataset by opening a pull request on the GitHub repository for 🤗 Datasets.
Community vs. canonical¶
Both options offer the same features such as:
Dataset versioning
Commit history and diffs
Metadata for discoverability
Dataset cards for documentation, licensing, limitations, etc.
The main differences between the two are highlighted in the table below:
| Community datasets | Canonical datasets |
|---|---|
| Faster to share, no review process. | Slower to add, needs to be reviewed. |
| Data files can be stored on the Hub. | Data files are typically retrieved from the original host URLs. |
| Identified by a user or organization namespace like thomwolf/my_dataset or huggingface/our_dataset. | Identified by a root namespace. Need to select a short name that is available. |
| Requires data files and/or a dataset loading script. | Always requires a dataset loading script. |
| Flagged as unsafe because the dataset may contain executable code. | Flagged as safe because the dataset has been reviewed. |
For community datasets, if your dataset is in a supported format, you can skip ahead to learn how to upload your files and add a dataset card. There is no need to write your own dataset loading script (unless you want more control over how your dataset is loaded). However, if the dataset isn’t in one of the supported formats, you will need to write a dataset loading script. The dataset loading script is a Python script that defines the dataset splits, feature types, and how to download and process the data.
On the other hand, a dataset script is always required for canonical datasets.
Important
The distinction between a canonical and community dataset is based solely on the selected sharing workflow. It does not involve any ranking, decision, or opinion regarding the contents of the dataset itself.
Add a community dataset¶
You can share your dataset with the community with a dataset repository on the Hugging Face Hub. In a dataset repository, you can host your data files, use a dataset script, or both.
The dataset script is optional if your dataset is in one of the following formats: CSV, JSON, JSON Lines, text, or Parquet.
The script also supports many kinds of compressed file types such as GZ, BZ2, LZ4, LZMA, or ZSTD. For example, your dataset can be made of .json.gz files.
On the other hand, if your dataset is not in a supported format or if you want more control over how your dataset is loaded, you can write your own dataset script.
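As a rough illustration, a minimal loading script might look like the sketch below (the class name, URL, and field names are hypothetical; see the dataset script page for the full reference):

import json

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    """Toy example: a single JSON Lines file with "text" and "label" fields."""

    def _info(self):
        return datasets.DatasetInfo(
            description="Toy dataset illustrating the structure of a loading script.",
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["negative", "positive"]),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # Placeholder URL: point this at wherever your data is actually hosted.
        path = dl_manager.download_and_extract("https://example.com/my_dataset.jsonl.gz")
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path}),
        ]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs; keys must be unique within a split.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                record = json.loads(line)
                yield idx, {"text": record["text"], "label": record["label"]}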
When loading a dataset from the Hub:
If there’s no dataset script, all the files in the supported formats are loaded.
If there’s a dataset script, it is downloaded and executed to download and prepare the dataset.
For more information on how to load a dataset from the Hub, see how to load from the Hugging Face Hub.
Create the repository¶
Sharing a community dataset will require you to create an account on hf.co if you don’t have one yet. You can directly create a new dataset repository from your account on the Hugging Face Hub, but this guide will show you how to upload a dataset from the terminal.
Make sure you are in the virtual environment where you installed 🤗 Datasets, and run the following command:
huggingface-cli login
Log in using your Hugging Face Hub credentials, and create a new dataset repository:
huggingface-cli repo create your_dataset_name --type dataset
Add the --organization flag to create a repository under a specific organization:
huggingface-cli repo create your_dataset_name --type dataset --organization your-org-name
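If you prefer Python over the CLI, the same repository can be created with the huggingface_hub library (a brief sketch, assuming a reasonably recent version of huggingface_hub; the names are illustrative):

from huggingface_hub import create_repo

# Create a dataset repository under your own namespace
create_repo("your_dataset_name", repo_type="dataset")
# Or under an organization namespace instead:
# create_repo("your-org-name/your_dataset_name", repo_type="dataset")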
Clone the repository¶
Install Git LFS and clone your repository:
# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install
git clone https://huggingface.co/datasets/namespace/your_dataset_name
Here the namespace is either your username or your organization name.
Prepare your files¶
Now is a good time to check your directory to ensure the only files you’re uploading are:
README.md is a dataset card that describes the dataset’s contents, creation, and usage. To write a dataset card, see the dataset card page.
The raw data files of the dataset (optional; if they are hosted elsewhere, you can specify the URLs in the dataset script).
your_dataset_name.py is your dataset loading script (optional if your data files are already in one of the supported formats: csv/jsonl/json/parquet/txt). To create a dataset script, see the dataset script page.
dataset_infos.json contains metadata about the dataset (required only if you have a dataset script).
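For example, a repository that ships both data files and a loading script might look like this (the file names are illustrative):

your_dataset_name/
├── README.md              # dataset card
├── data.jsonl.gz          # raw data files (optional if hosted elsewhere)
├── your_dataset_name.py   # dataset loading script (optional for supported formats)
└── dataset_infos.json     # metadata, required only with a loading script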
Upload your files¶
You can directly upload your files from your repository on the Hugging Face Hub, but this guide will show you how to upload the files from the terminal.
It is important to add the large data files first with git lfs track, or else you will encounter an error later when you push your files:
cp /somewhere/data/*.json .
git lfs track "*.json"
git add .gitattributes
git add *.json
git commit -m "add json files"
Add the dataset loading script and metadata file:
cp /somewhere/data/dataset_infos.json .
cp /somewhere/data/load_script.py .
git add --all
Verify the files have been correctly staged. Then you can commit and push your files:
git status
git commit -m "First version of the your_dataset_name dataset."
git push
Congratulations, your dataset has now been uploaded to the Hugging Face Hub where anyone can load it in a single line of code! 🥳
dataset = load_dataset("namespace/your_dataset_name")
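From there you can inspect the splits and examples as usual (a short sketch; the available split names depend on your data):

from datasets import load_dataset

dataset = load_dataset("namespace/your_dataset_name")
print(dataset)              # DatasetDict with the available splits
print(dataset["train"][0])  # first example of the train split, if one exists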
Add a canonical dataset¶
Canonical datasets are dataset scripts hosted in the GitHub repository of the 🤗 Datasets library. The code of these datasets is reviewed by the Hugging Face team, and they require test data in order to be regularly tested.
Clone the repository¶
To share a canonical dataset:
Fork the 🤗 Datasets repository by clicking on the Fork button.
Clone your fork to your local disk, and add the base repository as a remote:
git clone https://github.com/<your_GitHub_handle>/datasets
cd datasets
git remote add upstream https://github.com/huggingface/datasets.git
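Optionally, sync your fork with the base repository before branching (a standard Git workflow; replace master with main if that is the repository’s default branch):

git fetch upstream
git rebase upstream/master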
Prepare your files¶
Create a new branch to hold your changes. You can name the new branch using the short name of your dataset:
git checkout -b my-new-dataset
Set up a development environment by running the following command in a virtual environment:
pip install -e ".[dev]"
Create a new folder with the dataset name inside the datasets folder of the repository, and add the dataset loading script. To create a dataset script, see the dataset script page.
Check your directory to ensure the only files you’re adding are:
README.md is a dataset card that describes the dataset’s contents, creation, and usage. To write a dataset card, see the dataset card page.
your_dataset_name.py is your dataset loading script.
dataset_infos.json contains metadata about the dataset.
A dummy folder with dummy_data.zip files that hold a small subset of data from the dataset for tests and preview.
Make sure your code is properly formatted and passes the quality checks:
make style
make quality
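Before committing, it can also help to check that the script runs end to end by pointing load_dataset at your new loading script (a quick local check; the folder and script names are illustrative):

from datasets import load_dataset

# Load directly from the local loading script in your fork
dataset = load_dataset("datasets/my_new_dataset/my_new_dataset.py")
print(dataset)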
Add your changes, and make a commit to record your changes locally. Then you can push the changes to your account:
git add datasets/<my-new-dataset>
git commit
git push -u origin my-new-dataset
Go back to your fork on GitHub, and click on Pull request to open a pull request on the main 🤗 Datasets repository for review.