Structure your repository

To host and share your dataset, you can create a dataset repository on the Hugging Face Dataset Hub and upload your data files.

This guide will show you how to structure your dataset repository when you upload it. A dataset with a supported structure can be loaded automatically with load_dataset(), and it will have a preview on its dataset page on the Hub.

Note that you can also include a python script to define your dataset, for more flexibility.

The following examples use CSV files, but you can use any supported format (text, JSON, JSON Lines, CSV, Parquet).

Main use-case

The simplest dataset structure has two files: train.csv and test.csv.

Your repository will also contain a README.md file, the dataset card displayed on your dataset page.

my_dataset_repository/
├── README.md
├── train.csv
└── test.csv

Splits and file names

🤗 Datasets automatically infers the train/validation/test splits of your dataset from the file names. All the files that contain train in their names are considered part of the train split. The same idea applies to the test and validation split:

All the files that contain test in their names are considered part of the test split.
All the files that contain valid in their names are considered part of the validation split.

Here is an example where all the files are placed into a directory named data:

my_dataset_repository/
├── README.md
└── data/
    ├── train.csv
    ├── test.csv
    └── valid.csv

Multiple files per split

If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train/validation/ test split from the file name. For example, if your train and test splits span several files:

my_dataset_repository/
├── README.md
├── train_0.csv
├── train_1.csv
├── train_2.csv
├── train_3.csv
├── test_0.csv
└── test_1.csv

Just make sure that all the files of your train set have train in their names (same for test and validation). It doesn’t matter if you add a prefix or suffix to train in the file name (like my_train_file_00001.csv, for example). 🤗 Datasets can still infer the appropriate split.

For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.

my_dataset_repository/
├── README.md
└── data/
    ├── train/
    │   ├── shard_0.csv
    │   ├── shard_1.csv
    │   ├── shard_2.csv
    │   └── shard_3.csv
    └── test/
        ├── shard_0.csv
        └── shard_1.csv

Custom split names

If you have other data files in addition to the traditional train/validation/test sets, you must use the following structure. Follow the file name format exactly for this type of structure: data/<split_name>-xxxxx-of-xxxxx.csv. Here is an example with three splits: train, test, and random:

my_dataset_repository/
├── README.md
└── data/
    ├── train-00000-of-00003.csv
    ├── train-00001-of-00003.csv
    ├── train-00002-of-00003.csv
    ├── test-00000-of-00001.csv
    ├── random-00000-of-00003.csv
    ├── random-00001-of-00003.csv
    └── random-00002-of-00003.csv

Multiple configuration (WIP)

You can specify different configurations of your dataset (for example, if a dataset contains different languages) with one directory per configuration.

These structures are not supported yet, but are a work in progress:

my_dataset_repository/
├── README.md
├── en/
│   ├── train.csv
│   └── test.csv
└── fr/
    ├── train.csv
    └── test.csv

Or with one directory per split:

my_dataset_repository/
├── README.md
├── en/
│   ├── train/
│   │   ├── shard_0.csv
│   │   └── shard_1.csv
│   └── test/
│       ├── shard_0.csv
│       └── shard_1.csv
└── fr/
    ├── train/
    │   ├── shard_0.csv
    │   └── shard_1.csv
    └── test/
        ├── shard_0.csv
        └── shard_1.csv